DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Patrick Damme; Matthias Boehm; Mark Dokter; Kevin Innerebner; Roman Kern

Publication: Conference contribution › Paper › Peer-reviewed

Abstract

Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet the programming paradigms, cluster resource management, data formats and representations, and execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage, aimed at increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.

Original language	English
Number of pages	12
Publication status	Published - 2022
Event	12th Conference on Innovative Data Systems Research: CIDR 2022 - Hybrid event, United States. Duration: 9 Jan 2022 → 12 Jan 2022

Conference

Conference	12th Conference on Innovative Data Systems Research
Short title	CIDR 2022
Country/Territory	United States
Location	Hybrid event
Period	9/01/22 → 12/01/22

Access to document

https://www.cidrdb.org/cidr2022/papers/p4-damme.pdf
License: CC BY 4.0

Cite this

@conference{6746d882fe3348e4891d3eed3c010275,

title = "DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines",

abstract = "Integrated data analysis (IDA) pipelines---that combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring---become increasingly common in practice. Interestingly, systems of these areas share many compilation and runtime techniques, and the used---increasingly heterogeneous---hardware infrastructure converges as well. Yet, the programming paradigms, cluster resource management, data formats and representations, as well as execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines, describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results",

author = "Patrick Damme and Matthias Boehm and Mark Dokter and Kevin Innerebner and Roman Kern",

year = "2022",

language = "English",

note = "12th Conference on Innovative Data Systems Research : CIDR 2022, CIDR 2022 ; Conference date: 09-01-2022 Through 12-01-2022",

}

TY - CONF

T1 - DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

AU - Damme, Patrick

AU - Boehm, Matthias

AU - Dokter, Mark

AU - Innerebner, Kevin

AU - Kern, Roman

PY - 2022

Y1 - 2022

AB - Integrated data analysis (IDA) pipelines---that combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring---become increasingly common in practice. Interestingly, systems of these areas share many compilation and runtime techniques, and the used---increasingly heterogeneous---hardware infrastructure converges as well. Yet, the programming paradigms, cluster resource management, data formats and representations, as well as execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines, describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results

M3 - Paper

T2 - 12th Conference on Innovative Data Systems Research

Y2 - 9 January 2022 through 12 January 2022

ER -