ExDRa - Exploratory Data Science over Raw Data

Project: Research project

Description

Machine learning (ML) applications based on large data are increasingly applied in the enterprise to improve the value chain and gain competitive advantage. In contrast to traditional ML, however, the objectives are under-specified, allow for different types of analysis, and can leverage a wide variety of heterogeneous, distributed, and partially inaccessible data sources. Therefore, the typical data science process in the enterprise is exploratory: data scientists investigate hypotheses, integrate the necessary data, run different analytics, and look for interesting patterns and models. Since the added value is unknown in advance, very little investment is made in the systematic acquisition, integration, and preprocessing of data. This lack of infrastructure results in redundant manual steps and inefficient computation. Furthermore, central consolidation is not always technically or economically desirable, or even possible (e.g., for sensitive personal data). These scenarios share the need for federated execution and dedicated elimination of redundancy. The basic idea of the ExDRa project is to investigate suitable systems support for this exploratory data science process over heterogeneous and distributed raw data sources, showcased in a demonstrator for practical applications.

In detail, this approach entails the following research aspects:
(1) ad-hoc and federated data integration over raw data,
(2) data organization and reuse of intermediates,
(3) horizontal optimization over the entire data science lifecycle, and
(4) query planning for partially accessible data.

Use cases come from the process industry. In this context, there are large amounts of data, distributed over locations and appliances, whose consolidation is technically, economically, and legally limited. The overall goal leads to four research goals. First, data integration, data processing, and analysis over raw data need to be enabled via a suitable declarative specification of data sources and preprocessing steps, as well as efficient primitives for local and federated computation. In the context of exploratory data science, this requires sampling and incremental maintenance. Second, unnecessary redundancy and inefficiency of repeated computations need to be addressed via dedicated techniques for data organization and reuse. The high communication overhead of federated analysis could further benefit from leveraging compression techniques and the performance-accuracy tradeoff. Third, we aim to improve the understanding of exploratory analysis results and simplify future analysis via systematic model management and optimization of experiments. Fourth, federated computation is an essential part of exploratory analysis over raw data. Accordingly, we intend to investigate system architectures, as well as query optimization and processing. To provide evidence of practical relevance, all results will be integrated and evaluated as part of a software demonstrator.

Status: Active

Effective start/end date: 1/06/19 to 31/05/22