UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads

Arnab Phani, Lukas Erlbacher, Matthias Boehm

Publication: Contribution to a journal › Conference article › Peer-reviewed

Abstract

Data science pipelines are typically exploratory. A central task of such pipelines is feature transformation, which converts raw data into numerical matrices or tensors for training or scoring. There exists a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead with static parallelization schemes and by interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.
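To make the cost structure concrete, the following is a minimal sketch of a "recode" feature transformation of the kind the abstract refers to: a build pass that creates a dictionary of distinct string items, followed by an apply pass that encodes raw values as integers. All names are illustrative and do not reflect UPLIFT's actual API.

```python
def build_dictionary(column):
    """Build pass: collect distinct items and assign integer codes.

    Dictionary creation like this dominates cost on columns with
    many distinct items, which is why parallelization strategies
    must adapt to data characteristics.
    """
    codes = {}
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
    return codes


def apply_recode(column, codes):
    """Apply pass: replace each raw string with its numeric code."""
    return [codes[value] for value in column]


raw = ["red", "blue", "red", "green"]
dictionary = build_dictionary(raw)   # {"red": 0, "blue": 1, "green": 2}
encoded = apply_recode(raw, dictionary)  # [0, 1, 0, 2]
```

Because the build and apply passes touch every row, multi-pass transformations like this are a natural target for the fine-grained, data-characteristic-aware parallelization the paper proposes.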

Original language: English
Pages (from - to): 2929-2938
Number of pages: 10
Journal: Proceedings of the VLDB Endowment
Volume: 15
Issue number: 11
DOIs
Publication status: Published - 2022
Event: 48th International Conference on Very Large Data Bases, VLDB 2022 - Sydney, Australia
Duration: 5 Sept. 2022 - 9 Sept. 2022

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
