UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads

Arnab Phani, Lukas Erlbacher, Matthias Boehm

Research output: Contribution to journalConference articlepeer-review

Abstract

Data science pipelines are typically exploratory. An integral task of such pipelines are feature transformations, which transform raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLI zing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.

Original languageEnglish
Pages (from-to)2929-2938
Number of pages10
JournalProceedings of the VLDB Endowment
Volume15
Issue number11
DOIs
Publication statusPublished - 2022
Event48th International Conference on Very Large Data Bases, VLDB 2022 - Sydney, Australia
Duration: 5 Sept 20229 Sept 2022

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science

Fingerprint

Dive into the research topics of 'UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads'. Together they form a unique fingerprint.

Cite this