UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads

Arnab Phani; Lukas Erlbacher; Matthias Boehm

doi:10.14778/3551793.3551842

Research output: Contribution to journal › Conference article › peer-review

Abstract

Data science pipelines are typically exploratory. An integral task of such pipelines are feature transformations, which transform raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.

Original language	English
Pages (from-to)	2929-2938
Number of pages	10
Journal	Proceedings of the VLDB Endowment
Volume	15
Issue number	11
DOIs	https://doi.org/10.14778/3551793.3551842
Publication status	Published - 2022
Event	48th International Conference on Very Large Data Bases, VLDB 2022 - Sydney, Australia Duration: 5 Sept 2022 → 9 Sept 2022

ASJC Scopus subject areas

Computer Science (miscellaneous)
General Computer Science

Cite this

@article{e7a4fd681e6140a98a7e18da07f66b74,

title = "UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads",

abstract = "Data science pipelines are typically exploratory. An integral task of such pipelines are feature transformations, which transform raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.",

author = "Arnab Phani and Lukas Erlbacher and Matthias Boehm",

note = "Publisher Copyright: {\textcopyright} 2022, VLDB Endowment. All rights reserved.; 48th International Conference on Very Large Data Bases, VLDB 2022 ; Conference date: 05-09-2022 Through 09-09-2022",

year = "2022",

doi = "10.14778/3551793.3551842",

language = "English",

volume = "15",

pages = "2929--2938",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Association for Computing Machinery",

number = "11",

}

TY - JOUR

T1 - UPLIFT

T2 - 48th International Conference on Very Large Data Bases, VLDB 2022

AU - Phani, Arnab

AU - Erlbacher, Lukas

AU - Boehm, Matthias

PY - 2022

Y1 - 2022

N2 - Data science pipelines are typically exploratory. An integral task of such pipelines are feature transformations, which transform raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.

AB - Data science pipelines are typically exploratory. An integral task of such pipelines are feature transformations, which transform raw data into numerical matrices or tensors for training or scoring. There exist a wide variety of transformations for different data modalities. These feature transformations incur large computational overhead due to expensive string processing and dictionary creation. Existing ML systems address this overhead by static parallelization schemes and interleaving transformations with model training. These approaches show good performance improvements for simple transformations, but struggle to handle different data characteristics (many features/distinct items) and multi-pass transformations. A key observation is that good parallelization strategies for feature transformations depend on data characteristics. In this paper, we introduce UPLIFT, a framework for ParalleLIzing Feature Transformations. UPLIFT constructs a fine-grained task graph for a set of transformations, optimizes the plan according to data characteristics, and executes this plan in a cache-conscious manner. We show that the resulting framework is applicable to a wide range of transformations. Furthermore, we propose the FTBench benchmark with transformations and datasets from various domains. On this benchmark, UPLIFT yields speedups of up to 31.6x (9.27x on average) compared to state-of-the-art ML systems.

UR - http://www.scopus.com/inward/record.url?scp=85138001085&partnerID=8YFLogxK

U2 - 10.14778/3551793.3551842

DO - 10.14778/3551793.3551842

M3 - Conference article

AN - SCOPUS:85138001085

SN - 2150-8097

VL - 15

SP - 2929

EP - 2938

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 11

Y2 - 5 September 2022 through 9 September 2022

ER -
