PySpark and RDKit: Moving towards Big Data in Cheminformatics

Mario Lovric; Roman Kern; Jose Molero

doi:10.1002/minf.201800082

Publication: Contribution to journal › Article › peer-review

Abstract

The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations with only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed on current low‐end workstations.
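The fingerprint-similarity use case named in the abstract can be illustrated with a minimal, library-free sketch. This is not the authors' implementation: the abstract does not state which similarity metric was used, so the Tanimoto coefficient here is an assumption, fingerprints are modeled as plain sets of on-bit indices rather than RDKit objects, and the built-in `map` and `filter` only stand in for the corresponding RDD transformations. The molecule names and bit sets are hypothetical toy data.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is modeled as the set of its on-bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(
    library: Dict[str, Fingerprint], query: Fingerprint, threshold: float
) -> List[Tuple[str, float]]:
    """Score every library entry against the query and keep the hits.

    The map step scores each record independently and the filter step
    drops records below the threshold, which is the shape of work that
    can be spread across partitions in a distributed setting.
    """
    scored = map(lambda item: (item[0], tanimoto(item[1], query)), library.items())
    hits = filter(lambda pair: pair[1] >= threshold, scored)
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy library; real fingerprints would have far more bits.
library = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 5}),
    "mol_c": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

hits = similarity_search(library, query, threshold=0.5)
print(hits)
```

Because each record is scored without reference to any other, the same two steps would plausibly map onto `rdd.map(...)` followed by `rdd.filter(...)` in PySpark, with the query fingerprint shipped to the workers.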

Original language	English
Article number	1800082
Number of pages	4
Journal	Molecular Informatics
Volume	38
Issue number	6
DOIs	https://doi.org/10.1002/minf.201800082
Publication status	Published - 7 Mar 2019

Access to document

10.1002/minf.201800082

Cite this

@article{0cdc77618f5947d096623b3a36f199a0,
title = "PySpark and RDKit: Moving towards Big Data in Cheminformatics",
abstract = "The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations with only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed on current low‐end workstations.",
author = "Mario Lovric and Roman Kern and Jose Molero",
year = "2019",
month = mar,
day = "7",
doi = "10.1002/minf.201800082",
language = "English",
volume = "38",
journal = "Molecular Informatics",
issn = "1868-1751",
publisher = "Wiley",
number = "6",
}

TY  - JOUR
T1  - PySpark and RDKit: Moving towards Big Data in Cheminformatics
AU  - Lovric, Mario
AU  - Kern, Roman
AU  - Molero, Jose
PY  - 2019/3/7
Y1  - 2019/3/7
AB  - The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations with only basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed on current low‐end workstations.
U2  - 10.1002/minf.201800082
DO  - 10.1002/minf.201800082
M3  - Article
SN  - 1868-1751
VL  - 38
JO  - Molecular Informatics
JF  - Molecular Informatics
IS  - 6
M1  - 1800082
ER  -
