Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures

Daniel Mlakar; Martin Winter; Mathias Parger; Markus Steinberger

doi:10.1109/IPDPS49936.2021.00080

Institute of Computer Graphics and Vision (7100)

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix-vector multiplication in iterative solvers. To compute a bandwidth-reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threaded performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU is a viable option. Nevertheless, reordering matrices, potentially multiple times between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.

Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU thread or a GPU thread block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to that of the ground-truth single-threaded algorithm.

We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA's single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering in sequences of sparse matrix operations without major performance loss.
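
For orientation, the serial baseline that the abstract calls the ground-truth single-threaded algorithm can be sketched as below. This is a minimal textbook formulation in Python, not the authors' parallel implementation: the function names `rcm` and `bandwidth`, the adjacency-dict input format, the minimum-degree start node, and the tie-breaking by node id are all assumptions made for illustration.

```python
from collections import deque


def bandwidth(adj, perm):
    """Largest |i - j| over all edges after node perm[i] is relabeled to i."""
    pos = {node: i for i, node in enumerate(perm)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def rcm(adj):
    """Serial RCM on an undirected graph given as {node: set_of_neighbors}."""
    degree = {u: len(adj[u]) for u in adj}
    visited = set()
    order = []
    # Assumption: each connected component starts at a minimum-degree node.
    for start in sorted(adj, key=lambda u: (degree[u], u)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            # Newly discovered children are queued in ascending degree order.
            children = sorted(
                (v for v in adj[u] if v not in visited),
                key=lambda v: (degree[v], v),
            )
            visited.update(children)
            queue.extend(children)
    # Reversing the Cuthill-McKee order gives the RCM permutation.
    order.reverse()
    return order


# A path 0 - 4 - 1 - 3 - 2 whose labels are scrambled.
adj = {0: {4}, 4: {0, 1}, 1: {4, 3}, 3: {1, 2}, 2: {3}}
perm = rcm(adj)
print(perm)                                                 # [2, 3, 1, 4, 0]
print(bandwidth(adj, list(range(5))), bandwidth(adj, perm))  # 4 1
```

The breadth-first queue is the serial dependency: the position of each node's children depends on every node written before it, which is the ordering constraint the batched, speculative scheme described above has to preserve.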

Original language	English
Title of host publication	Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
Publisher	Institute of Electrical and Electronics Engineers
Pages	703-713
Number of pages	11
ISBN (Electronic)	9781665440660
DOIs	https://doi.org/10.1109/IPDPS49936.2021.00080
Publication status	Published - May 2021
Event	35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021 - Virtual, Online; Duration: 17 May 2021 → 21 May 2021

Publication series

Name	Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

Conference

Conference	35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
City	Virtual, Online
Period	17/05/21 → 21/05/21

Keywords

CPU
GPU
Many-core
Multicore
Reverse Cuthill-McKee
Scheduling
Work distribution

ASJC Scopus subject areas

Computer Networks and Communications
Hardware and Architecture

Access to Document

10.1109/IPDPS49936.2021.00080

Cite this

Mlakar, D., Winter, M., Parger, M., & Steinberger, M. (2021). Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures. In Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021 (pp. 703-713). Article 9460553 (Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/IPDPS49936.2021.00080

Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures. / Mlakar, Daniel; Winter, Martin; Parger, Mathias et al.
Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021. Institute of Electrical and Electronics Engineers, 2021. p. 703-713 9460553 (Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021).

Mlakar, D, Winter, M, Parger, M & Steinberger, M 2021, Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures. in Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021., 9460553, Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021, Institute of Electrical and Electronics Engineers, pp. 703-713, 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Virtual, Online, 17/05/21. https://doi.org/10.1109/IPDPS49936.2021.00080

Mlakar D, Winter M, Parger M, Steinberger M. Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures. In Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021. Institute of Electrical and Electronics Engineers. 2021. p. 703-713. 9460553. (Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021). doi: 10.1109/IPDPS49936.2021.00080

Mlakar, Daniel ; Winter, Martin ; Parger, Mathias et al. / Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures. Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021. Institute of Electrical and Electronics Engineers, 2021. pp. 703-713 (Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021).

@inproceedings{b5ae8b0d996446898a14427914bf390a,

title = "Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures",

keywords = "CPU, GPU, Many-core, Multicore, Reverse Cuthill-McKee, Scheduling, Work distribution",

author = "Daniel Mlakar and Martin Winter and Mathias Parger and Markus Steinberger",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021 ; Conference date: 17-05-2021 Through 21-05-2021",

year = "2021",

month = may,

doi = "10.1109/IPDPS49936.2021.00080",

language = "English",

series = "Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021",

publisher = "Institute of Electrical and Electronics Engineers",

pages = "703--713",

booktitle = "Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021",

address = "United States",

}

TY - GEN

T1 - Speculative parallel reverse Cuthill-McKee reordering on multi- and many-core architectures

AU - Mlakar, Daniel

AU - Winter, Martin

AU - Parger, Mathias

AU - Steinberger, Markus

PY - 2021/5

Y1 - 2021/5

KW - CPU

KW - GPU

KW - Many-core

KW - Multicore

KW - Reverse Cuthill-McKee

KW - Scheduling

KW - Work distribution

UR - http://www.scopus.com/inward/record.url?scp=85113528312&partnerID=8YFLogxK

U2 - 10.1109/IPDPS49936.2021.00080

DO - 10.1109/IPDPS49936.2021.00080

M3 - Conference paper

AN - SCOPUS:85113528312

T3 - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

SP - 703

EP - 713

BT - Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

PB - Institute of Electrical and Electronics Engineers

T2 - 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021

Y2 - 17 May 2021 through 21 May 2021

ER -
