Speculative parallel reverse cuthill-mckee reordering on multi- And many-core architectures

Daniel Mlakar, Martin Winter, Mathias Parger, Markus Steinberger

Research output: Chapter in Book/Report/Conference proceedingConference paperpeer-review

Abstract

Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU are viable options. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU-thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to the ground-truth single-threaded algorithm.We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA's single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.

Original languageEnglish
Title of host publicationProceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021
PublisherInstitute of Electrical and Electronics Engineers
Pages703-713
Number of pages11
ISBN (Electronic)9781665440660
DOIs
Publication statusPublished - May 2021
Event35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021 - Virtual, Online
Duration: 17 May 202121 May 2021

Publication series

NameProceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

Conference

Conference35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021
CityVirtual, Online
Period17/05/2121/05/21

Keywords

  • CPU
  • GPU
  • Many-core
  • Multicore
  • Reverse Cuthill-McKee
  • Scheduling
  • Work distribution

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture

Fingerprint

Dive into the research topics of 'Speculative parallel reverse cuthill-mckee reordering on multi- And many-core architectures'. Together they form a unique fingerprint.

Cite this