Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

Lukas Pfeifenberger; Matthias Zöhrer; Wolfgang Roth; Günther Schindler; Holger Fröning; Franz Pernkopf

Institute of Signal Processing and Speech Communication (4420)

Research output: Working paper › Preprint

Abstract

While machine learning techniques are traditionally resource intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.

Original language	English
Number of pages	13
Publication status	Published - 2020

Publication series

Name	arXiv.org e-Print archive
Publisher	Cornell University Library

Access to Document

http://arxiv.org/abs/2007.11477

Cite this

@techreport{30dac6a07d6f442ea63c7477d918c5fb,

title = "Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement",

abstract = "While machine learning techniques are traditionally resource intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.",

author = "Lukas Pfeifenberger and Matthias Z{\"o}hrer and Wolfgang Roth and G{\"u}nther Schindler and Holger Fr{\"o}ning and Franz Pernkopf",

year = "2020",

language = "English",

series = "arXiv.org e-Print archive",

publisher = "Cornell University Library",

type = "WorkingPaper",

institution = "Cornell University Library",

}

TY - UNPB

T1 - Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

AU - Pfeifenberger, Lukas

AU - Zöhrer, Matthias

AU - Roth, Wolfgang

AU - Schindler, Günther

AU - Fröning, Holger

AU - Pernkopf, Franz

PY - 2020

Y1 - 2020

N2 - While machine learning techniques are traditionally resource intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.

AB - While machine learning techniques are traditionally resource intensive, we are currently witnessing an increased interest in hardware- and energy-efficient approaches. This need for resource-efficient machine learning is primarily driven by the demand for embedded systems and their usage in ubiquitous computing and IoT applications. In this article, we provide a resource-efficient approach for multi-channel speech enhancement based on Deep Neural Networks (DNNs). In particular, we use reduced-precision DNNs for estimating a speech mask from noisy, multi-channel microphone observations. This speech mask is used to obtain either the Minimum Variance Distortionless Response (MVDR) or Generalized Eigenvalue (GEV) beamformer. In the extreme case of binary weights and reduced-precision activations, a significant reduction of execution time and memory footprint is possible while still obtaining an audio quality almost on par with single-precision DNNs and a slightly larger Word Error Rate (WER) for single-speaker scenarios using the WSJ0 speech corpus.

M3 - Preprint

T3 - arXiv.org e-Print archive

BT - Resource-Efficient Speech Mask Estimation for Multi-Channel Speech Enhancement

ER -
