A Pitch-Synchronous Simultaneous Detection-Estimation Framework for Speech Enhancement

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Speech enhancement methods formulated in the short-time Fourier transform (STFT) domain vary in the statistical assumptions made on the STFT coefficients, in the optimization criteria applied or in the models of the signal components. Recently, approaches relying on a stochastic-deterministic speech model have been proposed. The deterministic part of the signal corresponds to harmonically related sinusoids, often used to represent voiced speech. The stochastic part models signal components that are not captured by the deterministic components. In this paper, we consider this scenario under a new perspective yielding three main contributions. First, a pitch-synchronous signal representation is considered and shown to be advantageous for the estimation of the harmonic model parameters. Second, we model the harmonic amplitudes in voiced speech as random variables with frequency bin dependent Gamma distributions. Finally, distinct estimators for the different models of voiced speech, unvoiced speech, and speech absence are derived. To select from the arising estimates, we take into account the mutual impact of detection and estimation by proposing a binary decision framework that is derived from a Bayesian risk function. The resulting pitch-synchronous stochastic-deterministic estimator outperforms several benchmark methods in terms of speech intelligibility and perceived quality predicted by instrumental measures for various noise types and different signal-to-noise ratios.
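As a rough illustration of the stochastic-deterministic model described in the abstract, the sketch below synthesizes a frame as a sum of harmonically related sinusoids (deterministic part) plus a zero-mean Gaussian residual (stochastic part). It is not the authors' implementation: the function name, its parameters, the Gaussian residual, and the choice of a frame length equal to a whole number of pitch periods as a stand-in for "pitch-synchronous" are all assumptions made for this example.

```python
import math
import random


def synthesize_voiced_frame(f0_hz, amplitudes, phases, fs_hz, num_samples,
                            noise_std=0.0, seed=0):
    """Hypothetical helper: harmonics at multiples of f0 plus a Gaussian residual."""
    rng = random.Random(seed)
    frame = []
    for n in range(num_samples):
        # Deterministic part: sinusoids at integer multiples of the fundamental.
        deterministic = sum(
            a * math.cos(2.0 * math.pi * (h + 1) * f0_hz * n / fs_hz + p)
            for h, (a, p) in enumerate(zip(amplitudes, phases))
        )
        # Stochastic part (assumed Gaussian here): everything the harmonics miss.
        stochastic = rng.gauss(0.0, noise_std)
        frame.append(deterministic + stochastic)
    return frame


# Assumed example values: f0 = 100 Hz at fs = 8000 Hz gives an 80-sample pitch period.
period = round(8000 / 100)
clean = synthesize_voiced_frame(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0],
                                8000.0, 2 * period)
noisy = synthesize_voiced_frame(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0],
                                8000.0, 2 * period, noise_std=0.1)
```

With the residual switched off, the frame repeats exactly every pitch period, which is the property a frame length tied to the pitch period is meant to exploit in this sketch.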
Original language: English
Pages (from-to): 436-450
Number of pages: 15
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 26
Issue number: 2
DOI: 10.1109/TASLP.2017.2779405
Publication status: Published - 4 Dec 2017

Fingerprint

Speech enhancement
augmentation
Fourier Analysis
Fourier transforms
estimators
Speech intelligibility
Benchmarking
Bins
harmonics
Random variables
Signal-To-Noise Ratio
intelligibility
sine waves
Noise
scenario
optimization

Cite this

A Pitch-Synchronous Simultaneous Detection-Estimation Framework for Speech Enhancement. / Stahl, Johannes; Mowlaee Beikzadehmahaleh, Pejman.

In: IEEE/ACM Transactions on Audio Speech and Language Processing, Vol. 26, No. 2, 04.12.2017, p. 436-450.

@article{975fe2ba60824d64a897f3472e0ffd4b,
title = "A Pitch-Synchronous Simultaneous Detection-Estimation Framework for Speech Enhancement",
abstract = "Speech enhancement methods formulated in the short-time Fourier transform (STFT) domain vary in the statistical assumptions made on the STFT coefficients, in the optimization criteria applied or in the models of the signal components. Recently, approaches relying on a stochastic-deterministic speech model have been proposed. The deterministic part of the signal corresponds to harmonically related sinusoids, often used to represent voiced speech. The stochastic part models signal components that are not captured by the deterministic components. In this paper, we consider this scenario under a new perspective yielding three main contributions. First, a pitch-synchronous signal representation is considered and shown to be advantageous for the estimation of the harmonic model parameters. Second, we model the harmonic amplitudes in voiced speech as random variables with frequency bin dependent Gamma distributions. Finally, distinct estimators for the different models of voiced speech, unvoiced speech, and speech absence are derived. To select from the arising estimates, we take into account the mutual impact of detection and estimation by proposing a binary decision framework that is derived from a Bayesian risk function. The resulting pitch-synchronous stochastic-deterministic estimator outperforms several benchmark methods in terms of speech intelligibility and perceived quality predicted by instrumental measures for various noise types and different signal-to-noise ratios.",
author = "Stahl, Johannes and {Mowlaee Beikzadehmahaleh}, Pejman",
year = "2017",
month = "12",
day = "4",
doi = "10.1109/TASLP.2017.2779405",
language = "English",
volume = "26",
pages = "436--450",
journal = "IEEE/ACM Transactions on Audio Speech and Language Processing",
issn = "2329-9290",
publisher = "Institute of Electrical and Electronics Engineers",
number = "2"
}

TY - JOUR

T1 - A Pitch-Synchronous Simultaneous Detection-Estimation Framework for Speech Enhancement

AU - Stahl, Johannes

AU - Mowlaee Beikzadehmahaleh, Pejman

PY - 2017/12/4

Y1 - 2017/12/4

AB - Speech enhancement methods formulated in the short-time Fourier transform (STFT) domain vary in the statistical assumptions made on the STFT coefficients, in the optimization criteria applied or in the models of the signal components. Recently, approaches relying on a stochastic-deterministic speech model have been proposed. The deterministic part of the signal corresponds to harmonically related sinusoids, often used to represent voiced speech. The stochastic part models signal components that are not captured by the deterministic components. In this paper, we consider this scenario under a new perspective yielding three main contributions. First, a pitch-synchronous signal representation is considered and shown to be advantageous for the estimation of the harmonic model parameters. Second, we model the harmonic amplitudes in voiced speech as random variables with frequency bin dependent Gamma distributions. Finally, distinct estimators for the different models of voiced speech, unvoiced speech, and speech absence are derived. To select from the arising estimates, we take into account the mutual impact of detection and estimation by proposing a binary decision framework that is derived from a Bayesian risk function. The resulting pitch-synchronous stochastic-deterministic estimator outperforms several benchmark methods in terms of speech intelligibility and perceived quality predicted by instrumental measures for various noise types and different signal-to-noise ratios.

U2 - 10.1109/TASLP.2017.2779405

DO - 10.1109/TASLP.2017.2779405

M3 - Article

VL - 26

SP - 436

EP - 450

JO - IEEE/ACM Transactions on Audio Speech and Language Processing

JF - IEEE/ACM Transactions on Audio Speech and Language Processing

SN - 2329-9290

IS - 2

ER -