Conversational Speech Recognition Needs Data? Experiments with Austrian German

Julian Linke; Philip N. Garner; Gernot Kubin; Barbara Schuppler

Institut für Signalverarbeitung und Sprachkommunikation (4420)

Publication: Conference contribution › Paper › Peer-reviewed

Abstract

Conversational speech represents one of the most complex automatic speech recognition (ASR) tasks owing to the high
inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to
low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large
amounts of otherwise unrelated data. In this study, we characterise an LR Austrian German conversational task. We begin
with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements
consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show
that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use
of a leave-one-conversation-out technique, we demonstrate that robustness problems remain with respect to inter-speaker
and inter-conversation variation. This serves to guide where future research might best be focused in light of the current
state-of-the-art.
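
The leave-one-conversation-out idea mentioned in the abstract can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: the `(conversation_id, utterance_id)` record layout, the toy IDs and the function name `leave_one_conversation_out` are all assumptions made for illustration. Each fold holds out every utterance of one conversation for testing and keeps the rest for training.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical record layout: (conversation_id, utterance_id).
Utterance = Tuple[str, str]


def leave_one_conversation_out(
    utterances: List[Utterance],
) -> Iterator[Tuple[str, List[Utterance], List[Utterance]]]:
    """Yield (held_out_conversation, train, test) once per conversation."""
    # Group utterances by the conversation they belong to.
    by_conv: Dict[str, List[Utterance]] = defaultdict(list)
    for utt in utterances:
        by_conv[utt[0]].append(utt)

    # One fold per conversation, in sorted order for reproducibility.
    for held_out in sorted(by_conv):
        test = list(by_conv[held_out])
        train = [
            utt
            for conv, utts in by_conv.items()
            if conv != held_out
            for utt in utts
        ]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy data; the IDs are invented for this example.
    toy = [("conv_a", "u1"), ("conv_a", "u2"), ("conv_b", "u3"), ("conv_c", "u4")]
    for conv, train, test in leave_one_conversation_out(toy):
        print(conv, len(train), len(test))
```

Comparing the error rate across folds then gives a rough picture of how much performance depends on which conversation is held out.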

Original language	English
Pages	4684–4691
Number of pages	8
Publication status	Published - 2022

Keywords

Speech Recognition
Conversational Speech
Austrian German
Low-Resource
Wav2vec2.0
Kaldi

FWF - CLCS_2 - Cross-layer Prosodie Modelle für Spontansprache
Schuppler, B.
1/10/18 → 30/11/21
Project: Research project

Cite this

@conference{c6fe4431f2574571a3f5d532118560a4,

title = "Conversational Speech Recognition Needs Data? Experiments with Austrian German",

abstract = "Conversational speech represents one of the most complex of automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an (LR) Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use of a leave-one-conversation out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state-of-the-art.",

keywords = "Speech Recognition, Conversational Speech, Austrian German, Low-Resource, Wav2vec2.0, Kaldi",

author = "Julian Linke and Garner, {Philip N.} and Gernot Kubin and Barbara Schuppler",

year = "2022",

language = "English",

pages = "4684--4691",

}

TY - CONF

T1 - Conversational Speech Recognition Needs Data? Experiments with Austrian German

AU - Linke, Julian

AU - Garner, Philip N.

AU - Kubin, Gernot

AU - Schuppler, Barbara

PY - 2022

Y1 - 2022

N2 - Conversational speech represents one of the most complex of automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an (LR) Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use of a leave-one-conversation out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state-of-the-art.

AB - Conversational speech represents one of the most complex of automatic speech recognition (ASR) tasks owing to the high inter-speaker variation in both pronunciation and conversational dynamics. Such complexity is particularly sensitive to low-resourced (LR) scenarios. Recent developments in self-supervision have allowed such scenarios to take advantage of large amounts of otherwise unrelated data. In this study, we characterise an (LR) Austrian German conversational task. We begin with a non-pre-trained baseline and show that fine-tuning of a model pre-trained using self-supervision leads to improvements consistent with those in the literature; this extends to cases where a lexicon and language model are included. We also show that the advantage of pre-training indeed arises from the larger database rather than the self-supervision. Further, by use of a leave-one-conversation out technique, we demonstrate that robustness problems remain with respect to inter-speaker and inter-conversation variation. This serves to guide where future research might best be focused in light of the current state-of-the-art.

KW - Speech Recognition

KW - Conversational Speech

KW - Austrian German

KW - Low-Resource

KW - Wav2vec2.0

KW - Kaldi

M3 - Paper

SP - 4684

EP - 4691

ER -
