Automatic speech recognition (ASR) systems were originally designed to cope with carefully pronounced speech. As a consequence, these systems cannot deal well with spontaneous, conversational speech. Read and conversational speech differ in many respects. On the linguistic level, conversational speech contains disfluencies and many utterances that might be considered ungrammatical. On the phonetic level, a much higher degree of pronunciation variation is observed in spontaneous than in read speech. Words are more often acoustically reduced compared to their full pronunciations, such that a word like yesterday may sound like yeshay, or a German word like haben may sound like ham. Since most real-world applications of ASR systems require the recognition of spontaneous speech (e.g., dialogue systems, voice input aids for the physically disabled, medical dictation systems), the investigation of new methods to model everyday speech has received a lot of attention among speech technologists.
In the linguistic and psycholinguistic domains as well, casual conversations are studied in search of an answer to how everyday speech production and comprehension work. These studies have indicated that certain higher-level linguistic functions and structures of utterances condition the details of their pronunciation. It is likely that the kind of analysis that is becoming feasible with the growing availability of large speech corpora will bring to light as yet unknown factors that affect pronunciation variation.
The research envisioned in this proposal is designed to increase our knowledge about spontaneous, conversational speech and to use this knowledge to improve ASR systems. The first objective is to identify, by means of quantitative phonetic analyses, which higher-level linguistic structures and functions condition pronunciation variation. Studies will be carried out on Dutch and Austrian German material, which will make it possible to draw conclusions about which findings are language-specific and which are characteristic of conversational speech in general. The second objective is to improve ASR technology by incorporating the knowledge gained about the conditions for pronunciation variation. Most ASR systems still deal with acoustic and linguistic information independently of each other. In contrast, I propose a cross-layer pronunciation modeling technique, which (1) makes use of the knowledge gained about the effects of several layers of linguistic structures and functions on pronunciation variation, and (2) employs lexicons in more than one layer of the recognizer's architecture. Additional deliverables of this project are the collected speech material and the tools created for its automatic annotation, both of which will be of great value for future studies by linguists and engineers.