ASR Flashcards
(84 cards)
What is LibriSpeech?
LibriSpeech is a large corpus of read English speech: about 1,000 hours of audiobooks sampled at 16 kHz. It is split into "clean" and "other" subsets based on WER.
What is data segmentation in this context?
What are senones and chenones?
Phonemes and characters in context, respectively.
What are the 2 main categories and types of ASR models?
HMM-based models and end-to-end models.
What is MuST-C?
Introduced by Di Gangi et al. in "MuST-C: a Multilingual Speech Translation Corpus".
MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split.
What is an alternative to the encoder-decoder architecture in speech recognition?
CTC, connectionist temporal classification (it is also the name of the loss function).
Do we use all the input frames? Why?
No, we usually skip frames. This is important for keeping up with the speech in online dictation. We also assume the signal does not change much over the frames we skip; this is the criterion for choosing how much to skip.
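A minimal sketch of frame skipping, assuming the features are already available as a list of per-frame vectors. The function name `skip_frames`, the skip factor of 3 and the 10 ms hop mentioned in the comments are illustrative assumptions, not taken from any particular toolkit.

```python
def skip_frames(frames, skip=3):
    """Keep every `skip`-th frame, starting from the first one."""
    if skip < 1:
        raise ValueError("skip must be >= 1")
    return frames[::skip]


# Ten dummy one-dimensional frames (assumed 10 ms hop between frames).
frames = [[float(i)] for i in range(10)]

# Keeping frames 0, 3, 6 and 9 gives an effective 30 ms hop.
reduced = skip_frames(frames, skip=3)
```

A larger skip factor lowers the compute per second of audio, at the cost of relying more heavily on the assumption that the signal is stable across the skipped frames.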
What is the typical word error rate (WER) on different types of speech recognition datasets?
On read speech it is ~2%.
On conversational speech it is between 5.8% and 11%.
It is even higher with accents, noise, etc.
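For reference, WER is the word-level edit distance (substitutions, deletions and insertions) between the hypothesis and the reference, divided by the number of reference words. A small sketch follows; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.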
What is a disadvantage of end-to-end techniques?
They require more training data to reach the same performance as a hybrid model. However, they are usually not phoneme-based, so they are less expensive in that sense: they do not require a phonetic lexicon.
On what kinds of examples are graphemes thought to be weaker than phonemes?
Proper nouns and rare words, but grapheme-based models are now pretty good on these too.
What is a phoneme?
A unique, discrete unit of sound in a language that can be used to differentiate words.
You can also see it as something that, if you change it, can change the meaning of a word (for example, /p/ versus /b/ in "pat" and "bat").
What is TIMIT?
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences. It also comes with the word and phone-level transcriptions of the speech.
Phone boundaries are hand marked.
What are the parts of an HMM-based model and what do they do?
An HMM-based model is divided into three parts: an acoustic model, a pronunciation model and a language model. Each model is independent of the others and plays a different role. The acoustic model models the mapping between the speech input and the feature sequence, the pronunciation model maps phonemes (or sub-phonemes) to graphemes, and the language model maps the character sequence to a fluent final transcription.
What are typical datasets used in the team?
Accents: non-native German speakers with accents
Apttek: colloquial phone conversations
Multidistances: native German speakers telling their stories, recorded at different distances
What is Common Voice?
Common Voice is an audio dataset in which each entry consists of a unique MP3 and a corresponding text file. There are 9,283 recorded hours in the dataset, of which 7,335 hours are validated, across 60 languages. The dataset also includes demographic metadata such as age, sex, and accent.
What is likely the hardest punctuation mark to model?
Commas
What does LVCSR stand for?
Large Vocabulary Continuous Speech Recognition (LVCSR).
LVCSR systems can be divided into two categories: HMM-based models and end-to-end models.
What are the two main deficiencies of CTC models?
- CTC cannot model interdependencies within the output sequence because it assumes that output elements are independent of each other. Therefore, CTC cannot learn the language model. The speech recognition network trained by CTC should be treated as only an acoustic model.
- CTC can only map an input sequence to an output sequence that is no longer than the input. It therefore cannot handle cases where the output sequence is longer than the input.
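The length constraint can be made concrete with a small check. Under the standard CTC collapsing rule, a label sequence needs one frame per label, plus one extra blank frame between each pair of identical adjacent labels. The helper names `min_ctc_frames` and `ctc_feasible` below are hypothetical, chosen only for this sketch.

```python
def min_ctc_frames(labels):
    """Smallest number of input frames that can emit `labels` under CTC."""
    # Identical neighbours need a blank between them, otherwise the
    # collapsing step would merge them into a single label.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def ctc_feasible(num_frames, labels):
    """True if an input of `num_frames` frames can be aligned to `labels`."""
    return num_frames >= min_ctc_frames(labels)


# "hello" has 5 labels plus one repeated pair ("ll"), so it needs 6 frames.
frames_needed = min_ctc_frames("hello")
```

This is one reason aggressive frame skipping or subsampling has to be balanced against the length of the target transcripts.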
What is Switchboard?
A corpus of telephone conversations between strangers from the early 1990s: 2,430 conversations averaging 6 minutes each, about 240 hours in total, sampled at 8 kHz.
It has tons of linguistic annotations.
Is the FFT spectrogram output small enough?
No, it is still too big, so we shrink it by taking weighted sums of the FFT bins, with the weights spaced on the mel scale.
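A rough sketch of this reduction using triangular mel filters in pure Python. The mel formula used here (2595 * log10(1 + f / 700)) is one common convention; the sizes (257 FFT bins, 40 filters, 16 kHz) and all function names are assumptions for illustration, and real toolkits differ in details such as normalization.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular weights; the FFT bins span 0 Hz to sample_rate / 2."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_filters + 2 edge points, equally spaced on the mel scale.
    hz_points = [mel_to_hz(max_mel * i / (num_filters + 1))
                 for i in range(num_filters + 2)]
    bin_hz = [nyquist * k / (num_bins - 1) for k in range(num_bins)]

    filters = []
    for m in range(1, num_filters + 1):
        left, center, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising edge
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling edge
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def apply_filterbank(power_spectrum, filters):
    """Weighted sum of the FFT bins for each mel filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum))
            for row in filters]


# One frame of a flat dummy power spectrum: 257 values become 40.
filters = mel_filterbank(num_filters=40, num_bins=257, sample_rate=16000)
mel_energies = apply_filterbank([1.0] * 257, filters)
```

The filters are narrow at low frequencies and wide at high frequencies, which is how the mel spacing shows up in the weights.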
What strong conditional independence assumption does a CTC model make (especially during inference)?
That the output at time t is conditionally independent of the outputs at all other times, given the input. So the probability of an alignment A = (a_1, ..., a_T) is just the product over t of p(a_t | X).
Taking the argmax at each frame and then collapsing the result gives a simple inference procedure.
To get P(Y | X) for a final utterance Y, you have to sum over all the possible alignments that collapse to that same utterance.
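A toy sketch of what the independence assumption buys: the probability of an alignment is a plain product of per-frame probabilities, and greedy inference is a per-frame argmax followed by collapsing. The blank symbol `_`, the three-symbol vocabulary and the probabilities are made up for illustration.

```python
BLANK = "_"


def alignment_probability(frame_probs, alignment):
    """Product of per-frame probabilities (frames assumed independent)."""
    p = 1.0
    for dist, symbol in zip(frame_probs, alignment):
        p *= dist[symbol]
    return p


def greedy_decode(frame_probs):
    """Argmax per frame, then merge repeats and drop blanks."""
    best_path = [max(dist, key=dist.get) for dist in frame_probs]
    output = []
    previous = None
    for symbol in best_path:
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return "".join(output)


# Four frames of made-up posteriors over the symbols "_", "a" and "b".
frame_probs = [
    {"_": 0.1, "a": 0.8, "b": 0.1},
    {"_": 0.2, "a": 0.7, "b": 0.1},
    {"_": 0.6, "a": 0.2, "b": 0.2},
    {"_": 0.1, "a": 0.1, "b": 0.8},
]

decoded = greedy_decode(frame_probs)                     # best path "aa_b"
path_prob = alignment_probability(frame_probs, "aa_b")   # 0.8 * 0.7 * 0.6 * 0.8
```

Greedy decoding only scores the single best alignment, so it is an approximation: the most probable utterance may differ once the probabilities of all its alignments are added up.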
Is the collapsing function of CTC many-to-one?
Yes, many different longer alignments can be collapsed into the same final utterance. That is why you have to sum over all of them in several places, such as the loss calculation.
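The many-to-one mapping can be illustrated with a brute-force sketch that enumerates every alignment and adds up those that collapse to the target. This is exponential in the number of frames and is only meant to show the idea; real implementations use a dynamic-programming (forward-backward) recursion instead. The blank symbol `_` and the function names are assumptions.

```python
from itertools import product

BLANK = "_"


def collapse(alignment):
    """Merge repeated symbols, then remove blanks."""
    output = []
    previous = None
    for symbol in alignment:
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return "".join(output)


def sequence_probability(frame_probs, target):
    """Sum the probabilities of all alignments that collapse to `target`."""
    symbols = list(frame_probs[0])
    total = 0.0
    for alignment in product(symbols, repeat=len(frame_probs)):
        if collapse(alignment) == target:
            p = 1.0
            for dist, symbol in zip(frame_probs, alignment):
                p *= dist[symbol]
            total += p
    return total


# Two frames, uniform over blank and "a": the alignments "a_", "_a"
# and "aa" all collapse to "a", so P("a") = 3 * 0.25.
uniform = [{"_": 0.5, "a": 0.5}, {"_": 0.5, "a": 0.5}]
p_a = sequence_probability(uniform, "a")
```

Note that a blank between two identical symbols keeps them separate, so `collapse("a_a")` gives "aa" while `collapse("aa")` gives "a".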
What is the main disadvantage of phonetic-based models?
You need a phonetic lexicon created by experts and linguists, which is very expensive and hard to scale.


