ASR Flashcards

Question

What are the components of an ASR system?

Answer 1

* **Feature Extraction**: It converts the speech signal into a sequence of acoustic feature vectors. These observations should be compact and carry sufficient information for recognition in the later stage.
* **Acoustic Model**: It contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
* **Language Model**: It contains a massive list of words and their probability of occurrence in a given sequence.
* **Decoder**: It is a software program that takes the sounds spoken by a user and searches the acoustic model for the equivalent sounds. When a match is made, the decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the language model for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.

Answer 2

It allows you to be hands-free, it is a more natural way to communicate, and it improves accessibility.

Answer 3

The concept was invented for ASR. It means modelling each phone as a triple: a leading part, a stable part, and a trailing part. Note that this is why senones depend on context: the leading and trailing parts depend on the phones before and after.

Answer 4

For readability, but not only that: it is also crucial for better downstream NLP understanding.

Answer 5

* **The training process is complex and difficult to optimize globally.** The HMM-based model often uses different training methods and data sets to train different modules. Each module is optimized independently with its own objective function, which is generally different from the true LVCSR performance evaluation criteria. So the optimality of each module does not necessarily bring global optimality.
* **Conditional independence assumptions.** To simplify the model’s construction and training, the HMM-based model uses conditional independence assumptions within the HMM and between different modules. This does not match the actual situation of LVCSR.

Answer 6

The idea is to have an output for every input, i.e. for every audio frame. Then we collapse the output into the actual, shorter final sentence.
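The collapse step can be sketched in a few lines. This is a minimal illustration assuming the usual CTC rule (merge consecutive repeats, then drop blanks); the blank symbol `_` and the function name `ctc_collapse` are choices made for this sketch, not something from the card.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One label per audio frame; the blank between the two "l" runs keeps the double letter.
frames = list("hh_e_ll_lo_")
print(ctc_collapse(frames))  # hello
```

Note that a blank is required between two identical output letters, otherwise they would be merged into one.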

Answer 7

Introduced by Warden in "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition". Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

Answer 8

Although HMM-DNN still provides state-of-the-art results, the role played by the DNN is limited. It is mainly used to model the posterior state probability of the HMM’s hidden state. The time-domain feature is still modeled by the HMM. **_When attempting to model time-domain features using an RNN or CNN instead of an HMM, one faces a data alignment problem_**: both RNN and CNN _loss functions are defined at each point in the sequence, so in order to be able to perform training, it is necessary to know the alignment relation between the RNN output sequence and the target sequence._

Answer 9

Yes. Because of the conditional independence assumption made by the algorithm, it does not learn an LM at all during training.

Answer 10

Enumerating the combinations of phones can be combinatorially hard, so we cluster them. This clustering can also be learned by the model.

Answer 11

No. We use senones: a triple with a leading and a trailing part.

Answer 12

Because input and output have very different lengths. Note that this length mismatch is extreme for speech, so we need some form of compression.

Answer 13

There are several sources of variability:

* Style: continuous speech or different words, conversation, reading, dictation, etc.
* Environment: noise, distance from microphone
* Rate of speech
* Accents

Answer 14

The RNN-transducer has many similarities with CTC: their main goals are to solve the forced segmentation alignment problem in speech recognition; they both introduce a “blank” label; they both calculate the probability of all possible paths and aggregate them to get the label sequence. However, their path generation processes and path probability calculation methods are completely different. This gives rise to the advantages of the RNN-transducer over CTC.

Answer 15

No, not usually.

Answer 16

A search optimization problem: among all word sequences, we want to find the most likely one given the input. There are too many candidates, so we use models to compute probabilities and prune the search.
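This search is usually written as the following maximization. The symbols are assumptions introduced here: $X$ is the sequence of acoustic observations and $W$ a candidate word sequence; the second form follows from Bayes' rule, since $p(X)$ does not depend on $W$.

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W)$$

In this reading, $p(X \mid W)$ is supplied by the acoustic model and $P(W)$ by the language model.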

Answer 17

Skype, Cortana, Cognitive, Office dictation, Teams, etc.

Answer 18

* **CTC-based**: It first enumerates all possible hard alignments. Then, it achieves soft alignment by aggregating these hard alignments. CTC assumes that output labels are independent of each other when enumerating hard alignments.
* **RNN-transducer**: It also enumerates all possible hard alignments and then aggregates them for soft alignment. But unlike CTC, the RNN-transducer does not make independence assumptions about labels when enumerating hard alignments. Thus, it is different from CTC in terms of path definition and probability calculation.
* **Attention-based**: This method no longer enumerates all possible hard alignments, but uses an attention mechanism to directly calculate the soft alignment information between input data and output labels.

Answer 19

You go from a 512- or 768-point FFT to around 40+ dimensions.
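A rough sketch of that reduction, assuming a mel filter bank is what does the compression: a 512-point FFT gives 257 frequency bins, and 40 triangular filters turn them into 40 log-mel values. The 16 kHz sample rate, the 40 filters, the HTK-style mel formula, and all function names are assumptions made for this example.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_fft=512, n_mels=40, sample_rate=16000):
    """Return n_mels triangular filters, each with n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge points, equally spaced on the mel scale.
    hz_points = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        left, centre, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            row.append(max(0.0, min(rising, falling)))  # triangle, zero outside
        bank.append(row)
    return bank


def log_mel(power_spectrum, bank):
    """Project one frame's power spectrum onto the filters and take the log."""
    return [math.log(sum(w * p for w, p in zip(row, power_spectrum)) + 1e-10)
            for row in bank]


bank = mel_filterbank()
features = log_mel([1.0] * 257, bank)  # flat dummy spectrum, just to show shapes
print(len(bank[0]), len(features))  # 257 40
```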

Answer 20

It is a sequence labelling task with punctuation classes , . ' ? and
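As a small sketch of what sequence labelling means here: a tagger predicts one label per token, and the label says which punctuation mark (if any) follows that token. The label `O` for "no punctuation" and the function name are assumptions for this example; the predictions are hard-coded rather than produced by a model.

```python
# Hypothetical label set: "O" means "no punctuation after this token".
def apply_punctuation(tokens, labels):
    """Attach the predicted punctuation mark (if any) after each token."""
    pieces = []
    for token, label in zip(tokens, labels):
        pieces.append(token if label == "O" else token + label)
    return " ".join(pieces)


tokens = ["hello", "how", "are", "you"]
labels = [",", "O", "O", "?"]  # one label per token, as a tagger would predict
print(apply_punctuation(tokens, labels))  # hello, how are you?
```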

Answer 21

Because the lengths of the input (the log-mel features) and the output (letters or words) are very different. When the mismatch is extreme, you need compression.

Answer 22

CTC mainly overcomes the following **two difficulties** for end-to-end LVCSR models:

* **Data alignment problem**. CTC no longer needs to segment and align training data. This solves the alignment problem so that a DNN can be used to model time-domain features, which greatly enhances the DNN’s role in LVCSR tasks.
* **Directly output the target transcriptions**. Traditional models often output phonemes or other small units, and further processing is required to obtain the final transcriptions. CTC eliminates the need for small units and outputs directly in the final target form, greatly simplifying the construction and training of an end-to-end model.

Answer 23

Around 20 to 60 phonemes.

Answer 24

Yes, think about a case where you predict a long sentence while the ground truth is just one word.
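Assuming this card is about whether the word error rate (WER) can exceed 100% (the question itself is not shown here), the case can be checked with a small sketch. WER is taken as word-level edit distance divided by the number of reference words; the function names and the example sentences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


# One-word ground truth, four-word prediction: three insertions over one reference word.
print(wer("hello", "hello there my friend"))  # 3.0
```

The insertions are counted against a reference of length one, which is how the ratio ends up above 1.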

Answer 25

* **Out-of-vocabulary (OOV) errors**: Current state-of-the-art speech recognizers have closed vocabularies. This means that they are incapable of recognizing words outside their training vocabulary. Besides misrecognition, the presence of an out-of-vocabulary word in an input utterance causes the system to err to a similar word in its vocabulary. Special techniques for handling OOV words have been developed for HMM-GMM and neural ASR systems (see, e.g., Zhang, 2019).
* **Homophone substitution**: These errors can occur if more than one lexical entry has the same pronunciation (phone sequence), i.e., they are homophones. While decoding, homophones may be confused with one another, causing errors. In general, _a well-functioning language model should disambiguate homophones based on the context._
* **Language model bias**: Because of an undue bias towards the language model (effected by a high relative weight on the language model), the decoder may be forced to reject the true hypothesis in favor of a spurious candidate with high language model probability. These errors may occur along with analogous acoustic model bias.
* **Multiple acoustic problems**: This is a broad category of errors comprising those due to bad pronunciation entries; disfluency, mispronunciation by the speaker himself/herself, or errors made by acoustic models (possibly due to acoustic noise, data mismatch between training and usage etc.).

Answer 26

It varies from 10 to 25 ms; we always assume the signal is stationary within that window.
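For a sense of scale, here is a minimal framing sketch. The 16 kHz sample rate and the 10 ms hop between frames are assumptions added for this example (the card only gives the 10 to 25 ms window range), and `frame_signal` is a hypothetical helper.

```python
def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a waveform into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


signal = [0.0] * 16000  # one second of silence at an assumed 16 kHz
frames = frame_signal(signal, 16000)
print(len(frames[0]), len(frames))  # 400 98
```

Each 25 ms frame holds 400 samples, and one second of audio yields 98 frames, each treated as a stationary snapshot.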

Answer 27

Because it was hard to go from phones to words.

Answer 28

There are too many: 40^3 = 64K possible triphones, but only around 40K to 50K will actually be used.

How many triphones are there? Consider a 40-phone system:

* 40^3 = 64,000 possible triphones; in a cross-word system maybe 50,000 can occur
* Number of parameters:
  * 50,000 three-state HMMs, with 10-component Gaussian mixtures per state: 1.5M Gaussians
  * 39-dimension feature vectors (12 MFCCs + energy, plus deltas and accelerations)
  * Assuming diagonal Gaussians: about 790 parameters/state
  * Total: about 118 million parameters!
* We would need a very large amount of training data to train such a system: to enable robust estimation of all parameters, and to ensure that all possible triphones are observed (more than once) in the training data
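The arithmetic behind these figures can be reproduced directly. The only assumption added here is how the "about 790 parameters/state" is broken down: each diagonal Gaussian is counted as 39 means, 39 variances, and 1 mixture weight.

```python
n_phones = 40
n_triphones = 50_000        # triphones that can occur in a cross-word system
states_per_hmm = 3
mixtures_per_state = 10
feature_dim = 39            # 12 MFCCs + energy, with deltas and accelerations

possible_triphones = n_phones ** 3
n_states = n_triphones * states_per_hmm
n_gaussians = n_states * mixtures_per_state
# Assumed breakdown per diagonal Gaussian: means + variances + one mixture weight.
params_per_gaussian = 2 * feature_dim + 1
params_per_state = mixtures_per_state * params_per_gaussian
total_params = n_states * params_per_state

print(possible_triphones, n_gaussians, params_per_state, total_params)
# 64000 1500000 790 118500000
```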

Answer 29

No, they compute P(q|x): they are discriminative. Luckily, you can use Bayes' theorem to invert that probability.

Answer 30

The precise temporal structure allows you to avoid any explicit hand segmentation in terms of speech units like phonemes, so it is excellent for continuous speech. There are a lot of assumptions behind it, but it works well and, with the Viterbi implementation, it is pretty fast.

Answer 31

Advantages of NNs:

* Can easily model correlated features
  * Correlated feature vector components (e.g. spectral features)
  * Input context: multiple frames of data at input
* More flexible than GMMs: not made of (nearly) local components; GMMs are inefficient for non-linear class boundaries
* NNs can model multiple events in the input simultaneously, with different sets of hidden units modelling each event; GMMs assume each frame is generated by a single mixture component
* NNs can learn richer representations and learn ‘higher-level’ features (tandem, posteriorgrams, bottleneck features)

Answer 32

Disadvantages of NNs in the 1990s:

* Context-independent (monophone) models, weak speaker adaptation algorithms
* NN systems less complex than GMMs (fewer parameters): RNN < 100k parameters, MLP ∼ 1M parameters
* Computationally expensive: more difficult to parallelise training than GMM systems

**This is no longer the case today.**

Answer 33

**Posterior probability estimation**

Consider a neural network trained as a classifier, where each output corresponds to a class. When applying a trained network to test data, it can be shown that the value of the output corresponding to class $j$, given an input $x_t$, is an estimate of the posterior probability $P(q_t = j \mid x_t)$. (This is because we have softmax outputs and use a cross-entropy loss function.) Using Bayes' rule we can relate the posterior $P(q_t = j \mid x_t)$ to the likelihood $p(x_t \mid q_t = j)$ used as an output probability in an HMM:

$$P(q_t = j \mid x_t) = \frac{p(x_t \mid q_t = j)\, P(q_t = j)}{p(x_t)}$$

**Scaled likelihoods**

If we would like to use NN outputs as output probabilities in an HMM, then we would like probabilities (or densities) of the form $p(x \mid q)$, i.e. likelihoods. We can write scaled likelihoods as:

$$\frac{P(q_t = j \mid x_t)}{P(q_t = j)} = \frac{p(x_t \mid q_t = j)}{p(x_t)}$$

* Scaled likelihoods can be obtained by “dividing by the priors”: divide each network output $P(q_t = j \mid x_t)$ by $P(q_t = j)$, the relative frequency of class $j$ in the training data
* Using $p(x_t \mid q_t = j) / p(x_t)$ rather than $p(x_t \mid q_t = j)$ is OK since $p(x_t)$ does not depend on the class $j$
* Use the scaled likelihoods obtained from a neural network in place of the usual likelihoods obtained from a GMM
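"Dividing by the priors" amounts to one element-wise division per frame. The sketch below uses made-up posteriors and priors for three classes; `scaled_likelihoods` is a hypothetical helper name.

```python
def scaled_likelihoods(posteriors, priors):
    """Divide each network output P(q=j | x) by the class prior P(q=j)."""
    return [post / prior for post, prior in zip(posteriors, priors)]


# Made-up softmax outputs for one frame, and class frequencies from training data.
posteriors = [0.7, 0.2, 0.1]
priors = [0.5, 0.3, 0.2]

scaled = scaled_likelihoods(posteriors, priors)
print(scaled)
```

The results are no longer probabilities (they need not sum to 1), which is fine because the decoder only compares them across states for the same frame.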

Answer 34

Because they are trained to maximize the likelihood. They want P(X|M), where M is the best HMM (set of states); they do not minimize the probability of the wrong states, so they are not discriminative.

Answer 35

Time Delay Neural Network

Answer 36

They are used in the hybrid HMM/DNN models in place of a DNN or RNN. They are time delay neural networks: the input at time t is connected with the neuron inputs at times t-n, where n is the context. So they are very powerful at modelling time series, and they are a bit faster to train compared to RNNs. The TDNN is essentially a 1-D convolutional neural network without pooling and with dilations. See "Time Delay Neural Network" (KaleidoEscape – Linguist turned Programmer).
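The "1-D convolution without pooling and with dilations" view can be sketched as follows. To keep it short, this assumes one scalar feature per frame and a single filter (real TDNN layers use feature vectors and many filters); the weights, the ReLU, and the name `tdnn_layer` are illustrative choices. Outputs are indexed by the first frame of each window, which is the same computation as looking back n frames, just shifted.

```python
def tdnn_layer(frames, weights, dilation=1):
    """1-D convolution over time with dilation, no pooling, followed by ReLU.

    Output t combines frames t, t + dilation, t + 2 * dilation, ...
    """
    span = (len(weights) - 1) * dilation  # temporal context covered by one output
    outputs = []
    for t in range(len(frames) - span):
        acc = sum(w * frames[t + k * dilation] for k, w in enumerate(weights))
        outputs.append(max(0.0, acc))
    return outputs


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# Three taps with dilation 2: each output sees frames t, t+2 and t+4.
print(tdnn_layer(frames, [0.5, 0.0, 0.5], dilation=2))  # [3.0, 4.0, 5.0]
```

Stacking such layers with growing dilation widens the temporal context without any recurrence.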

Answer 37

* GMMs: filter bank features (spectral domain) are not used, as they are strongly correlated with each other; they would require either full covariance matrix Gaussians or many diagonal covariance Gaussians
* DNNs do not require the components of the feature vector to be uncorrelated
  * Can directly use multiple frames of input context (this has been done in NN/HMM systems since 1990, and is crucial to make them work well)
  * Can potentially use feature vectors with correlated components (e.g. filter banks)
* Experiments indicate that mel-scaled filter bank features (FBANK) result in greater accuracy than MFCCs

Answer 38

Yes, but you can use an iterative approach.

ANNs trained for classification require supervision (labeled targets for each pattern). An early problem in applying ANN methods to speech recognition was the apparent requirement of hand-labeled frames for ANN training. Since the ANN outputs can be used in the dynamic programming for global decoding (after division by the prior probabilities), it is possible to use embedded Viterbi training to iteratively optimize both the segmentation and the ANN parameters. In this procedure, illustrated in Fig. 8, each ANN training is done using labels from the previous Viterbi alignment. In turn, an ANN is used to estimate training set state probabilities, and dynamic programming given the training set models is used to determine the new labels for the next ANN training. Of course, as for standard HMM Viterbi training, one must start this procedure somewhere, and also have a consistent criterion for stopping. Many initializations can be used.

Answer 39

Being able to generalize to conditions you were not trained on, in terms of accents, noise, channel, environment, distance, etc.

Answer 40

**Neural networks, both feed-forward and recurrent, can only be used for frame-wise classification of the input audio.** This problem can be addressed using:

* Hidden Markov Models (HMMs) to get the alignment between the input audio and its transcribed output.
* Connectionist Temporal Classification (CTC) loss, which is the most common technique.

Answer 41

Large unsupervised pretrained model.

Answer 42

1M hours, against the roughly 1K hours of labelled data typically used.

Answer 43

They enlarge the semi-supervised (not human-annotated) dataset from 30K to 680K hours, getting closer to the unsupervised size of 1M hours used by wav2vec.

Answer 44

It is a speech recognition model from OpenAI. They collected 680K hours of multilingual, multitask data from the internet (weakly supervised). The goal is to create a robust zero-shot model.

From the paper's abstract: "We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing."

Answer 45

**_~6000_** As demonstrated by Narayanan et al. (2018), Likhomanenko et al. (2020), and Chan et al. (2021) speech recognition systems that are pre-trained in a supervised fashion across many datasets/domains exhibit higher robustness and generalize much more effectively to held-out datasets than models trained on a single source. These works achieve this by combining as many existing high-quality speech recognition datasets as possible. However, there is still only a moderate amount of this data easily available. SpeechStew (Chan et al., 2021) mixes together 7 pre-existing datasets totalling 5,140 hours of supervision. While not insignificant, this is still tiny compared to the previously mentioned 1,000,000 hours of unlabeled speech data utilized in Zhang et al. (2021).

Answer 46

Multilingual LibriSpeech (MLS) (Pratap et al., 2020b) and VoxPopuli (Wang et al., 2021)

Answer 47

VoxPopuli is a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours.

Answer 48

XLS-R (Babu et al., 2021) and mSLAM (Bapna et al., 2022)

Answer 49

**Dataset Summary**: CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

**Supported Tasks and Leaderboards**: speech-translation. The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to transcribe the audio file to written text in another language. The most common evaluation metric is the BLEU score. Examples can be found at https://github.com/pytorch/fairseq/blob/master/examples/speech\_to\_text/docs/covost\_example.md .

**Languages**: The dataset contains the audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese.

Answer 50

FLEURS dataset (Conneau et al., 2022)

Answer 51

You can inject noise directly into the waveform: there are two kinds, white noise and pub noise, and I guess you can do more, like overlaying one recording on another at a lower SNR. You can also speed up the waveform or cut out frames.
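Here is a minimal sketch of the white-noise case, assuming Gaussian noise scaled so that the mix hits a chosen signal-to-noise ratio in dB. The 440 Hz test tone, the 16 kHz rate, the 10 dB target, and the function name are all invented for the example.

```python
import math
import random


def add_white_noise(waveform, snr_db, seed=0):
    """Add Gaussian white noise scaled to reach the requested SNR (in dB)."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in waveform) / len(waveform)
    noise = [rng.gauss(0.0, 1.0) for _ in waveform]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(signal_power / noise_power), solved for the noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(waveform, noise)]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=10)
print(len(noisy))  # 16000
```

Overlaying another recording (for example pub noise) would follow the same scaling, with the recorded noise samples in place of the Gaussian ones.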

Answer 52

Contextualized ASR systems fall into two categories: word-level context and utterance-level context. Word-level context aims to enhance the recognition accuracy of rare words, while utterance-level context carries more sophisticated information such as topic and logical relationships.

Answer 53

1) You optimize jointly for lexical and display forms, so it is simpler and likely more accurate. 2) You can leverage entity-rich human caption data, of which there is a lot.

(84 cards)