Misc. Exam Questions Flashcards
Outline the function of the basilar membrane in terms of human hearing.
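The basilar membrane acts as a frequency analyser. It vibrates at the frequencies present in the input acoustic wave (for example, the formants), and each frequency produces its maximum response at the place along the membrane associated with that frequency: high frequencies near the base, low frequencies near the apex.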
Outline when you would use a narrowband or wideband spectrogram to study a speech sample, with direct reference to the different acoustic-phonetic characteristics apparent in each.
Clearly define suitable window durations for their extraction, explaining why those window lengths are appropriate.
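- Narrowband (window of about 20-40 ms): used to emphasise frequency detail. Vowels have strong harmonic content, and the fine frequency resolution of a long window means that the harmonic structure of the vocal fold vibration can be seen as horizontal stripes.
- Wideband (window of about 5 ms): used to emphasise temporal changes. The window is comparable to or shorter than a pitch period, so individual harmonics are smeared together and the formant frequencies show up as broad bands, which makes them easy to find.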
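The trade-off between the two window lengths can be checked with simple arithmetic. The sketch below assumes a 16 kHz sampling rate (not stated in the question) and uses the rule of thumb that frequency resolution is roughly one over the window duration; `window_resolution` is a hypothetical helper name.

```python
def window_resolution(window_ms, sample_rate_hz=16000):
    """Return (samples per window, approximate frequency resolution in Hz)."""
    window_s = window_ms / 1000.0
    n_samples = round(window_s * sample_rate_hz)
    # Rule of thumb: frequency resolution is roughly 1 / window duration.
    freq_resolution_hz = 1.0 / window_s
    return n_samples, freq_resolution_hz


for label, window_ms in [("narrowband", 30), ("wideband", 5)]:
    n, df = window_resolution(window_ms)
    print(f"{label}: {window_ms} ms = {n} samples, about {df:.0f} Hz resolution")
```

With harmonics spaced at F0 (roughly 100-250 Hz), a resolution of about 33 Hz separates them, while a resolution of about 200 Hz does not, which is why only the narrowband spectrogram shows the horizontal harmonic stripes.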
To build speaker models, they plan to use 2 minutes of data from each speaker to build each speaker-specific GMM using the EM algorithm, leaving the remainder for testing.
What are the shortcomings in this approach and what strategy would you instead recommend?
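- A GMM trained with EM treats the feature frames as independent of each other (and, with diagonal covariances, the feature dimensions as uncorrelated), which is not necessarily the case in real life. Two minutes of data per speaker is also not enough to robustly estimate all the parameters of a speaker-specific GMM from scratch.
- Use a GMM-UBM (train a universal background model on data pooled from many speakers, then adapt it to each speaker) or an end-to-end model (a deep learning model, but only if there is enough data).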
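A minimal sketch of the GMM-UBM idea, assuming one-dimensional features and means-only MAP adaptation: components that see a lot of speaker data move towards it, and components that see little stay at the background model. The function names, the toy UBM and the relevance factor of 16 are assumptions made for illustration, not part of the exam answer.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Adapt UBM means towards one speaker's frames (means-only MAP)."""
    k = len(means)
    soft_count = [0.0] * k    # posterior-weighted number of frames
    weighted_sum = [0.0] * k  # posterior-weighted sum of frames
    for x in frames:
        likes = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                 for c in range(k)]
        total = sum(likes)
        for c in range(k):
            post = likes[c] / total
            soft_count[c] += post
            weighted_sum[c] += post * x
    adapted = []
    for c in range(k):
        if soft_count[c] == 0.0:
            adapted.append(means[c])  # no evidence: keep the UBM mean
            continue
        alpha = soft_count[c] / (soft_count[c] + relevance)
        speaker_mean = weighted_sum[c] / soft_count[c]
        adapted.append(alpha * speaker_mean + (1 - alpha) * means[c])
    return adapted


# Toy UBM with two components; all speaker frames sit near the second one.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
speaker_means = map_adapt_means([3.0] * 100, ubm_weights, ubm_means, ubm_vars)
print(speaker_means)
```

Because only the means are shifted, far fewer quantities have to be estimated from the two minutes of speaker data than when training a full GMM from scratch.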
Explain how zero crossing rate allows you to classify speech into voiced and unvoiced speech. Comment on the accuracy of such an approach.
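Unvoiced speech is noise-like, with most of its energy above about 1.5 kHz, so the waveform crosses zero often (high zero crossing rate).
Voiced speech has most of its energy below about 1.5 kHz, so the waveform crosses zero less often (low zero crossing rate).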
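A minimal sketch of a zero-crossing-rate classifier. Two pure tones stand in for voiced (low-frequency) and unvoiced (high-frequency) frames, and the threshold of 0.2 is a hypothetical value chosen for this toy example only.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def classify_frame(frame, threshold=0.2):
    # Hypothetical threshold; a real system would tune it on labelled data.
    return "unvoiced" if zero_crossing_rate(frame) > threshold else "voiced"


fs = 8000
n = 160  # 20 ms frame at 8 kHz
low_tone = [math.sin(2 * math.pi * 150 * (i + 0.5) / fs) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * (i + 0.5) / fs) for i in range(n)]

print(classify_frame(low_tone), zero_crossing_rate(low_tone))
print(classify_frame(high_tone), zero_crossing_rate(high_tone))
```

Real speech is less clean than these tones: background noise, DC offset and mixed-excitation sounds all shift the rate, so a single threshold on the zero crossing rate is only a rough classifier.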
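In an HMM, explain what the ‘hidden’ element in the model is. What does it typically correspond to in a speech recognition system?
The ‘hidden’ element of an HMM is the sequence of internal states, which you don’t directly observe. You only see the outputs they generate. In a speech recognition system the hidden states typically correspond to the linguistic units being spoken, such as phonemes or words, and the observations are the acoustic features.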
You hear someone speaking behind a curtain.
You can hear the sound (observations) but you can’t see what words they’re saying (hidden states).
Your job is to guess the word based on the sounds.
With reference to normal physiological ageing, outline the typical characteristics of pitch and formant frequencies in speech from a 75 year old female in contrast to the same female speaker at 35 years old.
- The vocal cords get weaker as one ages.
- This means that the pitch gets lower as one ages.
Explain the mechanisms for speech production.
Explain the mechanism for speech perception.
What are the three problems + solutions for HMMs?
- Evaluation: compute the likelihood P(X|λ) of an observation sequence X given a specific HMM λ; forward algorithm.
- Decoding: find the most likely sequence of hidden states that produced the given observation sequence; Viterbi algorithm.
- Learning: adjust the HMM parameters to best fit the observed training data; Baum-Welch algorithm (an instance of the EM algorithm).
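The first two problems can be sketched on a toy discrete HMM. The two-state model below (initial probabilities `pi`, transition matrix `A`, emission matrix `B`) is a made-up example, and the function names are hypothetical. Baum-Welch is omitted to keep the sketch short.

```python
def forward_likelihood(obs, pi, A, B):
    """Problem 1: P(X | lambda) via the forward algorithm."""
    n_states = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        # Sum over all previous states, then apply the emission probability.
        alpha = [
            sum(alpha[p] * A[p][s] for p in range(n_states)) * B[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


def viterbi_path(obs, pi, A, B):
    """Problem 2: most likely hidden state sequence via Viterbi."""
    n_states = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        # Same recursion as forward, but with max in place of sum.
        best_prev = [
            max(range(n_states), key=lambda p: delta[p] * A[p][s])
            for s in range(n_states)
        ]
        delta = [
            delta[best_prev[s]] * A[best_prev[s]][s] * B[s][o]
            for s in range(n_states)
        ]
        backpointers.append(best_prev)
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    return path[::-1]


# Toy model: 2 hidden states, 3 observation symbols (values are illustrative).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
observations = [0, 1, 2]

print(forward_likelihood(observations, pi, A, B))
print(viterbi_path(observations, pi, A, B))
```

The two functions share the same recursion; the forward algorithm sums over previous states while Viterbi takes the maximum and remembers which state it came from.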
What are N-gram models?
Three key properties of CNNs that help ASR?
- Locality
- Weight-sharing
- Pooling, frequency-sharing
What does CTC loss allow for?
Explain SNR Gain in AVSR.
- AVEC performed worst but achieved the highest SNR gain.
- Auto-AVSR performed the best but achieved the worst SNR gain.
How was LRS recorded?
Typical gap time in between speech?
200 ms
Explain the components in the source-filter model.
- Source - excitation; periodic vocal fold vibration for voiced sounds, noise for unvoiced sounds.
- Filter - vocal tract; resonant shaping of the raw source sound.
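A minimal sketch of the source-filter idea for a voiced sound: an impulse train (source) is passed through a single two-pole resonator standing in for one formant (filter). The 100 Hz pitch, 500 Hz formant and 100 Hz bandwidth are illustrative assumptions, and a real vocal tract has several resonances.

```python
import math


def impulse_train(f0_hz, fs_hz, n_samples):
    """Voiced source: one unit impulse per pitch period."""
    period = round(fs_hz / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(x, centre_hz, bandwidth_hz, fs_hz):
    """Filter: a two-pole resonator standing in for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius, below 1
    theta = 2 * math.pi * centre_hz / fs_hz        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y


fs = 8000
source = impulse_train(100, fs, 400)
speech_like = resonator(source, centre_hz=500, bandwidth_hz=100, fs_hz=fs)
print(speech_like[:5])
```

Swapping the impulse train for random noise would give the unvoiced case with the same filter.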
What are the two goals of a feature extraction method?
- Classification; leave only task-relevant information
- Coding; leave only perceptually important information.
Why and how can we boost high frequency components when calculating MFCCs?
Why: speech has less energy at higher frequencies, which can cause numerical problems in implementation.
How: apply a pre-emphasis filter, which is a first-order time-domain FIR filter of the form y[n] = x[n] - a x[n-1].
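A minimal sketch of such a filter, assuming the common first-order pre-emphasis form y[n] = x[n] - a x[n-1]. The coefficient 0.97 is a typical textbook choice rather than a value given in the notes.

```python
def pre_emphasis(x, a=0.97):
    """First-order FIR high-pass: y[n] = x[n] - a * x[n-1], with x[-1] = 0."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# A constant (0 Hz) signal is strongly attenuated...
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
# ...while a signal alternating at the Nyquist frequency is boosted.
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```

The filter has gain 1 - a at 0 Hz and 1 + a at the Nyquist frequency, so it tilts the spectrum upwards before the mel filterbank is applied.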
When to use MFCCs vs. learned features?
- MFCCs: good for lightweight models (for example, they can be used to train GMMs).
- Learned features: good for end-to-end deep learning and applications with large amounts of data.
What is the use of cepstral coefficients?
- The cepstrum is good for extracting the envelope (vocal tract resonances) and F0-information.
- Keeping only the low-order coefficients keeps the filter (vocal tract envelope) and discards most of the source.
How can you use the autocorrelation function for pitch tracking?
- Get the ACF of the speech signal.
- The largest peak away from zero lag corresponds to the pitch period, and the fundamental frequency is the sampling rate divided by that lag.
- For voiced speech the ACF has a set of diminishing peaks at multiples of the pitch period, so the pitch can also be found from the time between peaks.
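The steps above can be sketched as follows. The search range of 50-400 Hz and the synthetic two-harmonic test frame are assumptions for illustration, and `estimate_f0` is a hypothetical helper name.

```python
import math


def autocorrelation(x, lag):
    """Autocorrelation of x at one lag (no normalisation)."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def estimate_f0(x, fs_hz, f0_min_hz=50.0, f0_max_hz=400.0):
    """Pick the lag with the largest ACF inside a plausible pitch range."""
    min_lag = int(fs_hz / f0_max_hz)
    max_lag = int(fs_hz / f0_min_hz)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda k: autocorrelation(x, k))
    return fs_hz / best_lag


fs = 8000
# 100 ms of a synthetic voiced-like frame: 100 Hz fundamental plus a harmonic.
frame = [
    math.sin(2 * math.pi * 100 * n / fs) + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
    for n in range(800)
]
print(estimate_f0(frame, fs))
```

Restricting the search to a plausible lag range excludes the trivial peak at lag zero, which is always the largest value of the ACF.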
What is the use of the cepstrum, more specifically MFCCs?
- Mel Frequency Cepstral Coefficients
- Ubiquitous
- Features are decorrelated, meaning a diagonal covariance can be used when they are modelled with a GMM.