Misc. Exam Questions Flashcards

1
Q

Outline the function of the basilar membrane in terms of human hearing.

A

The basilar membrane vibrates in response to the frequencies present in the incoming acoustic wave (for example, formants). Each frequency produces its peak displacement at a particular place along the membrane: high frequencies near the base and low frequencies near the apex.

2
Q

Outline when you would use a narrowband or wideband spectrogram to study a speech sample, with direct reference to the different acoustic-phonetic characteristics apparent in each.

Clearly define suitable window durations for their extraction, explaining why those window lengths are appropriate.

A
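  1. Narrowband (20-40 ms window): used to emphasise frequency detail. The long window spans several pitch periods, giving fine frequency resolution, so the harmonic structure of vocal fold vibration in vowels appears as horizontal stripes.
  2. Wideband (about 5 ms window): used to emphasise temporal detail. The short window is comparable to or shorter than a typical pitch period, giving fine time resolution but coarse frequency resolution, so the harmonics blur into broad formant bands. Good for finding the formant frequencies.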
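The trade-off between the two window lengths can be checked with a little arithmetic. The sketch below assumes a 16 kHz sample rate and uses the rule of thumb that frequency resolution is roughly the reciprocal of the window duration; the function name `window_properties` is illustrative only.

```python
def window_properties(duration_ms, sample_rate_hz=16000):
    """Return (samples per window, approximate frequency resolution in Hz)."""
    duration_s = duration_ms / 1000.0
    n_samples = round(duration_s * sample_rate_hz)
    # Rule of thumb: resolution is about the reciprocal of the window duration.
    resolution_hz = 1.0 / duration_s
    return n_samples, resolution_hz

narrow = window_properties(30)  # narrowband: long window, about 33 Hz resolution
wide = window_properties(5)     # wideband: short window, about 200 Hz resolution

print("narrowband:", narrow)
print("wideband:", wide)
```

For a speaker with an F0 of, say, 120 Hz (an assumed value), harmonics 120 Hz apart are resolved at about 33 Hz resolution but merge at about 200 Hz resolution, which is why only the narrowband spectrogram shows harmonic stripes.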
3
Q

To build speaker models, they plan to use 2 minutes of data from each speaker to build each speaker-specific GMM using the EM algorithm, leaving the remainder for testing.

What are the shortcomings in this approach and what strategy would you instead recommend?

A
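  1. Two minutes of speech per speaker is not enough data to robustly estimate all the parameters of a GMM trained from scratch with EM, so the models are likely to overfit. GMMs with diagonal covariances also assume the features are uncorrelated within each component, which is not necessarily the case in real life.
  2. Use a GMM-UBM (train a universal background model on data pooled across speakers, then adapt it to each speaker) or an end-to-end deep learning model, but only if there is enough data.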
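As a rough illustration of the GMM-UBM idea, the sketch below MAP-adapts only the means of a one-dimensional UBM toward a small amount of speaker data. The function names, the toy UBM, and the relevance factor of 16 are assumptions made for the example, not details from the card.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """MAP-adapt UBM means toward speaker data (means only, 1-D)."""
    k = len(means)
    soft_counts = [0.0] * k   # n_j: how much data each component explains
    weighted_sums = [0.0] * k # sum of posterior-weighted samples
    for x in data:
        likes = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                 for j in range(k)]
        total = sum(likes)
        for j in range(k):
            post = likes[j] / total
            soft_counts[j] += post
            weighted_sums[j] += post * x
    adapted = []
    for j in range(k):
        # Components that saw little data stay close to the UBM mean.
        alpha = soft_counts[j] / (soft_counts[j] + relevance)
        data_mean = (weighted_sums[j] / soft_counts[j]
                     if soft_counts[j] > 0 else means[j])
        adapted.append(alpha * data_mean + (1 - alpha) * means[j])
    return adapted

# Toy UBM with two components; the speaker's data sits near 3.0.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
speaker_data = [3.0] * 50
adapted = map_adapt_means(speaker_data, ubm_weights, ubm_means, ubm_vars)
print(adapted)
```

The component that explains the speaker's data moves most of the way toward it, while the other component barely changes, which is what makes adaptation usable with only a couple of minutes of speech.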
4
Q

Explain how zero crossing rate allows you to classify speech into voiced and unvoiced speech. Comment on the accuracy of such an approach.

A

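Unvoiced speech is noise-like, with most of its energy above about 1.5 kHz, so the waveform crosses zero often (high zero crossing rate).
Voiced speech has most of its energy below that, so the waveform crosses zero less often (low zero crossing rate).
Thresholding the zero crossing rate of each frame therefore gives a rough voiced/unvoiced decision. It is only approximate: background noise and sounds with mixed excitation (such as voiced fricatives) blur the distinction, so it is usually combined with short-time energy.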
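A minimal sketch of the idea, using synthetic stand-ins rather than real speech: a 150 Hz sinusoid plays the role of a voiced frame and white noise the role of an unvoiced frame. The 16 kHz sample rate, the 30 ms frame, and the 0.1 threshold are all assumed values for illustration.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

SAMPLE_RATE = 16000  # Hz, assumed
FRAME_LEN = 480      # one 30 ms frame

# Voiced stand-in: low-frequency periodic signal.
voiced = [math.sin(2 * math.pi * 150 * n / SAMPLE_RATE)
          for n in range(FRAME_LEN)]

# Unvoiced stand-in: broadband white noise (seeded for repeatability).
rng = random.Random(0)
unvoiced = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)]

zcr_voiced = zero_crossing_rate(voiced)
zcr_unvoiced = zero_crossing_rate(unvoiced)

THRESHOLD = 0.1  # hypothetical decision threshold
print("voiced frame classified as voiced:", zcr_voiced < THRESHOLD)
print("unvoiced frame classified as voiced:", zcr_unvoiced < THRESHOLD)
```

On real speech the two distributions overlap, so a fixed threshold like this misclassifies some frames.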

5
Q

In an HMM, explain what the ‘hidden’ element of the model is. What does it typically correspond to in a speech recognition system?

A

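The ‘hidden’ element of an HMM is the sequence of internal states, which you do not observe directly. You only see the outputs they generate. In a speech recognition system the states typically correspond to phonemes or sub-phoneme units, and the observations are acoustic feature vectors.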

You hear someone speaking behind a curtain.
You can hear the sound (observations) but you can’t see what words they’re saying (hidden states).
Your job is to guess the word based on the sounds.

6
Q

With reference to normal physiological ageing, outline the typical characteristics of pitch and formant frequencies in speech from a 75 year old female in contrast to the same female speaker at 35 years old.

A
  1. The vocal folds get weaker as one ages.
  2. As a result, the pitch (F0) is typically lower at 75 than at 35.
7
Q

Explain the mechanisms for speech production.

A
8
Q

Explain the mechanism for speech perception.

A
9
Q

What are the three problems + solutions for HMMs?

A
  1. Evaluation: compute the probability of an observation sequence given a specific HMM, P(X|λ); Forward algorithm.
  2. Decoding: find the most likely sequence of hidden states that produced the given observation sequence; Viterbi algorithm.
  3. Learning: adjust the HMM parameters to best fit the observed training data; Baum-Welch algorithm (an instance of the EM algorithm).
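The first two problems can be sketched on a toy discrete HMM. The two-state model below (initial probabilities `pi`, transitions `A`, emissions `B`) and the observation sequence are made-up values for illustration; a practical implementation would work in the log domain to avoid underflow.

```python
def forward(obs, pi, A, B):
    """Likelihood P(X | lambda) via the forward algorithm."""
    n_states = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * A[r][s] for r in range(n_states)) * B[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence via the Viterbi algorithm."""
    n_states = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        ptrs, new_delta = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s.
            best = max(range(n_states), key=lambda r: delta[r] * A[r][s])
            ptrs.append(best)
            new_delta.append(delta[best] * A[best][s] * B[s][o])
        backpointers.append(ptrs)
        delta = new_delta
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptrs in reversed(backpointers):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Toy model (assumed values): 2 hidden states, 2 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1]

likelihood = forward(obs, pi, A, B)
best_path = viterbi(obs, pi, A, B)
print(likelihood, best_path)
```

The forward pass sums over all predecessor states while Viterbi takes the maximum, which is the only structural difference between the two recursions.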
10
Q

What are N-gram models?

A
11
Q

Three key properties of CNNs that help ASR?

A
  1. Locality
  2. Weight-sharing
  3. Pooling, frequency-sharing
12
Q

What does CTC loss allow for?

A
13
Q

Explain SNR Gain in AVSR.

A
  1. AVEC performed the worst but achieved the highest SNR gain.
  2. Auto-AVSR performed the best but achieved the worst SNR gain.
14
Q

How was LRS recorded?

A
15
Q

Typical gap time in between speech?

A

200 ms

16
Q

Explain the components in the source-filter model.

A
  1. Source - excitation; voiced, unvoiced sounds.
  2. Filter - vocal tract; resonant shaping of the raw sound by vocal tract.
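The two components can be sketched as a signal chain: a periodic impulse train stands in for voiced excitation, and a single two-pole resonator stands in for one vocal tract resonance. The 8 kHz sample rate, 100 Hz pitch, and 500 Hz formant with 100 Hz bandwidth are assumed values, and a real vocal tract would need several resonances.

```python
import math

def impulse_train(n_samples, period):
    """Source: one unit impulse every `period` samples (voiced excitation)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole IIR resonator standing in for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1
    theta = 2 * math.pi * centre_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    out = []
    for n, x in enumerate(signal):
        y1 = out[n - 1] if n >= 1 else 0.0
        y2 = out[n - 2] if n >= 2 else 0.0
        out.append(x + a1 * y1 + a2 * y2)
    return out

SAMPLE_RATE = 8000  # Hz, assumed
source = impulse_train(400, period=80)            # 100 Hz pitch
speech = resonator(source, 500, 100, SAMPLE_RATE)  # one formant near 500 Hz
print(len(speech), speech[:3])
```

Swapping the impulse train for white noise would give the unvoiced case with the same filter.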
17
Q

What are the two goals of a feature extraction method?

A
  1. Classification; leave only task-relevant information
  2. Coding; leave only perceptually important information.
18
Q

Why and how can we boost high frequency components when calculating MFCCs?

A

Why; speech has less energy at higher frequencies, which can cause numerical problems in implementation.

How; apply a pre-emphasis filter, a first-order time-domain FIR filter of the form y[n] = x[n] - a x[n-1].
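A common choice for this time-domain FIR filter is first-order pre-emphasis, y[n] = x[n] - a x[n-1]. The sketch below assumes a coefficient of 0.97, which is a typical value rather than one given on the card.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    out = [signal[0]]  # no previous sample for the first output
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

# A constant (lowest-frequency) signal is almost cancelled...
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# ...while a sample-to-sample alternation (highest frequency) is boosted.
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(flat)
print(alternating)
```

After the first sample, the constant input is attenuated to about 0.03 while the alternating input grows to about 1.97 in magnitude, showing the high-frequency boost.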

19
Q

When to use MFCCs vs. learned features?

A
  1. MFCCs: good for lightweight models (for example, they can be used to train GMMs).
  2. Learned features: good for end-to-end deep learning and applications with large amounts of data.
20
Q

What is the use of cepstral coefficients?

A
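  1. The cepstrum is good for separating the spectral envelope (vocal tract resonances) from the F0 information of the source.
  2. Keeping only the low-order coefficients keeps the filter (vocal tract) and discards the source.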
21
Q

How can you use the autocorrelation function for pitch tracking?

A
  1. Compute the ACF of a frame of the speech signal.
  2. For voiced speech the ACF shows a set of diminishing peaks at multiples of the pitch period.
  3. The largest peak away from lag zero gives the pitch period; the fundamental frequency is the sample rate divided by that lag.
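The steps above can be sketched on a synthetic frame. The signal here is a 100 Hz tone plus one harmonic standing in for voiced speech; the 8 kHz sample rate and the 50-400 Hz search range are assumed values, and the function names are illustrative.

```python
import math

def autocorrelation(frame, max_lag):
    """Unnormalised ACF for lags 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Pick the largest ACF peak within the plausible pitch-lag range."""
    min_lag = int(sample_rate / f0_max)  # shortest pitch period to consider
    max_lag = int(sample_rate / f0_min)  # longest pitch period to consider
    acf = autocorrelation(frame, max_lag)
    # Skipping lags below min_lag excludes the trivial peak at lag zero.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # Hz, assumed
frame = [
    math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    for n in range(480)  # 60 ms frame
]

f0 = estimate_f0(frame, SAMPLE_RATE)
print(f0)
```

Restricting the lag search range is a design choice that also reduces octave errors, since the peak at twice the pitch period is smaller in this unnormalised estimate.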
22
Q

What is the use of the cepstrum, and more specifically MFCCs?

A
  1. MFCC stands for Mel Frequency Cepstral Coefficients.
  2. They are ubiquitous as speech features.
  3. The features are approximately decorrelated, so a diagonal covariance can be used when modelling them with a GMM.