Misc. Exam Questions Flashcards

Question 1

Q

Outline the function of the basilar membrane in terms of human hearing.

Answer

A

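The basilar membrane vibrates in response to incoming sound. Each frequency component of the acoustic wave (e.g. a formant) produces maximum vibration at a specific place along the membrane, with high frequencies near the base and low frequencies near the apex, so the membrane acts as a frequency analyser.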

Question 2

Q

Outline when you would use a narrowband or wideband spectrogram to study a speech sample, with direct reference to the different acoustic-phonetic characteristics apparent in each.

Clearly define suitable window durations for their extraction, explaining why those window lengths are appropriate.

Answer

A

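Narrowband: long window (20-40 ms), which spans several pitch periods and gives good frequency resolution. Use it to study pitch and harmonic structure: vowels have strong harmonic content, and the harmonics of the vocal fold vibration appear as horizontal stripes.
Wideband: short window (about 5 ms), comparable to or shorter than a typical pitch period, which gives good temporal resolution. Use it to study rapid changes over time and to find the formant frequencies, which appear as broad bands because the individual harmonics are smeared together.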
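
The link between window length and what is visible can be checked with rough arithmetic. The sketch below assumes the common rule of thumb that frequency resolution is about 1 / window duration (it ignores the window shape); the function name is made up for illustration.

```python
def approx_resolution_hz(window_ms):
    """Rough frequency resolution (Hz) of an analysis window: about 1 / duration."""
    return 1000.0 / window_ms

# Compare a typical narrowband window with a typical wideband window.
for label, window_ms in [("narrowband", 30.0), ("wideband", 5.0)]:
    resolution = approx_resolution_hz(window_ms)
    print(f"{label}: {window_ms} ms window -> ~{resolution:.0f} Hz resolution")
```

With harmonics spaced at F0 (roughly 100-250 Hz), a resolution of about 33 Hz separates them, while about 200 Hz blurs them into the formant envelope.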

Question 3

Q

To build speaker models, they plan to use 2 minutes of data from each speaker to build each speaker-specific GMM using the EM algorithm, leaving the remainder for testing.

What are the shortcomings in this approach and what strategy would you instead recommend?

Answer

A

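Two minutes of speech per speaker is not enough data to robustly estimate all the parameters (weights, means, covariances) of a speaker-specific GMM trained from scratch, so the models are likely to overfit. EM also only converges to a local optimum, and diagonal-covariance GMMs assume the feature dimensions are uncorrelated, which is not necessarily the case in real life.
Use a GMM-UBM instead: train a universal background model on pooled data from many speakers, then adapt it to each speaker (e.g. with MAP adaptation) using the 2 minutes of data. Alternatively use an end-to-end model (a deep learning model), but only if there is enough data.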

Question 4

Q

Explain how zero crossing rate allows you to classify speech into voiced and unvoiced speech. Comment on the accuracy of such an approach.

Answer

A

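Zero crossing rate (ZCR) counts how often the waveform changes sign within a frame, which roughly indicates where the signal's energy lies in frequency.
Unvoiced speech is noise-like with most of its energy above about 1.5 kHz, so it has a high ZCR.
Voiced speech has most of its energy below that, so it has a low ZCR.
Accuracy is limited: ZCR is sensitive to background noise and DC offset, and some sounds (e.g. voiced fricatives) are ambiguous, so it is usually combined with short-time energy.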
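
A minimal sketch of the idea, using pure tones as stand-ins for speech: a 100 Hz tone for a voiced frame and a 3 kHz tone for an unvoiced frame. The sample rate, frame length and test signals are assumptions chosen for illustration, not real speech.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

FS = 8000  # assumed sample rate in Hz
N = 240    # 30 ms frame at 8 kHz

# Toy stand-ins: low-frequency tone (voiced-like), high-frequency tone (unvoiced-like).
voiced_like = [math.sin(2 * math.pi * 100 * n / FS + 0.1) for n in range(N)]
unvoiced_like = [math.sin(2 * math.pi * 3000 * n / FS + 0.1) for n in range(N)]

zcr_voiced = zero_crossing_rate(voiced_like)
zcr_unvoiced = zero_crossing_rate(unvoiced_like)
print(f"voiced-like ZCR:   {zcr_voiced:.3f}")
print(f"unvoiced-like ZCR: {zcr_unvoiced:.3f}")
```

A classifier would compare the ZCR of each frame against a threshold; real speech is far less clean than these tones, which is why the threshold is hard to set reliably.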

Question 5

Q

In a HMM, explain what is the ‘hidden’ element in the model? What does it typically correspond to in a speech recognition system?

Answer

A

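The ‘hidden’ element of an HMM is the sequence of internal states, which you cannot observe directly. You only see the outputs (observations) they generate. In a speech recognition system the observations are acoustic feature vectors, and the hidden states typically correspond to phones or sub-phone units that make up the words.

Analogy: you hear someone speaking behind a curtain.
You can hear the sound (observations) but you can’t see what words they’re saying (hidden states).
Your job is to guess the words based on the sounds.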

Question 6

Q

With reference to normal physiological ageing, outline the typical characteristics of pitch and formant frequencies in speech from a 75 year old female in contrast to the same female speaker at 35 years old.

Answer

A

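The vocal cords (vocal folds) weaken and change as one ages.
For a female speaker this typically means that the pitch (F0) is lower at 75 than at 35.
Formant frequencies also tend to be somewhat lower in the older voice.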

Question 7

Q

Explain the mechanisms for speech production.

Question 8

Q

Explain the mechanism for speech perception.

Question 9

Q

What are the three problems + solutions for HMMs?

Answer

A

Evaluation: compute the likelihood P(X|λ) of an observation sequence X given a specific HMM λ. Solved with the Forward algorithm.

Decoding: find the most likely sequence of hidden states that produced the given observation sequence. Solved with the Viterbi algorithm.

Learning: adjust the HMM parameters to best fit the observed training data. Solved with the Baum-Welch algorithm (an instance of EM).
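
The first two problems can be sketched on a toy model. The 2-state HMM below, including every probability in it, is invented for illustration; the learning problem (Baum-Welch) is not shown.

```python
# Toy 2-state HMM with 2 observation symbols (all numbers are made up).
start = [0.6, 0.4]                  # start[i] = P(first state = i)
trans = [[0.7, 0.3], [0.4, 0.6]]    # trans[i][j] = P(next = j | current = i)
emit = [[0.9, 0.1], [0.2, 0.8]]     # emit[i][o] = P(observation = o | state = i)
obs = [0, 1, 1]

def forward(obs):
    """Evaluation: P(obs | model), summed over all state paths."""
    alpha = [start[i] * emit[i][obs[0]] for i in range(2)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(2)) * emit[j][o]
            for j in range(2)
        ]
    return sum(alpha)

def viterbi(obs):
    """Decoding: most likely state path and the probability of that path."""
    delta = [start[i] * emit[i][obs[0]] for i in range(2)]
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for j in range(2):
            # Best predecessor state for arriving in state j.
            best_i = max(range(2), key=lambda i: delta[i] * trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * trans[best_i][j] * emit[j][o])
        delta = new_delta
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(range(2), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

likelihood = forward(obs)
best_path, best_prob = viterbi(obs)
print(f"P(obs | model) = {likelihood:.5f}")
print(f"best path = {best_path}, path probability = {best_prob:.6f}")
```

For this toy model the likelihood is about 0.100 and the best path is [0, 1, 1]. The Viterbi probability is smaller than the forward likelihood because it keeps only the single best path instead of summing over all of them.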

Question 10

Q

What are N-gram models?

Question 11

Q

Three key properties of CNNs that help ASR?

Answer

A

Locality
Weight-sharing
Pooling, frequency-sharing

Question 12

Q

What does CTC loss allow for?

Question 13

Q

Explain SNR Gain in AVSR.

Answer

A

AVEC performed the worst but achieved the highest SNR gain.
Auto-AVSR performed the best but achieved the worst SNR gain.

Question 14

Q

How was LRS recorded?

Question 15

Q

Typical gap time in between speech?

Question 16

Q

Explain the components in the source-filter model.

Answer

A

Source - excitation; voiced or unvoiced sounds.
Filter - vocal tract; resonant shaping of the raw sound by vocal tract.

Question 17

Q

What are the two goals of a feature extraction method?

Answer

A

Classification; leave only task-relevant information
Coding; leave only perceptually important information.

Question 18

Q

Why and how can we boost high frequency components when calculating MFCCs?

Answer

A

Why; speech has less energy at higher frequencies, which can cause numerical problems in implementation.

How; pre-emphasis with a time domain FIR filter, e.g. y[n] = x[n] - a*x[n-1] with a close to 1.
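
A minimal sketch of such a filter, assuming the common first-order form y[n] = x[n] - alpha * x[n-1]. The coefficient 0.97 is a typical textbook choice, not a value given in these notes.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]

# A constant signal is the lowest possible frequency; an alternating
# signal is the highest frequency representable at this sample rate.
slow = pre_emphasis([1.0, 1.0, 1.0, 1.0, 1.0])
fast = pre_emphasis([1.0, -1.0, 1.0, -1.0, 1.0])
print("constant input    ->", [round(v, 2) for v in slow])
print("alternating input ->", [round(v, 2) for v in fast])
```

After the first sample, the constant input shrinks to about 0.03 while the alternating input grows to about 1.97 in magnitude, which is the high-frequency boost.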

Question 19

Q

When to use MFCCs vs. learned features?

Answer

A

Use MFCCs for lightweight models (e.g. as input features for training GMMs).
Use learned features for end-to-end deep learning and applications with large amounts of data.

Question 20

Q

What is the use of cepstral coefficients?

Answer

A

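The cepstrum separates the source from the filter: the low-quefrency coefficients describe the spectral envelope (vocal tract resonances), while a peak at higher quefrency carries the F0 information.
For recognition features, keep the low-order coefficients (the filter) and discard the rest.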
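
As a rough illustration of the F0 side, the sketch below computes a real cepstrum (inverse DFT of the log magnitude spectrum) of a synthetic frame and looks for the pitch peak. The frame is an exactly periodic decaying pulse train, so it is a toy case; the sample rate, period and search range are assumptions, and the naive DFT is only suitable for short demo frames.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def real_cepstrum(frame, eps=1e-8):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(c) + eps) for c in dft(frame)]
    n = len(log_mag)
    # log_mag is real and symmetric, so the forward DFT divided by n
    # gives the same result as the inverse DFT.
    return [c.real / n for c in dft(log_mag)]

FS = 8000    # assumed sample rate in Hz
N = 256      # frame length in samples
PERIOD = 64  # pitch period in samples (125 Hz at 8 kHz)

# Toy voiced frame: a decaying pulse that repeats every PERIOD samples.
frame = [0.9 ** (t % PERIOD) for t in range(N)]

ceps = real_cepstrum(frame)
# Search for the pitch peak away from the low-quefrency envelope region.
peak_lag = max(range(32, 96), key=lambda q: ceps[q])
f0 = FS / peak_lag
print(f"cepstral peak at quefrency {peak_lag} samples -> F0 = {f0:.1f} Hz")
```

Keeping only the first few coefficients of `ceps` would retain the envelope part and drop this pitch peak, which is the sense in which the filter is kept.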

Question 21

Q

How can you use the autocorrelation function for pitch tracking?

Answer

A

Compute the ACF of a short frame of the speech signal.
For voiced speech the ACF has a set of diminishing peaks at multiples of the pitch period.
The largest peak away from lag zero gives the pitch period T0 (the time between peaks), and the fundamental frequency is F0 = 1/T0.
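
The procedure can be sketched as follows. The synthetic frame (a 125 Hz tone with two harmonics), the sample rate and the 60-400 Hz search range are assumptions for illustration; real pitch trackers add voicing decisions and smoothing across frames.

```python
import math

def autocorrelation(frame, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k))
        for k in range(max_lag + 1)
    ]

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Pick the largest ACF peak inside the plausible pitch-lag range."""
    min_lag = int(fs / f0_max)
    max_lag = int(fs / f0_min)
    acf = autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda k: acf[k])
    return fs / best_lag

FS = 8000  # assumed sample rate in Hz
N = 320    # 40 ms frame at 8 kHz

# Toy voiced frame: 125 Hz fundamental plus two weaker harmonics.
frame = [
    math.sin(2 * math.pi * 125 * n / FS)
    + 0.5 * math.sin(2 * math.pi * 250 * n / FS)
    + 0.25 * math.sin(2 * math.pi * 375 * n / FS)
    for n in range(N)
]

f0 = estimate_f0(frame, FS)
print(f"estimated F0 = {f0:.1f} Hz")
```

Restricting the search to a lag range skips the trivial maximum at lag zero, and because fewer samples overlap at longer lags, the peak at one pitch period beats the peak at two periods.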

Question 22

Q

What is the use of cepstrum, more specifically MFCC?

Answer

A

MFCC stands for Mel Frequency Cepstral Coefficients.
They are ubiquitous as speech features.
The features are approximately decorrelated, meaning a diagonal covariance can be used when they are modelled with a GMM.