07 - End-2-End Speech Recognition Flashcards

1
Q

Let us start with a reminder. What is an end-2-end system or model?

A

A system or model that takes in an input and directly models an output without relying on intermediate stages or processes.

2
Q

HMM-based systems are normally split into 3 separate models. Which are they?

A

The Acoustic Model (AM), the Language Model (LM) and the Pronunciation Model (PM).

3
Q

HMM is quick and efficient and does not require expert knowledge. True or false?

A

False. HMMs require expert knowledge and are very time consuming to build because of the 3 separate models that underlie the structure.

4
Q

In the acoustic model (that is, the sound-signal input), what are the main issues (there are 3)?

A
  1. The input is of variable length.
  2. The input is often much larger than the output.
  3. We do not know how the input audio features align with the output characters.
5
Q

What are the two main architectures used to solve the ‘alignment problem’ in end-2-end models?

A

CTC (Connectionist Temporal Classification) and ‘seq2seq-attention’ (sequence-to-sequence with attention).

6
Q

CTC was proposed in 2006 as a method for training an acoustic model without requiring segmentation and alignment. True or false?

A

True.

7
Q

Early approaches in speech recognition output phonemes, NOT words. Is it still an end-2-end model when the output is phonemes?

A

No. It is not the complete process: you would have to convert the phonemes into words afterwards, thus relying on an intermediate process.

8
Q

On its own, CTC does not allow for identifying consecutive character repetitions, for example the 2 l’s in the word ‘hello’. What special token does CTC introduce to combat this?

A

The blank (ϵ) token.

9
Q

The CTC mapping makes use of the blank token (ϵ) to handle what?

A

Cases where a word has two consecutive identical letters. Because the input and output can have different lengths, repeated outputs are collapsed into one, and only a blank between them preserves a genuine repetition.
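
As a minimal Python sketch (function and token names here are illustrative, not from any library), the collapsing rule merges consecutive repeats first and then drops blanks, so only a blank between the two l’s preserves the ‘ll’ in ‘hello’:

```python
# Minimal sketch of the CTC collapsing rule (illustrative, not a library API).
EPS = "<eps>"  # the blank token, usually written as epsilon

def ctc_collapse(alignment):
    """Merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev:  # merge consecutive repeats
            out.append(sym)
        prev = sym
    return "".join(s for s in out if s != EPS)  # drop blanks

print(ctc_collapse(["h", "h", "e", "l", EPS, "l", "l", "o", "o"]))  # -> hello
print(ctc_collapse(["h", "h", "e", "l", "l", "l", "o"]))            # -> helo
```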

10
Q

The probability of an alignment in the CTC is what?

A

The product of the probabilities at each time step (not a dot-product), since CTC treats the time steps as conditionally independent.
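
As a toy worked example (the probability values below are made up), the probability of one alignment is just the product across time steps:

```python
import numpy as np

# Toy per-time-step probabilities over the symbols ('h', 'i', '<eps>').
# Rows are time steps; the numbers are illustrative, not from a real model.
probs = np.array([
    [0.8, 0.1, 0.1],  # t = 0
    [0.6, 0.2, 0.2],  # t = 1
    [0.1, 0.7, 0.2],  # t = 2
])
symbols = ["h", "i", "<eps>"]

alignment = ["h", "h", "i"]  # one possible alignment
idx = [symbols.index(s) for s in alignment]
p = np.prod([probs[t, i] for t, i in enumerate(idx)])
print(p)  # 0.8 * 0.6 * 0.7 = 0.336
```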

11
Q

For inference, CTC uses two search mechanisms. These are?

A

Greedy search and beam search.

12
Q

What is the ‘Greedy Search’?

A

This takes the most likely output at each time step. It gives us the single alignment with the highest probability, though not necessarily the most likely output sequence, since several alignments can collapse to the same output.
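
A minimal sketch using the same kind of toy probability matrix (values are illustrative); the collapsing rule from before would then be applied to the resulting alignment:

```python
import numpy as np

def greedy_decode(probs, symbols):
    """Pick the single most likely symbol at every time step."""
    best = probs.argmax(axis=1)  # index of the top symbol per step
    return [symbols[i] for i in best]

probs = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.2, 0.2],
                  [0.1, 0.7, 0.2]])
print(greedy_decode(probs, ["h", "i", "<eps>"]))  # -> ['h', 'h', 'i']
```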

13
Q

What is the ‘Beam Search’?

A

It computes a new set of hypotheses at each input step by extending every current hypothesis with all possible symbols, but keeps only the top candidates (the beam width).
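
A simplified sketch of the idea (hypotheses here are raw alignments scored by probability products; a real CTC beam search additionally merges alignments that collapse to the same output):

```python
import numpy as np

def beam_search(probs, symbols, beam_width=2):
    """Extend every hypothesis with every symbol, keep the top `beam_width`."""
    beams = [([], 1.0)]  # list of (alignment, score) pairs
    for t in range(probs.shape[0]):
        candidates = [
            (seq + [symbols[v]], score * probs[t, v])
            for seq, score in beams
            for v in range(probs.shape[1])
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams

probs = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.2, 0.2],
                  [0.1, 0.7, 0.2]])
for seq, score in beam_search(probs, ["h", "i", "<eps>"]):
    print(seq, round(score, 3))
```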

14
Q

All network outputs of the CTC are conditionally dependent. True or false?

A

False. They are conditionally independent given the input.

15
Q

Is CTC (used alone) a real end-2-end approach?

A

No, because in order to obtain good performance it relies on external language models.

16
Q

The ‘seq-2-seq-attention’ architecture is made of 3 blocks. These are?

A

An encoder
An attention mechanism
A decoder

17
Q

In ‘seq-2-seq-attention’ what is the role of the encoder?

A

The encoder is analogous to an Acoustic Model (AM): it transforms speech features into a higher-level representation.

18
Q

In ‘seq-2-seq-attention’ what is the role of the attention mechanism?

A

This is analogous to an Alignment Model: it chooses the encoded frames that are relevant for producing the next output.
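
As an illustration, a basic dot-product attention (one scoring variant among several; the card does not commit to a specific one) weights the encoded frames by their relevance to the current decoder state:

```python
import numpy as np

def attention(decoder_state, encoder_outputs):
    """Dot-product attention sketch: score frames, softmax, weighted sum."""
    scores = encoder_outputs @ decoder_state  # (T,) relevance per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the frames
    context = weights @ encoder_outputs       # weighted sum -> (D,)
    return context, weights

# Toy shapes: 5 encoded frames of dimension 4 (random, purely illustrative).
rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 4))
context, weights = attention(rng.normal(size=4), encoder_outputs)
print(weights.round(3), context.shape)  # weights sum to 1; context is (4,)
```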

19
Q

In ‘seq-2-seq-attention’ what is the role of the decoder?

A

It is analogous to a Language Model (LM): it predicts each token as a function of the previous predictions and outputs it (after applying a softmax).

20
Q

Often dimensionality can be a problem in the encoder part of ‘seq-2-seq-attention’. It is common to use what NN in the first layers?

A

A Convolutional Neural Network (CNN). After that you might apply BLSTMs (bi-directional LSTMs) to further reduce the encoded feature sequence.
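
A hedged PyTorch sketch of such a front-end (the layer sizes, strides, and feature dimension below are assumptions, not values from the source): two strided convolutions shorten the sequence by 4x before the BLSTM runs over it:

```python
import torch
import torch.nn as nn

class ConvBLSTMEncoder(nn.Module):
    """Hypothetical encoder: a small CNN front-end that subsamples the
    input features, followed by a single BLSTM layer."""

    def __init__(self, n_feats=80, hidden=256):
        super().__init__()
        # Two strided convolutions subsample time and frequency by 4x each.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32 * (n_feats // 4), hidden,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, n_feats)
        x = self.conv(x.unsqueeze(1))            # -> (B, 32, T/4, F/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.blstm(x)                   # -> (B, T/4, 2 * hidden)
        return out

enc = ConvBLSTMEncoder()
print(enc(torch.randn(2, 100, 80)).shape)  # torch.Size([2, 25, 512])
```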

21
Q

The first example of a ‘seq-2-seq-attention’ model was Listen, Attend and Spell (LAS). The output here was characters, NOT words. Was it still an end-2-end model if the desired output was characters?

A

Yes, but if the desired output was words, then no.

22
Q

A problem with ‘seq-2-seq-attention’ is that it is easily affected by noise. True or false?

A

True

23
Q

In ‘seq-2-seq-attention’ it is better to train with longer input sequences first and then shorter input sequences. True or false?

A

False. You want to train with the shorter input sequences first.

24
Q

An encoder-decoder approach can be ‘streamed’. True or false?

A

False. The encoder needs to process the complete input before the decoder can start working.

25
Q

In CTC, seq2seq and hybrid CTC/attention models, which is more common: ‘beam search’ or ‘greedy search’?

A

Beam search.

26
Q

What makes ‘hybrid CTC/attention decoding’ more robust?

A

The CTC part enforces appropriate (monotonic) alignments, which helps in noisy environments. It uses joint decoding during recognition, combining the CTC and attention scores.
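
Joint decoding is usually an interpolation of the two log-probabilities for each hypothesis; a minimal sketch (the weight of 0.3 is illustrative, not a recommendation):

```python
import math

def joint_score(log_p_ctc, log_p_att, lam=0.3):
    """Hybrid score: weigh the CTC (monotonic) score against the
    attention-decoder score. `lam` is a tunable interpolation weight."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

# During beam search, hypotheses would be ranked by this combined score:
print(joint_score(math.log(0.02), math.log(0.05)))
```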

27
Q

What is the ESPnet Toolkit used for?

A

It is dedicated to end-2-end speech processing!

28
Q

What can the RNN-T (RNN Transducer) do that CTC and the other models cannot?

A

It can stream, due to its use of a joint network that combines the encoder output with a prediction network, so it does not need the complete utterance before emitting labels.
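
A hedged PyTorch sketch of such a joint network (all dimensions are made up): it combines the encoder output at the current frame with the prediction network’s state, so labels can be emitted as frames arrive:

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Sketch of an RNN-T joint network (illustrative dimensions)."""

    def __init__(self, enc_dim=256, pred_dim=256, hidden=256, vocab=30):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, hidden)    # encoder frame at time t
        self.pred_proj = nn.Linear(pred_dim, hidden)  # prediction net after u labels
        self.out = nn.Linear(hidden, vocab)           # vocab includes the blank

    def forward(self, enc_t, pred_u):
        # Combine the two states and score every label (plus blank).
        return self.out(torch.tanh(self.enc_proj(enc_t) + self.pred_proj(pred_u)))

joint = JointNetwork()
logits = joint(torch.randn(1, 256), torch.randn(1, 256))
print(logits.shape)  # torch.Size([1, 30])
```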

29
Q

Summary question: End-2-end is the current trend in the ASR field. True or false?

A

True

30
Q

End-2-end systems are better when there are limited resources compared to the ‘conventional’ HMM-DNN systems. True or false?

A

False. End-2-end systems need large amounts of training data, so with limited resources the ‘conventional’ HMM-DNN systems still tend to perform better.