Machine Translation and Encoder-Decoder Models Flashcards

1
Q

What is machine translation?

A

It is the use of computers to translate text from one language to another

2
Q

What machine translation models exist?

A

Statistical phrase-alignment models, encoder-decoder models, and Transformer models

3
Q

What type of task is machine translation?

A

It is a sequence-to-sequence (seq2seq) task

4
Q

What is the input, output and their lengths for a seq2seq task?

A

The input X is a sequence of words and the output Y is a sequence of words, but the length of X need not equal the length of Y

5
Q

Besides machine translation, what are some other seq2seq tasks?

A

Question → Answer

Sentence → Clause

Document → Abstract

6
Q

What do universal aspects mean with regard to human language?

A

These are aspects that are true, or statistically mostly true, across all languages

7
Q

What are some examples of universal aspects in human language?

A

Nouns/verbs, greetings, politeness/rudeness

8
Q

What are translation divergences?

A

These are areas where languages differ

9
Q

What are some examples of translation divergences?

A

Idiosyncrasies and lexical differences

Systematic differences

10
Q

What is the study of translation divergences called?

A

Linguistic Typology

11
Q

What is Word Order Typology?

A

It is the classification of languages by the typical order of subject, verb and object in a clause. Common orders include:

  • Subject-Verb-Object (SVO), e.g. English
  • Subject-Object-Verb (SOV), e.g. Japanese
  • Verb-Subject-Object (VSO), e.g. Classical Arabic

12
Q

What is the Encoder-Decoder model?

A

For an input sequence, we have an encoder that encodes the input into a context vector, which is then passed to a decoder that generates the output sequence.
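
As a rough illustration, a minimal GRU-based sketch of this architecture (PyTorch-style; the module names and sizes are assumptions, not part of the original cards):

    import torch.nn as nn

    class EncoderDecoder(nn.Module):
        """Minimal encoder-decoder sketch: encode to a context vector, decode from it."""
        def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, tgt_vocab)

        def forward(self, src, tgt):
            # Encode the source; the final hidden state acts as the context vector.
            _, context = self.encoder(self.src_emb(src))
            # Decode the target, initialising the decoder with the context vector.
            dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
            return self.out(dec_out)  # per-step scores over the target vocabulary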

13
Q

What can an encoder be?

A

LSTM, GRU, CNN, Transformers

14
Q

What is a context vector?

A

It is the final hidden state of the encoder, which is used as the input to the decoder

15
Q

What does a language model try to do?

A

Predict the next word in a sequence Y based on the previous words

16
Q

How is a translation model different to a language model?

A

It predicts the next word in the sequence Y based on the previous target words AND the full source sequence X
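
In symbols (standard notation, not taken from the cards):

    Language model:    P(y_t | y_1, ..., y_{t-1})
    Translation model: P(y_t | y_1, ..., y_{t-1}, X)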

17
Q

Explain how the encoder-decoder model shown in the image works

A

A single recurrent hidden layer takes the embeddings of the source text one word at a time; a separator token then marks the switch to generation. Each predicted word is fed back in as input to predict the next word, until the end of the sequence is reached. The key point is that the final hidden state after the last input word is fed into the decoder, which predicts the target words

18
Q

By using the hidden layer at the end of the sentence in the machine translation model, what is avoided?

A

It avoids word-order typology problems: we have full knowledge of where the sentence starts and ends, so before we start to translate the first word we already know how the source sentence ends

19
Q

How is the encoder trained in machine translation models?

A

The input words are embedded using an embedding layer and are fed in one at a time to the encoder until the full input has been seen.

20
Q

Into which states of the decoder is the final hidden state of the encoder fed?

A

It is fed into every single state of the decoder

21
Q

What are the inputs at each step in the decoder?

A

The context vector c (the final hidden state of the encoder), the previous output y_{t-1}, and the previous hidden state of the decoder h_{t-1}
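
A common way to write the resulting decoder step (standard RNN decoder notation, assumed rather than taken from the cards):

    h_t = g(y_{t-1}, h_{t-1}, c)
    P(y_t | y_1, ..., y_{t-1}, X) = softmax(W h_t)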

22
Q

What is the typical loss function for a machine translation model?

A

It is a cross entropy loss function

23
Q

What is used during training of machine translation but not inference?

A

Teacher forcing: during training the decoder is fed the correct (gold) previous target word rather than its own prediction, so that we train on the exact reference translations
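
A sketch of one training step with teacher forcing, reusing the illustrative EncoderDecoder from card 12 (assumes src and tgt are batches of token ids, with tgt starting with the separator token):

    import torch.nn.functional as F

    def train_step(model, optimizer, src, tgt):
        # Teacher forcing: the decoder sees the gold target shifted right
        # (separator, y_1, ..., y_{T-1}); the labels are the gold words y_1, ..., y_T.
        logits = model(src, tgt[:, :-1])                    # (batch, T, tgt_vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1))      # average cross-entropy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()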

24
Q

What is the total loss per sentence in machine translation?

A

It is the average loss across all target words
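
Written out (standard formulation), for a target sentence of length T:

    L = -(1/T) * sum over t=1..T of log P(y_t | y_1, ..., y_{t-1}, X)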

25
Q

Why is a separator token added to the start of the target sequence Y?

A

Because the decoder needs a previous word embedding to compute a prediction; since nothing has been predicted yet, the separator provides an initial input for the first target word

26
Q

Given that the influence of earlier tokens decays as the sequence is processed, how can access to previous hidden states be achieved without this decay?

A

It is achieved using attention

27
Q

What can be used instead of the static context vector that has been shown to be better?

A

An attention vector

28
Q

What are some types of attention layers?

A

Dot-product attention

Additive attention

Self-attention

29
Q

What does the image show?

A

It shows how the static context vector is replaced with an attention vector. Each encoder hidden state is multiplied by an attention weight, and the weighted sum of the hidden states gives the context vector for the current decoder step
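
A minimal dot-product attention sketch over the encoder states (illustrative; the tensor names and shapes are assumptions):

    import torch
    import torch.nn.functional as F

    def dot_product_attention(dec_hidden, enc_states):
        # dec_hidden: (batch, hid)          current decoder hidden state
        # enc_states: (batch, src_len, hid) all encoder hidden states
        scores = torch.bmm(enc_states, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)                                  # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # weighted sum
        return context, weights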

30
Q

What problem can using an attention vector over the last encoder layer overcome?

A

In some language pairs, if the start/end of the sentence is more important, some models would reverse the source sentence to compensate for that. Attention avoids this by learning which parts of the input to focus on

31
Q

What is a problem of using the argmax function with machine translation?

A

There is no hindsight to allow us to revisit choices at previous time steps if we have got something wrong.

32
Q

What is greedy decoding?

A

It is decoding by taking the argmax (most probable token) at each step, so that we cannot revisit previous choices - we are stuck with the choices we made
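
A greedy decoding loop, sketched against the illustrative EncoderDecoder interface above (the separator and end-of-sequence token ids are assumptions):

    import torch

    @torch.no_grad()
    def greedy_decode(model, src, sep_id, end_id, max_len=50):
        # Start from the separator token and repeatedly append the argmax token.
        tgt = torch.full((src.size(0), 1), sep_id, dtype=torch.long)
        for _ in range(max_len):
            logits = model(src, tgt)                                 # (batch, len, vocab)
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)    # most probable next word
            tgt = torch.cat([tgt, next_tok], dim=1)
            if (next_tok == end_id).all():                           # stop once every sequence ends
                break
        return tgt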

33
Q

What does Beam Search allow for?

A

It allows you to return to previously explored options if they become better than the option currently being pursued

34
Q

How does a beam search decoder work?

A

It keeps a memory of the k-best sequence options (hypotheses) at any decoding step

35
Q

What is the beam width?

A

It is the memory size k

36
Q

What happens at each step in a beam search?

A

Each of the k hypotheses is extended by all V possible tokens in the vocabulary.

The best k sequences out of the resulting k x V hypotheses are then kept in memory
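
A compact beam search sketch over per-step log probabilities (illustrative; next_log_probs is an assumed helper that returns log P(token | prefix) for every token in the vocabulary):

    def beam_search(next_log_probs, start_id, end_id, beam_width=2, max_len=20):
        # Each hypothesis is (token_list, total_log_prob); only the best k are kept.
        beams = [([start_id], 0.0)]
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq[-1] == end_id:            # finished hypotheses carry over unchanged
                    candidates.append((seq, score))
                    continue
                for tok, logp in enumerate(next_log_probs(seq)):
                    candidates.append((seq + [tok], score + logp))   # k x V extensions
            # Keep the k best of the k x V candidates (log probs add, so higher is better).
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams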

37
Q

Explain what the image below shows, using a beam width of 2.

A

We start with the start-of-sentence token. We first take the two most probable first words, giving the two sequences ‘start arrived’ and ‘start the’. At the next step there are 4 possible extensions to choose from. By using logs, we can add together the log probabilities of each word in the sequence. In doing so, we see that the two most probable sequences from where we currently sit are ‘start the green’ and ‘start the witch’. We repeat this again, and see that the next most probable sequences are ‘start the green witch’ and ‘start the witch arrived’. This is repeated until the end is reached.
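
With made-up numbers (purely illustrative, not taken from the image): if P(the | start) = 0.5 and P(green | start, the) = 0.4, then

    log P(start the green) = log 0.5 + log 0.4 ≈ -0.69 + -0.92 = -1.61

which is the same as log(0.5 × 0.4) = log 0.2. Adding log probabilities is therefore equivalent to multiplying probabilities, and the hypothesis with the highest (least negative) sum is the most probable.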

38
Q

What is the typical beam width in actual systems, and what is the problem of going above this size?

A

The typical size is 4 to 10; going above 10 makes decoding take a long time, which slows down experiment cycles.

39
Q

What type of vocabulary is used in a seq2seq model?

A

A fixed vocabulary (e.g. ~50k words, limited by GPU memory), although subword vocabularies such as BPE or WordPiece can also be used

40
Q

What decoder should be used to get fast results, and what should be used to get the best results?

A

A greedy decoder for fast results, a beam search decoder for the best results

41
Q

What are the training data options for machine translation?

A

Parallel Corpus - a corpus in two languages containing sentences that are aligned with each other

Monolingual Corpus - large collections of text in one language and, separately, in the other, with no connection between the two. They are very large but require additional techniques (such as backtranslation) to be useful

42
Q

What is backtranslation?

A

Train a model using the parallel corpus, apply it to the massive monolingual corpus to translate it, and use the resulting synthetic sentence-aligned data as additional training material for the final model
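
A rough sketch of the backtranslation loop (pseudocode; the corpus and helper names are assumptions):

    # 1. Train a reverse model (target -> source) on the small parallel corpus.
    reverse_model = train(parallel_pairs_reversed)

    # 2. Translate the large monolingual target-language corpus back into the source
    #    language, producing synthetic sentence-aligned (source, target) pairs.
    synthetic_pairs = [(reverse_model.translate(t), t) for t in monolingual_target]

    # 3. Train the final source -> target model on the real plus synthetic parallel data.
    final_model = train(parallel_pairs + synthetic_pairs)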

43
Q

How can Machine Translation systems be evaluated?

A

Human Assessment, BLEU metric, Precision, Recall, NIST, TER, METEOR