Machine Translation and Encoder-Decoder Models Flashcards

1
Q

What is machine translation?

A

It is the use of computers to translate text from one language to another

2
Q

What machine translation models exist?

A

Statistical phrase-alignment models, encoder-decoder models, and Transformer models

3
Q

What type of task is machine translation?

A

It is a sequence-to-sequence (seq2seq) task

4
Q

What is the input, output and their lengths for a seq2seq task?

A

The input X is a sequence of words and the output Y is a sequence of words, but the length of X need not equal the length of Y

5
Q

Besides machine translation, what are some other seq2seq tasks?

A

Question → Answer

Sentence → Clause

Document → Abstract

6
Q

What do universal aspects mean with regard to human language?

A

These are aspects that are true, or statistically mostly true, across all languages

7
Q

What are some examples of universal aspects in human language?

A

Nouns/verbs, greetings, politeness/rudeness

8
Q

What are translation divergences?

A

These are areas where languages differ

9
Q

What are some examples of translation divergences?

A

Idiosyncrasies and lexical differences

Systematic differences

10
Q

What is the study of translation divergences called?

A

Linguistic Typology

11
Q

What is Word Order Typology?

A

It is the classification of languages by the typical order of subject, verb and object in a clause. Common orders include:

  • Subject-Verb-Object (SVO), e.g. English
  • Subject-Object-Verb (SOV), e.g. Japanese
  • Verb-Subject-Object (VSO), e.g. Classical Arabic

12
Q

What is the Encoder-Decoder model?

A

For an input sequence, we have an encoder that encodes the input into a context vector, which is then passed to a decoder that generates the output sequence.
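
As a rough illustration, a minimal GRU-based sketch of this architecture (PyTorch-style; the module names and sizes are assumptions, not part of the original cards):

    import torch.nn as nn

    class EncoderDecoder(nn.Module):
        """Minimal encoder-decoder sketch: encode to a context vector, decode from it."""
        def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, tgt_vocab)

        def forward(self, src, tgt):
            # Encode the source; the final hidden state acts as the context vector.
            _, context = self.encoder(self.src_emb(src))
            # Decode the target, initialising the decoder with the context vector.
            dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
            return self.out(dec_out)  # per-step scores over the target vocabulary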

13
Q

What can an encoder be?

A

LSTM, GRU, CNN, Transformers

14
Q

What is a context vector?

A

It is the final hidden state of the encoder, which is used as the input to the decoder

15
Q

What does a language model try to do?

A

Predict the next word in a sequence Y based on the previous words

16
Q

How is a translation model different to a language model?

A

It predicts the next word in the sequence Y based on the previous target words AND the full source sequence X
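
In symbols (standard notation, not taken from the cards):

    Language model:    P(y_t | y_1, ..., y_{t-1})
    Translation model: P(y_t | y_1, ..., y_{t-1}, X)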

17
Q

Explain how the encoder-decoder model shown in the image works

A

A single recurrent hidden layer takes the embeddings of the source text one word at a time; a separator token then marks the switch to generation. Each predicted word is fed back in as input to predict the next word, until the end of the sequence is reached. The key point is that the final hidden state after the last input word is fed into the decoder, which predicts the target words

18
Q

By using the hidden layer at the end of the sentence in the machine translation model, what is avoided?

A

It avoids word-order typology problems: we have full knowledge of where the sentence starts and ends, so before we start to translate the first word we already know how the source sentence ends

19
Q

How is the encoder trained in machine translation models?

A

The input words are embedded using an embedding layer and are fed in one at a time to the encoder until the full input has been seen.

20
Q

Into which states of the decoder is the final hidden state of the encoder fed?

A

It is fed into every single state of the decoder

21
Q

What are the inputs at each step in the decoder?

A

The context vector c (the final hidden state of the encoder), the previous output y_{t-1}, and the previous hidden state of the decoder h_{t-1}
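
A common way to write the resulting decoder step (standard RNN decoder notation, assumed rather than taken from the cards):

    h_t = g(y_{t-1}, h_{t-1}, c)
    P(y_t | y_1, ..., y_{t-1}, X) = softmax(W h_t)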

22
Q

What is the typical loss function for a machine translation model?

A

It is a cross entropy loss function

23
Q

What is used during training of machine translation but not inference?

A

Teacher forcing: during training the decoder is fed the correct (gold) previous target word rather than its own prediction, so that we train on the exact reference translations
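
A sketch of one training step with teacher forcing, reusing the illustrative EncoderDecoder from card 12 (assumes src and tgt are batches of token ids, with tgt starting with the separator token):

    import torch.nn.functional as F

    def train_step(model, optimizer, src, tgt):
        # Teacher forcing: the decoder sees the gold target shifted right
        # (separator, y_1, ..., y_{T-1}); the labels are the gold words y_1, ..., y_T.
        logits = model(src, tgt[:, :-1])                    # (batch, T, tgt_vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1))      # average cross-entropy loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()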

24
Q

What is the total loss per sentence in machine translation?

A

It is the average loss across all target words
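
Written out (standard formulation), for a target sentence of length T:

    L = -(1/T) * sum over t=1..T of log P(y_t | y_1, ..., y_{t-1}, X)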

25
Q

Why is a separator token added to the start of the target sequence Y?

A

Because the decoder needs a previous word embedding to compute a prediction; since nothing has been predicted yet, the separator provides an initial input for the first target word

26
Q

Given that the influence of earlier tokens decays as the sequence is processed, how can access to previous hidden states be achieved without this decay?

A

It is achieved using attention

27
Q

What can be used instead of the static context vector that has been shown to be better?

A

An attention vector

28
Q

What are some types of attention layers?

A

Dot-product attention

Additive attention

Self-attention

29
Q

What does the image show?

A

It shows how the static context vector is replaced with an attention vector. Each encoder hidden state is multiplied by an attention weight, and the weighted sum of the hidden states gives the context vector for the current decoder step
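
A minimal dot-product attention sketch over the encoder states (illustrative; the tensor names and shapes are assumptions):

    import torch
    import torch.nn.functional as F

    def dot_product_attention(dec_hidden, enc_states):
        # dec_hidden: (batch, hid)          current decoder hidden state
        # enc_states: (batch, src_len, hid) all encoder hidden states
        scores = torch.bmm(enc_states, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)                                  # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)    # weighted sum
        return context, weights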

30
Q

What problem can using an attention vector over the last encoder layer overcome?

A

In some language pairs, if the start/end of the sentence is more important, some models would reverse the source sentence to compensate for that. Attention avoids this by learning which parts of the input to focus on

31
Q

What is a problem of using the argmax function with machine translation?

A

There is no hindsight to allow us to revisit choices at previous time steps if we have got something wrong.

32
Q

What is greedy decoding?

A

It is decoding by taking the argmax (most probable token) at each step, so that we cannot revisit previous choices - we are stuck with the choices we made
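
A greedy decoding loop, sketched against the illustrative EncoderDecoder interface above (the separator and end-of-sequence token ids are assumptions):

    import torch

    @torch.no_grad()
    def greedy_decode(model, src, sep_id, end_id, max_len=50):
        # Start from the separator token and repeatedly append the argmax token.
        tgt = torch.full((src.size(0), 1), sep_id, dtype=torch.long)
        for _ in range(max_len):
            logits = model(src, tgt)                                 # (batch, len, vocab)
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)    # most probable next word
            tgt = torch.cat([tgt, next_tok], dim=1)
            if (next_tok == end_id).all():                           # stop once every sequence ends
                break
        return tgt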

33
Q

What does Beam Search allow for?

A

It allows you to return to previously explored options if they become better than the option currently being pursued

34
Q

How does a beam search decoder work?

A

It keeps a memory of the k-best sequence options (hypotheses) at any decoding step

35
Q

What is the beam width?

A

It is the memory size k

36
Q

What happens at each step in a beam search?

A

Each of the k hypotheses is extended by all V possible tokens in the vocabulary.

The best k sequences out of the resulting k x V hypotheses are then kept in memory
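
A compact beam search sketch over per-step log probabilities (illustrative; next_log_probs is an assumed helper that returns log P(token | prefix) for every token in the vocabulary):

    def beam_search(next_log_probs, start_id, end_id, beam_width=2, max_len=20):
        # Each hypothesis is (token_list, total_log_prob); only the best k are kept.
        beams = [([start_id], 0.0)]
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq[-1] == end_id:            # finished hypotheses carry over unchanged
                    candidates.append((seq, score))
                    continue
                for tok, logp in enumerate(next_log_probs(seq)):
                    candidates.append((seq + [tok], score + logp))   # k x V extensions
            # Keep the k best of the k x V candidates (log probs add, so higher is better).
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams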

37
Q

Explain what the image below shows, using a beam width of 2.

A

We start with the start-of-sentence token. We first take the two most probable first words, giving the two sequences ‘start arrived’ and ‘start the’. At the next step there are 4 possible extensions to choose from. By using logs, we can add together the log probabilities of each word in the sequence. In doing so, we see that the two most probable sequences from where we currently sit are ‘start the green’ and ‘start the witch’. We repeat this again, and see that the next most probable sequences are ‘start the green witch’ and ‘start the witch arrived’. This is repeated until the end is reached.
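
With made-up numbers (purely illustrative, not taken from the image): if P(the | start) = 0.5 and P(green | start, the) = 0.4, then

    log P(start the green) = log 0.5 + log 0.4 ≈ -0.69 + -0.92 = -1.61

which is the same as log(0.5 × 0.4) = log 0.2. Adding log probabilities is therefore equivalent to multiplying probabilities, and the hypothesis with the highest (least negative) sum is the most probable.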

38
Q

What is the typical beam width in actual systems, and what is the problem of going above this size?

A

The typical size is 4 to 10; going above 10 makes decoding take a long time, which slows down experiment cycles.

39
Q

What type of vocabulary is used in a seq2seq model?

A

A fixed vocabulary (e.g. ~50k words, limited by GPU memory), although subword vocabularies such as BPE or WordPiece can also be used

40
Q

What decoder should be used to get fast results, and what should be used to get the best results?

A

A greedy decoder for fast results, a beam search decoder for the best results

41
Q

What are the training data options for machine translation?

A

Parallel Corpus - a corpus in two languages containing sentences that are aligned with each other

Monolingual Corpus - large collections of text in one language and, separately, in the other, with no connection between the two. They are very large but require additional techniques (such as backtranslation) to be useful

42
Q

What is backtranslation?

A

Train a model using the parallel corpus, apply it to the massive monolingual corpus to translate it, and use the resulting synthetic sentence-aligned data as additional training material for the final model
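
A rough sketch of the backtranslation loop (pseudocode; the corpus and helper names are assumptions):

    # 1. Train a reverse model (target -> source) on the small parallel corpus.
    reverse_model = train(parallel_pairs_reversed)

    # 2. Translate the large monolingual target-language corpus back into the source
    #    language, producing synthetic sentence-aligned (source, target) pairs.
    synthetic_pairs = [(reverse_model.translate(t), t) for t in monolingual_target]

    # 3. Train the final source -> target model on the real plus synthetic parallel data.
    final_model = train(parallel_pairs + synthetic_pairs)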

43
Q

How can Machine Translation systems be evaluated?

A

Human Assessment, BLEU metric, Precision, Recall, NIST, TER, METEOR