Text Generation 2: Autoregressive encoder-decoder with RNNs and attention Flashcards
(16 cards)
What are 3 NLP tasks?
Sequence classification (e.g. sentiment classification), sequence labeling (e.g. POS tagging), and text generation (e.g. machine translation).
How can sequence classification (sentiment classification for example) be implemented with RNNs?
Intermediate outputs are discarded; only the final one is passed through an MLP and/or softmax to get the final class.
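A minimal numpy sketch of this setup (weight names and shapes are hypothetical, not from the lecture):

```python
import numpy as np

def rnn_classify(x, W_xh, W_hh, W_hy, b_h, b_y):
    """Sequence classification: only the final hidden state is used.

    x: (seq_len, d_in) array of token embeddings (assumed shapes).
    """
    h = np.zeros(W_hh.shape[0])
    for x_t in x:                        # intermediate outputs are discarded
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    logits = W_hy @ h + b_y              # MLP head on the last state only
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # softmax -> class probabilities
```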
How can sequence labeling (POS Tagging for example) be implemented with RNNs?
We pass each intermediate output through an MLP and/or softmax to get a sequence of labels (one label predicted per token).
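The same loop, but keeping an output per timestep (again a hedged sketch with assumed shapes):

```python
import numpy as np

def rnn_label(x, W_xh, W_hh, W_hy, b_h, b_y):
    """Sequence labeling: emit a label distribution at every timestep."""
    h = np.zeros(W_hh.shape[0])
    probs = []
    for x_t in x:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        logits = W_hy @ h + b_y          # one prediction per input token
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return np.stack(probs)               # (seq_len, n_labels)
```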
Why can't RNNs be used for text generation?
They can, but with the constraint that the output length has to equal the input length, which is usually not the case. What we can do is train two RNNs: an encoder and a decoder.
What could be quick fixes so that RNN can generate arbitrary-length text?
- For longer outputs than inputs: append <PAD> special tokens to the input and keep generating (cumbersome)
- For shorter outputs than inputs: ignore all outputs after the <EOS> token (see the toy sketch below)
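A toy sketch of the second fix (the token string is an assumption):

```python
def truncate_at_eos(tokens, eos="<EOS>"):
    """Keep generated tokens only up to the first <EOS>."""
    out = []
    for tok in tokens:
        if tok == eos:
            break                        # everything after <EOS> is ignored
        out.append(tok)
    return out

print(truncate_at_eos(["I", "am", "here", "<EOS>", "junk"]))
# -> ['I', 'am', 'here']
```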
Explain the seq2seq RNN model.
Idea: separate the architecture into an encoder and a decoder. They have separate parameters, and the initial state of the decoder is the last state of the encoder.
Encoder: a normal RNN; we encode the whole sequence into one vector. This vector is used as the initial hidden state h_0 of the decoder. We can also transform it through some NN if the decoder and encoder work in different dimensionalities.
The initial input token x_0 of the decoder is the <BOS> (beginning of sequence) token.
Every other input token x_i is the previous output of the decoder (or, during training, teacher forcing feeds in the correct token instead of the decoder output).
Stop generating when <EOS> is generated, or when the maximum length is reached (see the sketch below).
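A minimal greedy-decoding sketch of the whole pipeline, assuming plain Elman cells, hypothetical weight dictionaries (`enc`, `dec`), and an embedding table:

```python
import numpy as np

def seq2seq_generate(src, enc, dec, embed, W_out, b_out,
                     bos_id, eos_id, max_len=50):
    """Encode the source into one vector, then decode greedily."""
    # encoder: compress the whole input sequence into one state
    h = np.zeros(enc["W_hh"].shape[0])
    for tok in src:
        h = np.tanh(enc["W_xh"] @ embed[tok] + enc["W_hh"] @ h + enc["b"])

    # decoder: its initial state is the encoder's last state
    out, x = [], bos_id                  # first input is <BOS>
    for _ in range(max_len):
        h = np.tanh(dec["W_xh"] @ embed[x] + dec["W_hh"] @ h + dec["b"])
        logits = W_out @ h + b_out
        x = int(np.argmax(logits))       # feed the prediction back in
        if x == eos_id:                  # stop on <EOS> or at max_len
            break
        out.append(x)
    return out
```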
What is teacher forcing?
When training decoders (RNNs), we use the output of a cell as the input token of the next cell. The problem is that if the model makes one mistake, the error is propagated until the end of the sequence, and the loss is greater than it is supposed to be.
That's why we use teacher forcing, which essentially means that, instead of using the decoder output, we use the true output (during training). A parameter p gives the probability of using the true output instead of the predicted output; p = 1 means always feeding the true output.
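A sketch of the input choice during training (the function name is an assumption; p = 1 is full teacher forcing):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_decoder_input(true_tok, pred_tok, p=1.0):
    """With probability p feed the ground-truth token (teacher forcing),
    otherwise feed the decoder's own previous prediction."""
    return true_tok if rng.random() < p else pred_tok
```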
Why are LSTMs bad at machine translation?
Because of long-range dependency problems. The current output might depend on a token from 20 tokens ago, and this can easily be lost in an LSTM. The more tokens the RNN reads, the less it remembers about individual tokens.
Explain query, key, value in attention
- query: the representation of the state that is asking/querying for information (vector)
- keys: representations that we compare the query to, to see how relevant each candidate value is (matrix)
- values: the information that is actually sent on, weighted by the attention scores (matrix)
Energies (scaled dot product): e_i = (q^T · k_i) / sqrt(d_dec)
- the dot product is scaled because of the softmax: we want to preserve the scale of the variance, so that the softmax doesn't blow up because of the exponential operation
Attention weights (softmax): a_i = exp(e_i) / sum_j exp(e_j)
Output (weighted sum of values): a = sum_i a_i · v_i
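A compact numpy version of these three steps (shapes are assumptions: q is (d,), K and V hold one key/value per row):

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query."""
    e = K @ q / np.sqrt(q.shape[0])      # energies, scaled by sqrt(d)
    a = np.exp(e - e.max())
    a = a / a.sum()                      # softmax -> attention weights
    return a @ V                         # weighted sum of the values

q = np.random.randn(4)                   # query (vector)
K = np.random.randn(3, 4)                # keys (matrix)
V = np.random.randn(3, 4)                # values (matrix)
print(attention(q, K, V))                # context vector of shape (4,)
```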
What is cross-attention?
Cross-attention (encoder-decoder attention) tells the decoder, at output step t, which input tokens are relevant for generating the output token. This is achieved using query, key, and value vectors/matrices.
At decoder step t, we compute the attention mechanism over the encoder states. Then we concatenate the output of the attention with the current decoder state, feed the result through a linear transformation, and compute the final output from that.
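A sketch of one such decoder step, assuming the decoder state serves as the query and the encoder states serve as both keys and values (weight names are hypothetical):

```python
import numpy as np

def decoder_step_with_attention(h_dec, enc_states, W_c, b_c, W_out, b_out):
    """Attend over encoder states, combine with the decoder state, predict.

    h_dec: (d,) current decoder state (the query).
    enc_states: (n, d) encoder hidden states (keys and values).
    """
    e = enc_states @ h_dec / np.sqrt(h_dec.shape[0])  # energies
    a = np.exp(e - e.max())
    a = a / a.sum()                                   # attention weights
    ctx = a @ enc_states                              # attention output
    combined = np.concatenate([ctx, h_dec])           # concat with state
    h_tilde = np.tanh(W_c @ combined + b_c)           # linear transformation
    return W_out @ h_tilde + b_out                    # logits for output token
```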
Is attention an explanation?
- Attention can show us what the model "thinks" is the reason for some output. For example, a sexist model might indicate that "Ms." and "medicine" in the text are the reason why the label is "physician".
What are the attention design choices?
- The energy (similarity) function: How do we compute energy/similarity/relevance between two state representations
- Parametrization: How/if we apply transformations to attention components
- Direction: Which components we attend over
Explain the energy function design choices for attention
Common choices (sketched in code below):
- dot product: e_i = q^T k_i
- scaled dot product: e_i = (q^T k_i) / sqrt(d)
- bilinear ("general"): e_i = q^T W k_i, with a learned matrix W
- additive/MLP (Bahdanau-style): e_i = v^T tanh(W_1 q + W_2 k_i)
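The same four options in numpy (all weight shapes are assumptions):

```python
import numpy as np

def dot_energy(q, k):
    return q @ k                              # plain dot product

def scaled_dot_energy(q, k):
    return q @ k / np.sqrt(q.shape[0])        # scaled dot product

def bilinear_energy(q, k, W):
    return q @ W @ k                          # learned bilinear form

def additive_energy(q, k, W1, W2, v):
    return v @ np.tanh(W1 @ q + W2 @ k)       # additive / MLP energy
```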
Explain the parametrization design choices for attention
We can use the raw hidden states directly as the query/keys/values, or first pass them through learned linear transformations (e.g. q = W_q s^dec, k_i = W_k s_i^enc, v_i = W_v s_i^enc), which lets the model learn what to compare and what to send (see the sketch below).
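A sketch of the parametrized variant (projection matrix names are hypothetical):

```python
import numpy as np

def project_qkv(s_dec, enc_states, W_q, W_k, W_v):
    """Learned projections instead of using the raw states directly."""
    q = W_q @ s_dec             # query from the decoder state
    K = enc_states @ W_k.T      # one key per encoder state
    V = enc_states @ W_v.T      # one value per encoder state
    return q, K, V
```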
Explain the direction of attention design choices for attention
- cross-attention: the decoder attends over the encoder states (as in seq2seq machine translation)
- self-attention: a sequence attends over itself, i.e. each position attends over the other positions of the same sequence