Text Generation 2: Autoregressive encoder-decoder with RNNs and attention Flashcards

(16 cards)

1
Q

What are 3 NLP tasks?

A

Three examples, as covered in this deck: sequence classification (e.g., sentiment classification), sequence labeling (e.g., POS tagging), and text generation (e.g., machine translation).
2
Q

How can sequence classification (sentiment classification for example) be implemented with RNNs?

A

Intermediate outputs are discarded and only the final one is passed through an MLP and/or softmax to get the final class.
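A minimal PyTorch sketch of this setup (module names and dimensions are illustrative, not from the deck):

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Sequence classification: keep only the final RNN state."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        _, h_n = self.rnn(self.embed(token_ids))   # intermediate outputs are discarded
        return self.mlp(h_n[-1])                   # class logits; softmax for probabilities
```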

3
Q

How can sequence labeling (POS Tagging for example) be implemented with RNNs?

A

We pass each intermediate output through an MLP and/or softmax to get a sequence of labels (a label prediction for each token).
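The same sketch adapted for labeling, so every intermediate output is classified (again, names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Sequence labeling: classify every intermediate RNN output."""
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128, num_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):              # (batch, seq_len)
        outputs, _ = self.rnn(self.embed(token_ids))
        return self.mlp(outputs)               # (batch, seq_len, num_tags): one label per token
```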

4
Q

Why can't RNNs be used for text generation?

A

They can, but with the constraint that the output length has to equal the input length, which is usually not the case. What we can do instead is train two RNNs: an encoder and a decoder.

5
Q

What could be quick fixes so that RNN can generate arbitrary-length text?

A
  1. For outputs longer than inputs: append <PAD> special tokens to the input and keep generating (cumbersome)
  2. For outputs shorter than inputs: ignore all outputs after the <EOS> token
6
Q

Explain the seq2seq RNN model.

A

Idea: separate the architecture into an encoder and a decoder. They have separate parameters, and the initial state of the decoder is the last state of the encoder.

Encoder: a normal RNN; we encode the whole sequence into one vector. This vector is used as the initial hidden state h_0 of the decoder. We can also transform it through a small NN if the decoder and encoder work in different dimensionalities.

The initial token x_0 of the decoder is the <BOS> (beginning of sequence) token.

Every other input token x_i is the previous output of the decoder (or, during training, teacher forcing feeds the correct token instead of the decoder's output).

Generation stops when <EOS> is generated, or when the maximum length is reached. A minimal sketch follows below.
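A hypothetical PyTorch sketch of greedy seq2seq generation along these lines (token ids, names, and dimensions are assumptions):

```python
import torch
import torch.nn as nn

BOS, EOS = 1, 2  # hypothetical special-token ids

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, enc_dim=128, dec_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, enc_dim, batch_first=True)
        self.bridge = nn.Linear(enc_dim, dec_dim)   # only needed if enc/dec dims differ
        self.decoder = nn.GRU(emb_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab_size)

    @torch.no_grad()
    def generate(self, src_ids, max_len=50):        # src_ids: (1, src_len)
        _, h = self.encoder(self.embed(src_ids))    # encode the whole source into one vector
        h = torch.tanh(self.bridge(h))              # becomes the decoder's initial state h_0
        token = torch.tensor([[BOS]])               # x_0 = <BOS>
        generated = []
        for _ in range(max_len):                    # stop at <EOS> or at max length
            dec_out, h = self.decoder(self.embed(token), h)
            token = self.out(dec_out).argmax(-1)    # previous output is the next input
            if token.item() == EOS:
                break
            generated.append(token.item())
        return generated
```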

7
Q

What is teacher forcing?

A

When training RNN decoders, the output of one step is used as the input token of the next step. The problem: if the model makes one mistake early on, the error propagates to the end of the sequence and the loss becomes larger than it should be.

That's why we use teacher forcing: during training, instead of feeding the decoder's own output back in, we feed the true (gold) token. A parameter p gives the probability of using the true token rather than the predicted one; p = 1 means we always use the true token. A sketch follows below.
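A sketch of a training-time decoding loop with teacher forcing, assuming a GRU decoder and the helper names shown (all hypothetical):

```python
import random
import torch
import torch.nn as nn

def decode_with_teacher_forcing(decoder, out, embed, h, target_ids, p=1.0):
    """Training-time decoding loop. With probability p the ground-truth token is
    fed as the next input; otherwise the model's own prediction is fed back.
    decoder (nn.GRU), out (nn.Linear), embed (nn.Embedding) are hypothetical names."""
    loss_fn = nn.CrossEntropyLoss()
    token, loss = target_ids[:, :1], 0.0              # position 0 holds <BOS>
    for t in range(1, target_ids.size(1)):
        dec_out, h = decoder(embed(token), h)
        logits = out(dec_out)                         # (batch, 1, vocab)
        loss = loss + loss_fn(logits.squeeze(1), target_ids[:, t])
        use_truth = random.random() < p               # p = 1: always teacher-force
        token = target_ids[:, t:t+1] if use_truth else logits.argmax(-1)
    return loss
```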

8
Q

Why are LSTMs bad at machine translation?

A

Because of long-range dependency problems. The current output might depend on a token 20 tokens back, and this is easily lost in an LSTM: the whole input is squeezed into a single fixed-size vector, so the more tokens the RNN reads, the less it remembers about individual tokens.

9
Q

Explain query, key, value in attention

A
  • query: the representation of the state that is asking/querying for information (a vector)
  • keys: the representations we compare the query to, to see how relevant each candidate value is (a matrix)
  • values: the information that is actually passed on, as a weighted sum

Energies (the dot product is scaled to preserve the scale of the variance, so that the softmax doesn't blow up because of the exponential):

e_i = (q^T k_i) / sqrt(d), where d is the query/key dimensionality

Attention weights (softmax over the energies):

a_i = exp(e_i) / sum_j exp(e_j)

Output (weighted sum of the values):

o = sum_i a_i v_i

A runnable toy version is sketched below.
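A toy PyTorch version of these formulas (shapes are illustrative):

```python
import torch

def attention(q, K, V):
    """Scaled dot-product attention for a single query.
    q: (d,), K: (n, d), V: (n, d_v); shapes are illustrative."""
    e = K @ q / q.size(0) ** 0.5       # energies e_i = q^T k_i / sqrt(d)
    a = torch.softmax(e, dim=0)        # attention weights, sum to 1
    return a @ V                       # weighted sum of the values

q = torch.randn(8)
K, V = torch.randn(5, 8), torch.randn(5, 8)
print(attention(q, K, V).shape)        # torch.Size([8])
```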

10
Q

What is cross-attention?

A

Cross-attention (also called encoder-decoder attention) tells the decoder, at output step t, which input tokens are relevant for generating the current output token. This is achieved using query, key, and value vectors/matrices: the current decoder state acts as the query, and the encoder states act as the keys and values.

At decoder step t, we compute attention over the encoder states. Then we concatenate the attention output with the current decoder state, feed the result through a linear transformation, and compute the final output distribution from that (see the sketch below).
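A hypothetical sketch of one such decoder step, assuming dot-product energies and equal encoder/decoder dimensionalities (names and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class CrossAttentionStep(nn.Module):
    """One decoder step with encoder-decoder attention: the decoder state is the
    query; the encoder states serve as both keys and values (no projections)."""
    def __init__(self, dec_dim=128, enc_dim=128, vocab_size=1000):
        super().__init__()
        self.combine = nn.Linear(dec_dim + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, dec_state, enc_states):
        # dec_state: (dec_dim,), enc_states: (src_len, enc_dim)
        e = enc_states @ dec_state / dec_state.size(0) ** 0.5   # energies over source
        a = torch.softmax(e, dim=0)                             # weights over input tokens
        context = a @ enc_states                                # attention output
        combined = torch.tanh(self.combine(torch.cat([context, dec_state])))
        return self.out(combined)                               # logits for output token t
```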

11
Q

Is attention an explanation?

A
  • Attention can show us what the model "thinks" is the reason for some output. For example, a sexist model might reveal that "Ms." and "medicine" in the text are the reason the label is "physician".
12
Q

What are the attention design choices?

A
  1. The energy (similarity) function: How do we compute energy/similarity/relevance between two state representations
  2. Parametrization: How/if we apply transformations to attention components
  3. Direction: Which components we attend over
13
Q

Explain the energy function design choices for attention

A

The energy (similarity) function scores how relevant a key k is to a query q. Common choices:
  • dot product: e = q^T k (no parameters; q and k must have the same dimensionality)
  • scaled dot product: e = (q^T k) / sqrt(d) (keeps the variance of the scores stable)
  • bilinear ("general"): e = q^T W k (a learned matrix W; allows different dimensionalities)
  • additive / MLP: e = w^T tanh(W_q q + W_k k) (a small feed-forward network)
All four are sketched below.
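A toy sketch of the four energy functions (all tensors are random and the dimension d is illustrative, just to show the shapes):

```python
import torch

d = 8
q, k = torch.randn(d), torch.randn(d)
W = torch.randn(d, d)                               # learned bilinear matrix
w, Wq, Wk = torch.randn(d), torch.randn(d, d), torch.randn(d, d)

dot      = q @ k                                    # dot product
scaled   = q @ k / d ** 0.5                         # scaled dot product
bilinear = q @ W @ k                                # bilinear ("general")
additive = w @ torch.tanh(Wq @ q + Wk @ k)          # additive / MLP (Bahdanau-style)
```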
14
Q

Explain the parametrization design choices for attention

A

Whether (and how) we apply learned transformations to the attention components before computing energies: e.g., project queries, keys, and values with learned matrices (q' = W_q q, k' = W_k k, v' = W_v v), or use the raw encoder/decoder states directly with no extra parameters. Projections add parameters but let the model control what information each component exposes.
15
Q

Explain the direction of attention design choices for attention

A

Which components attend over which: e.g., decoder states attending over encoder states (cross-attention, as in translation), or a sequence attending over its own states (self-attention). This choice determines what serves as the query and what serves as the keys and values.