Sequence Processing, Transformers and Attention Flashcards

1
Q

What does LSTM stand for?

A

Long Short-Term Memory

2
Q

Why is long-distance information critical to many language applications?

A

Words at the start of the sentence can often have a large impact on the end of the sentence

3
Q

What is an issue with using RNNs?

A

The information from the start of the sentence can degrade as we move through the sentence, becoming less accessible

4
Q

What is the vanishing gradients problem?

A

The vanishing gradients problem is where the gradients get smaller and smaller as they are propagated back through deeper stacks of layers, eventually 'vanishing'. The connection weights in those layers are left virtually unchanged, so training never converges to a good solution

5
Q

What is the exploding gradients problem?

A

It is the opposite of the vanishing gradients problem: as Gradient Descent progresses, the gradients get larger and larger, so the training loss diverges and does not converge.

6
Q

What does the image show?

A

This is an LSTM cell; an LSTM layer consists of many of these cells

7
Q

Explain how an LSTM cell, similar to the one shown in the image, works.

A

The previous hidden state, h(t-1), is still provided as an input alongside the current input x(t). There is also an additional memory unit, the cell state c(t-1), a form of secondary memory that can record longer-term information. The LSTM cell can be trained to learn what to remember and what to forget, which is what the gates are for. Each gate computes its value by passing the inputs through an activation function, and these gate values in turn produce the output vector y(t), the next long-term state c(t) and the next short-term state h(t). For training, each gate has its own fully connected dense layer whose weights are updated.
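
A minimal NumPy sketch of the forward pass of one LSTM cell, assuming the standard forget/input/output gates; the weight and bias dictionaries W and b and the gate names are illustrative, not taken from the card:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_cell(x_t, h_prev, c_prev, W, b):
        # Concatenate the previous short-term state with the current input
        z = np.concatenate([h_prev, x_t])
        f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to erase from c
        i = sigmoid(W["i"] @ z + b["i"])   # input gate: what to write to c
        g = np.tanh(W["g"] @ z + b["g"])   # candidate memory content
        o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose as h
        c_t = f * c_prev + i * g           # next long-term state
        h_t = o * np.tanh(c_t)             # next short-term state (also the output y_t)
        return h_t, c_t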

8
Q

What is a GRU?

A

It is a Gated Recurrent Unit

9
Q

What does a GRU do?

A

It merges the long-term and short-term state vectors into a single state vector

10
Q

Are GRUs simpler or more complex than an LSTM?

A

They are simpler and more efficient to compute while still producing similar results.

11
Q

In simple terms, explain the difference between the 4 concepts shown in the image.

A

The MLP simply maps the input to the output

The RNN combines the previous hidden state with the current input x

The LSTM cell also combines the previous hidden state with the current input x, but adds a long-term memory and learns what to remember

The GRU merges the long-term and short-term memory into a single state and learns what to remember and forget

12
Q

What does the transformer architecture show?

A

It showed that, by using the concept of self-attention, LSTMs and GRUs could be outperformed

13
Q

What are the problems of GRUs and RNNs that transformers solve?

A

While they were designed to help with longer-distance sequences, they still could not cope with very long-distance sequences

14
Q

What is the key concept of the transformer architecture?

A

It uses an attention layer that is not limited by the length of the sequence: it has access to all of the inputs in the sequence and learns which ones to attend to.

15
Q

Explain the architecture of the image showing the Transformer model.

A

It is an encoder-decoder model. The encoder is a stack of N blocks built around multi-head attention layers, and the decoder is another stack of N such blocks. The encoder output feeds into the decoder, and a final linear layer with a softmax produces the output for the problem.

16
Q

Why is a positional encoding used in the transformer model?

A

It is because the model takes all the words as one big input, so it has no access to the sequential position of each word. A positional encoding is therefore added to provide some knowledge of the positions

17
Q

What are attention layers?

A

They compare items to other items to reveal their relevance in the current context

18
Q

How does the self-attention layer map input sequences?

A

It maps an input sequence to an output sequence of the same length, where each output receives contributions from all of the inputs in the sequence up to that point

19
Q

What is an advantage of using self-attention layers?

A

The computation of each output is independent of the others, meaning the layer can be parallelised and is efficient to compute

20
Q

In self-attention, how do we compare items?

A

Each input is an embedding vector for a word, and the inputs are processed in pairs: every previous input is compared to the current input using the dot product

21
Q

What equations are required to compare the items with self-attention?

A
  1. For a sequence x1..x3 we need to compute score(x3, x1), score(x3, x2) and score(x3, x3)
  2. Then we normalise each score using a softmax to create a vector of attention weights called alpha
  3. Finally, we compute the output by summing the inputs in the sequence weighted by alpha (see the sketch below)
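
A minimal NumPy sketch of these three steps, assuming plain dot-product scores with no trainable weights (function and variable names are illustrative):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    def simple_self_attention(xs):
        # xs: array of input embeddings x1..xn, shape (n, d)
        outputs = []
        for i in range(len(xs)):
            # 1. score the current input against itself and every earlier input
            scores = np.array([xs[i] @ xs[j] for j in range(i + 1)])
            # 2. normalise the scores into attention weights alpha
            alpha = softmax(scores)
            # 3. output = sum of the inputs weighted by alpha
            outputs.append(alpha @ xs[:i + 1])
        return np.array(outputs)

With trainable weight matrices added (cards 22-28), this becomes the full scaled dot-product attention.
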
22
Q

How can we provide trainable weights for the attention function?

A

We introduce several trainable weight matrices: the query matrix Q, the key matrix K and the value matrix V

23
Q

Why does a transformer use embeddings?

A

They make the approach differentiable, meaning it can be trained with gradient descent and therefore fits well within our deep learning stack; if it were not differentiable, we could not compute the loss and run the training cycle

24
Q

In self-attention, what does the matrix Q allow us to do?

A

It allows us to map the input embedding x to the query embedding q

25
Q

In self-attention, what does the matrix K allow us to do?

A

It allows the input embedding x to be mapped to the key embedding k

26
Q

In self-attention, what does the matrix V allow us to do?

A

It allows the input embedding x to be mapped to the value embedding v

27
Q

How is the score for an input pair computed using self-attention?

A

We take the dot product of the query embedding of the current word and the key embedding of the earlier word, divided by the square root of the key dimension so that the weights stay in an appropriate range: score(xi, xj) = (qi · kj) / sqrt(dk)

28
Q

When applying the self-attention, what do we do?

A

We compute the similarity of the queries and the keys using their dot products, then pass the scores through the softmax function to obtain the attention weights, which are applied to the values (see the sketch below)
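
A minimal NumPy sketch pulling cards 22-28 together: trainable matrices map the inputs to queries, keys and values, the scaled dot-product scores are normalised with a softmax, and the resulting weights are applied to the values (names such as W_Q, W_K and W_V follow the usual convention and are assumed here rather than taken from the cards):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        # X: (n, d) input embeddings; W_Q, W_K, W_V: trainable weight matrices
        Q = X @ W_Q                        # query embedding for each input
        K = X @ W_K                        # key embedding for each input
        V = X @ W_V                        # value embedding for each input
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # scaled dot-product scores
        alpha = softmax(scores)            # attention weights (row-wise softmax)
        return alpha @ V                   # each output is a weighted sum of values

Note that this version attends to every position; card 29 adds the mask used during training.
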

29
Q

What do we do with future characters when we are training?

A

We mask the scores for positions after the current one, since using information from the future to train would be cheating
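
A sketch of this masking step, applied to the (n, n) score matrix before the softmax (the score matrix is assumed to come from the previous sketch):

    import numpy as np

    def causal_mask(scores):
        # scores: (n, n) matrix of query-key scores
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
        return np.where(future, -np.inf, scores)

After the softmax, the masked positions receive an attention weight of zero, because exp(-inf) is 0.
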

30
Q

How are positional embeddings learnt?

A

They are either learnt during training (e.g. from an additional position input alongside the words) or generated by a fixed positional encoding function (e.g. a sine/cosine function)
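
A sketch of the second option, using the standard sine/cosine encoding from the original Transformer paper (assumes an even d_model):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # one encoding vector per position, same dimensionality as the word embeddings
        pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
        dim = np.arange(0, d_model, 2)[None, :]      # even feature indices
        angles = pos / np.power(10000.0, dim / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                 # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                 # cosine on odd dimensions
        return pe

Because the result has the same dimensionality as the word embeddings (card 31), the two can simply be summed to give position-aware inputs.
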

31
Q

What is important about the dimensions of the positional embeddings?

A

They are the same as the dimensions of the word embeddings so they can be summed together to give a position-aware input embedding

32
Q

What is a self-attention layer called?

A

A head

33
Q

Why do we want N heads?

A

For the same reason that we use stacked RNNs: multiple heads can capture different patterns, so more heads and more layers can capture more complex patterns in the language used

34
Q

What happens to the output of all the heads used in a transformer?

A

They are concatenated
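
A sketch of multi-head attention that reuses the self_attention function from the card-28 sketch; the output projection W_O is an assumption from the standard architecture rather than something stated on this card:

    import numpy as np

    def multi_head_attention(X, heads, W_O):
        # heads: list of (W_Q, W_K, W_V) triples, one per head
        outputs = [self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
        concatenated = np.concatenate(outputs, axis=-1)  # join head outputs feature-wise
        return concatenated @ W_O                        # project back to the model dimension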

35
Q

How does a transformer work in regards to a text completion task?

A

The input is a sequence of words and the output is a prediction of the words that complete the sequence. The transformer embeds the input text, passes it through N stacked transformer blocks built around multi-head self-attention, and then through a softmax to predict the most probable next word
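
A very rough sketch of that pipeline; embed, pos_encode, blocks and W_out are stand-ins for trained components, not a definitive implementation:

    import numpy as np

    def predict_next_word(tokens, embed, pos_encode, blocks, W_out):
        X = embed(tokens) + pos_encode(len(tokens))   # position-aware input embeddings
        for block in blocks:                          # N stacked multi-head self-attention blocks
            X = block(X)
        logits = X[-1] @ W_out                        # project the last position onto the vocabulary
        probs = np.exp(logits - logits.max())         # softmax over the vocabulary
        probs /= probs.sum()
        return int(np.argmax(probs))                  # index of the most probable next word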

36
Q

What is used during training with a transformer for text completion but not in testing/actual use?

A

During training, teacher forcing is likely used (feeding in the correct previous words); in inference mode, the predicted word is fed back in to predict the following word, and so on.