Sequence Processing, Transformers and Attention Flashcards

1
Q

What does LSTM stand for?

A

Long Short-Term Memory

2
Q

Why is long-distance information critical to many language applications?

A

Words at the start of the sentence can often have a large impact on the end of the sentence

3
Q

What is an issue with using RNNs?

A

The information from the start of the sentence can degrade as we move through the sentence, becoming less accessible

4
Q

What is the vanishing gradients problem?

A

The vanishing gradients problem is where the gradients get smaller and smaller as they are propagated back through deeper stacks of layers, eventually 'vanishing'. The connection weights in those layers are left virtually unchanged, so training never converges to a good solution

5
Q

What is the exploding gradients problem?

A

It is the opposite of the vanishing gradients problem: as Gradient Descent progresses, the gradients get larger and larger, so the training loss diverges and does not converge.

6
Q

What does the image show?

A

This is an LSTM cell; an LSTM layer consists of many of these cells

7
Q

Explain how an LSTM cell, similar to the one shown in the image, works.

A

The previous hidden state, h(t-1), is still provided as an input alongside the current input x(t). There is also an additional memory unit, the cell state c(t-1), a form of secondary memory that can record longer-term information. The LSTM cell can be trained to learn what to remember and what to forget, which is what the gates are for. Each gate computes its value by passing the inputs through an activation function, and these gate values in turn produce the output vector y(t), the next long-term state c(t) and the next short-term state h(t). For training, each gate has its own fully connected dense layer whose weights are updated.
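
A minimal NumPy sketch of the forward pass of one LSTM cell, assuming the standard forget/input/output gates; the weight and bias dictionaries W and b and the gate names are illustrative, not taken from the card:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_cell(x_t, h_prev, c_prev, W, b):
        # Concatenate the previous short-term state with the current input
        z = np.concatenate([h_prev, x_t])
        f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to erase from c
        i = sigmoid(W["i"] @ z + b["i"])   # input gate: what to write to c
        g = np.tanh(W["g"] @ z + b["g"])   # candidate memory content
        o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose as h
        c_t = f * c_prev + i * g           # next long-term state
        h_t = o * np.tanh(c_t)             # next short-term state (also the output y_t)
        return h_t, c_t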

8
Q

What is a GRU?

A

It is a Gated Recurrent Unit

9
Q

What does a GRU do?

A

It merges the long-term and short-term state vectors into a single state vector

10
Q

Are GRUs simpler or more complex than an LSTM?

A

They are simpler and more efficient to compute while still producing similar results.

11
Q

In simple terms, explain the difference between the 4 concepts shown in the image.

A

The MLP simply maps the input to the output

The RNN combines the previous hidden state with the current input x

The LSTM cell also combines the previous hidden state with the current input x, but adds a long-term memory and learns what to remember

The GRU merges the long-term and short-term memory into a single state and learns what to remember and forget

12
Q

What does the transformer architecture show?

A

It showed that, by using the concept of self-attention, LSTMs and GRUs could be outperformed

13
Q

What are the problems of GRUs and RNNs that transformers solve?

A

While they were designed to help with longer-distance sequences, they still could not cope with very long-distance sequences

14
Q

What is the key concept of the transformer architecture?

A

It uses an attention layer that is not limited by the length of the sequence: it has access to all of the inputs in the sequence and learns which ones to attend to.

15
Q

Explain the architecture of the image showing the Transformer model.

A

It is an encoder-decoder model. The encoder is a stack of N blocks built around multi-head attention layers, and the decoder is another stack of N such blocks. The encoder output feeds into the decoder, and a final linear layer with a softmax produces the output for the problem.

16
Q

Why is a positional encoding used in the transformer model?

A

It is because the model takes all the words as one big input, so it has no access to the sequential position of each word. A positional encoding is therefore added to provide some knowledge of the positions

17
Q

What are attention layers?

A

They compare items to other items to reveal their relevance in the current context

18
Q

How does the self-attention layer map input sequences?

A

It maps an input sequence to an output sequence of the same length, where each output receives contributions from all of the inputs in the sequence up to that point

19
Q

What is an advantage of using self-attention layers?

A

The computation of each output is independent of the others, meaning the layer can be parallelised and is efficient to compute

20
Q

In self-attention, how do we compare items?

A

Each input is an embedding vector for a word, and the inputs are processed in pairs: every previous input is compared to the current input using the dot product

21
Q

What equations are required to compare the items with self-attention?

A
  1. For a sequence x1..x3 we need to compute score(x3, x1), score(x3, x2) and score(x3, x3)
  2. Then we normalise each score using a softmax to create a vector of attention weights called alpha
  3. Finally, we compute the output by summing the inputs in the sequence weighted by alpha (see the sketch below)
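
A minimal NumPy sketch of these three steps, assuming plain dot-product scores with no trainable weights (function and variable names are illustrative):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    def simple_self_attention(xs):
        # xs: array of input embeddings x1..xn, shape (n, d)
        outputs = []
        for i in range(len(xs)):
            # 1. score the current input against itself and every earlier input
            scores = np.array([xs[i] @ xs[j] for j in range(i + 1)])
            # 2. normalise the scores into attention weights alpha
            alpha = softmax(scores)
            # 3. output = sum of the inputs weighted by alpha
            outputs.append(alpha @ xs[:i + 1])
        return np.array(outputs)

With trainable weight matrices added (cards 22-28), this becomes the full scaled dot-product attention.
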
22
Q

How can we provide trainable weights for the attention function?

A

We introduce several trainable weight matrices: the query matrix Q, the key matrix K and the value matrix V

23
Q

Why does a transformer use embeddings?

A

They make the approach differentiable, meaning it can be trained with gradient descent and therefore fits well within our deep learning stack; if it were not differentiable, we could not compute the loss and run the training cycle

24
Q

In self-attention, what does the matrix Q allow us to do?

A

It allows us to map the input embedding x to the query embedding q

25
Q

In self-attention, what does the matrix K allow us to do?

A

It allows the input embedding x to be mapped to the key embedding k

26
Q

In self-attention, what does the matrix V allow us to do?

A

It allows the input embedding x to be mapped to the value embedding v

27
Q

How is the score for an input pair computed using self-attention?

A

We take the dot product of the query embedding of the current word and the key embedding of the earlier word, divided by the square root of the key dimension so that the weights stay in an appropriate range: score(xi, xj) = (qi · kj) / sqrt(dk)

28
Q

When applying the self-attention, what do we do?

A

We compute the similarity of the queries and the keys using their dot products, then pass the scores through the softmax function to obtain the attention weights, which are applied to the values (see the sketch below)
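
A minimal NumPy sketch pulling cards 22-28 together: trainable matrices map the inputs to queries, keys and values, the scaled dot-product scores are normalised with a softmax, and the resulting weights are applied to the values (names such as W_Q, W_K and W_V follow the usual convention and are assumed here rather than taken from the cards):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        # X: (n, d) input embeddings; W_Q, W_K, W_V: trainable weight matrices
        Q = X @ W_Q                        # query embedding for each input
        K = X @ W_K                        # key embedding for each input
        V = X @ W_V                        # value embedding for each input
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # scaled dot-product scores
        alpha = softmax(scores)            # attention weights (row-wise softmax)
        return alpha @ V                   # each output is a weighted sum of values

Note that this version attends to every position; card 29 adds the mask used during training.
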

29
Q

What do we do with future characters when we are training?

A

We mask the scores for positions after the current one, since using information from the future to train would be cheating
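
A sketch of this masking step, applied to the (n, n) score matrix before the softmax (the score matrix is assumed to come from the previous sketch):

    import numpy as np

    def causal_mask(scores):
        # scores: (n, n) matrix of query-key scores
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
        return np.where(future, -np.inf, scores)

After the softmax, the masked positions receive an attention weight of zero, because exp(-inf) is 0.
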

30
Q

How are positional embeddings learnt?

A

They are either learnt during training (e.g. from an additional position input alongside the words) or generated by a fixed positional encoding function (e.g. a sine/cosine function)
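
A sketch of the second option, using the standard sine/cosine encoding from the original Transformer paper (assumes an even d_model):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # one encoding vector per position, same dimensionality as the word embeddings
        pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
        dim = np.arange(0, d_model, 2)[None, :]      # even feature indices
        angles = pos / np.power(10000.0, dim / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                 # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                 # cosine on odd dimensions
        return pe

Because the result has the same dimensionality as the word embeddings (card 31), the two can simply be summed to give position-aware inputs.
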

31
Q

What is important about the dimensions of the positional embeddings?

A

They are the same as the dimensions of the word embeddings so they can be summed together to give a position-aware input embedding

32
Q

What is a self-attention layer called?

A

A head

33
Q

Why do we want N heads?

A

For the same reason that we use stacked RNNs: multiple heads can capture different patterns, so more heads and more layers can capture more complex patterns in the language used

34
Q

What happens to the output of all the heads used in a transformer?

A

They are concatenated
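
A sketch of multi-head attention that reuses the self_attention function from the card-28 sketch; the output projection W_O is an assumption from the standard architecture rather than something stated on this card:

    import numpy as np

    def multi_head_attention(X, heads, W_O):
        # heads: list of (W_Q, W_K, W_V) triples, one per head
        outputs = [self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
        concatenated = np.concatenate(outputs, axis=-1)  # join head outputs feature-wise
        return concatenated @ W_O                        # project back to the model dimension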

35
Q

How does a transformer work in regards to a text completion task?

A

The input is a sequence of words and the output is a prediction of the words that complete the sequence. The transformer embeds the input text, passes it through N stacked transformer blocks built around multi-head self-attention, and then through a softmax to predict the most probable next word
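
A very rough sketch of that pipeline; embed, pos_encode, blocks and W_out are stand-ins for trained components, not a definitive implementation:

    import numpy as np

    def predict_next_word(tokens, embed, pos_encode, blocks, W_out):
        X = embed(tokens) + pos_encode(len(tokens))   # position-aware input embeddings
        for block in blocks:                          # N stacked multi-head self-attention blocks
            X = block(X)
        logits = X[-1] @ W_out                        # project the last position onto the vocabulary
        probs = np.exp(logits - logits.max())         # softmax over the vocabulary
        probs /= probs.sum()
        return int(np.argmax(probs))                  # index of the most probable next word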

36
Q

What is used during training with a transformer for text completion but not in testing/actual use?

A

During training, teacher forcing is likely used (feeding in the correct previous words); in inference mode, the predicted word is fed back in to predict the following word, and so on.