DL-08 - Transformers Flashcards

1
Q

DL-08 - Transformers

What are some problems with RNNs/LSTMs? (4)

A
  • Difficult to train.
  • Very long gradient paths.
  • Transfer learning never really works.
  • Recurrence prevents parallel computation.
2
Q

DL-08 - Transformers

What is the name of the paper where transformers were introduced?

A

Attention is All You Need

3
Q

DL-08 - Transformers

When was the transformers paper (Attention is All You Need) published?

A

2017

4
Q

DL-08 - Transformers

Who were the authors of the transformers paper (Attention is All You Need)?

A

Vaswani et al.

5
Q

DL-08 - Transformers

What do transformers use instead of recurrence? (2)

A
  • Context windows (the whole input is fed in at once)
  • Self-attention
6
Q

DL-08 - Transformers

In what areas are transformers currently very good? (2)

A
  • NLP
  • Computer vision
7
Q

DL-08 - Transformers

What does a transformer do? (2)

A
  • Encodes the input into vector representations
  • Decodes those representations back into an output
8
Q

DL-08 - Transformers

Do transformers use recurrence?

A

No, they avoid it.

9
Q

DL-08 - Transformers

Why can encoders be so fast?

A

No recurrence -> parallel computation.

10
Q

DL-08 - Transformers

What are the main characteristics of transformers? (3)

A
  • non-sequential
  • self-attention
  • positional encoding
11
Q

DL-08 - Transformers

Describe what is meant when we say transformers are non-sequential.

A

Sentences are processed as a whole, rather than word by word.

12
Q

DL-08 - Transformers

“Sentences are processed as a whole, rather than word by word.”
What is this property called?

A

Non-sequential.

13
Q

DL-08 - Transformers

Describe self-attention.

A

A new unit used to compute similarity scores between words in a sentence.

14
Q

DL-08 - Transformers

Describe positional encoding.

A

Encodes information about the position of a token in a sentence.

15
Q

DL-08 - Transformers

“Encodes information about the position of a token in a sentence.”
What is this called?

A

positional encoding

16
Q

DL-08 - Transformers

“A new unit used to compute similarity scores between words in a sentence.”
What is this called?

A

Self-attention.

17
Q

DL-08 - Transformers

What mechanism do transformers use to focus on relevant words while processing the current word?

A

Self-attention.

18
Q

DL-08 - Transformers

Why don’t transformers suffer from short-term memory?

A

Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.

19
Q

DL-08 - Transformers

What parts does “encoder embedding” consist of?

A
  • Word/input embedding
  • Positional embedding
20
Q

DL-08 - Transformers

What does positional embedding do?

A

It injects positional information (distance between different words) into the input embeddings.

21
Q

DL-08 - Transformers

What functions does the “Attention is all you need” paper use for positional encoding?

A

Sin/cos

22
Q

DL-08 - Transformers

In the image, what is
- d_model
- i
- pos

(See image)

A
  • d_model: The embedding size
  • i: Index along the embedding dimension (each value of i gives one sine/cosine pair)
  • pos: Position index in the incoming sequence.
23
Q

DL-08 - Transformers

How is positional information added to the embeddings?

A

They’re added element-wise to the word/input embeddings.
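
A minimal PyTorch sketch of the sin/cos positional encoding from the cards above and its element-wise addition to the word embeddings. The sequence length (10) and d_model (512) are illustrative toy values, not from the lecture:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sin/cos positional encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # position index, (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # even embedding dimensions
    angle = pos / (10000.0 ** (two_i / d_model))                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cosine
    return pe

word_emb = torch.randn(10, 512)                                    # toy word/input embeddings
x = word_emb + sinusoidal_positional_encoding(10, 512)             # element-wise addition
```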

24
Q

DL-08 - Transformers

What are the sub-modules of the encoder? (2)

A
  • Multi-headed attention
  • Fully connected feed forward network
25
# DL-08 - Transformers What does each of the sub-modules have (both attention head and FC module)? (2)
- Residual connection
- Normalization layer
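
A minimal PyTorch sketch of one encoder block as described by the two cards above: multi-headed attention and a feed-forward network, each followed by a residual connection and a normalization layer. It leans on torch.nn.MultiheadAttention; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # multi-headed self-attention (Q = K = V = x)
        x = self.norm1(x + attn_out)         # residual connection + normalization
        x = self.norm2(x + self.ff(x))       # residual connection + normalization
        return x

out = EncoderBlock()(torch.randn(2, 10, 512))   # -> (2, 10, 512)
```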
26
# DL-08 - Transformers What are the vectors in self-attention called? (3)
- Query
- Key
- Value
27
# DL-08 - Transformers How are the query, key and value vectors created?
A separate weight matrix for each; each vector is simply the incoming embedding vector multiplied with the corresponding weight matrix. (See image)
28
# DL-08 - Transformers How is "score" calculated in self-attention?
By taking the dot product of the Q and K vectors.
29
# DL-08 - Transformers What do you get when you take the dot product of the Q and K vectors?
The "score".
30
# DL-08 - Transformers How do you ensure stable gradients?
Scale down the "scores" by dividing them by √d_k.
31
# DL-08 - Transformers What is the formula for scaling the scores?
score / √d_k, i.e. divide each score by the square root of the key dimension. (See image)
32
# DL-08 - Transformers How do you normalize the scores?
Use softmax to produce attention weights (probabilities between 0 and 1).
33
# DL-08 - Transformers How do you calculate the final attention weights?
Use softmax to normalize the scaled scores, producing attention weights between 0 and 1.
34
# DL-08 - Transformers How do you get the output vector of a self-attention unit?
Calculate the self-attention weights, then multiply them by the value vectors.
35
# DL-08 - Transformers How do you write the self-attention block as a single matrix operation?
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V (See image)
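
A sketch of the cards above as matrix operations in PyTorch: project the embeddings to Q, K, V, take the dot-product scores, scale by √d_k, apply softmax, then weight the value vectors. The weight matrices here are random stand-ins, not trained parameters:

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """One self-attention head. x: (seq_len, d_model); W_*: (d_model, d_k)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # query, key, value vectors
    scores = Q @ K.T / math.sqrt(K.shape[-1])      # dot-product scores, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)        # attention weights between 0 and 1
    return weights @ V                             # weighted sum of value vectors

d_model, d_k, seq_len = 512, 64, 10                # toy sizes
x = torch.randn(seq_len, d_model)                  # incoming embedding vectors
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
z = self_attention(x, W_q, W_k, W_v)               # -> (seq_len, d_k)
```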
36
# DL-08 - Transformers What is depicted in the image? (See image)
Most of the self-attention block as a matrix operation.
37
# DL-08 - Transformers What is Multi-headed attention?
A block that uses N different self-attentions (called heads) with different Q, K, V to produce outputs Z_1, Z_2, Z_3, ..., Z_n.
38
# DL-08 - Transformers What is it called when you use multiple self-attention blocks in the same layer?
Multi-headed attention
39
# DL-08 - Transformers What is a self-attention block called?
A head.
40
# DL-08 - Transformers What is a head?
One self-attention block.
41
# DL-08 - Transformers What does multi-headed attention do for the layer?
It allows the layer to have multiple representation subspaces.
42
# DL-08 - Transformers What is done to the outputs of the individual self-attention blocks, to make them outputs of a multi-headed attention block?
They're concatenated into a single matrix and multiplied with a weight matrix W^O.
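
A self-contained sketch of multi-headed attention as described above: N independent heads produce Z_1 ... Z_N, which are concatenated and multiplied with W^O. Head count and sizes are illustrative, and the weight matrices are random stand-ins:

```python
import math
import torch

def multi_head_attention(x, heads, W_o):
    """x: (seq_len, d_model); heads: list of (W_q, W_k, W_v); W_o: (n_heads*d_k, d_model)."""
    zs = []
    for W_q, W_k, W_v in heads:                                   # one self-attention per head
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        w = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
        zs.append(w @ V)                                          # Z_1 ... Z_N
    return torch.cat(zs, dim=-1) @ W_o                            # concatenate, project with W^O

d_model, d_k, n_heads = 512, 64, 8                                # toy sizes
x = torch.randn(10, d_model)
heads = [tuple(torch.randn(d_model, d_k) for _ in range(3)) for _ in range(n_heads)]
W_o = torch.randn(n_heads * d_k, d_model)
z = multi_head_attention(x, heads, W_o)                           # -> (10, d_model)
```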
43
# DL-08 - Transformers What is in the image? (See image)
A transformer encoder block.
44
# DL-08 - Transformers Label the masked parts of the image. (See image)
(See image)
45
# DL-08 - Transformers What is in the image? (See image)
A transformer decoder block
46
# DL-08 - Transformers Label the masked parts of the image. (See image)
(See image)
47
# DL-08 - Transformers What happens to the outputs of a transformer decoder block?
(See image)
48
# DL-08 - Transformers What sub-layers does the decoder have? (3)
- 2 multi-headed attention layers
- a feed-forward layer
- residual connections and normalization layers after each sub-layer
49
# DL-08 - Transformers What is the decoder embedding comprised of? (2)
- Output word embedding
- Positional embedding
50
# DL-08 - Transformers What is fed into the first multi-head attention layer in a Transformer decoder?
The output of the Transformer decoder embedding.
51
# DL-08 - Transformers In the transformer's decoder, how is the first attention head different than the encoder's attention head?
It uses a look-ahead mask.
52
# DL-08 - Transformers In sequence models, what is the purpose of a look-ahead mask used in a decoder with multi-head attention?
To prevent the decoder from attending to (conditioning on) future tokens.
53
# DL-08 - Transformers How do you create a look-ahead mask?
(See image)
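
Since the mask image isn't reproduced here, a minimal sketch of how a look-ahead (causal) mask can be built and applied to the attention scores; the 4x4 size is illustrative:

```python
import torch

def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """True above the diagonal marks future positions that must not be attended to."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(4, 4)                                   # toy attention scores
masked = scores.masked_fill(look_ahead_mask(4), float("-inf"))
weights = torch.softmax(masked, dim=-1)                      # future tokens get ~0 weight
```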
54
# DL-08 - Transformers Where is the mask applied in a decoder?
(See image)
55
# DL-08 - Transformers What are the inputs to the decoder's 2nd attention head? (2)
- Key and value from the encoder output
- Query from the output of the 1st attention head
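
A sketch of the encoder-decoder attention wiring from the card above: keys and values are projected from the encoder output, queries from the output of the decoder's first (masked) attention head. Shapes and weights are illustrative stand-ins:

```python
import math
import torch

def encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v):
    Q = dec_x @ W_q                              # queries from the decoder's 1st attention output
    K, V = enc_out @ W_k, enc_out @ W_v          # keys and values from the encoder output
    w = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
    return w @ V

d_model, d_k = 512, 64
dec_x, enc_out = torch.randn(7, d_model), torch.randn(10, d_model)   # toy tensors
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
z = encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v)          # -> (7, d_k)
```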
56
# DL-08 - Transformers What is another way to think of the 2nd attention head in the decoder?
Encoder-decoder attention
57
# DL-08 - Transformers What happens to the output of the decoder block?
It's sent through a linear classifier, then a softmax activation. (See image)
58
# DL-08 - Transformers How do we interpret the output of a transformer?
It's a probability distribution over the words in your vocabulary. (We try to predict the next word.)
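
A tiny sketch of the final step from the two cards above: a linear classifier maps the decoder output to vocabulary logits, and softmax turns them into a probability distribution over the next word. Vocabulary size and shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000                      # toy sizes
decoder_out = torch.randn(7, d_model)                 # stand-in for the decoder block output
logits = nn.Linear(d_model, vocab_size)(decoder_out)  # linear classifier over the vocabulary
probs = torch.softmax(logits, dim=-1)                 # probability distribution per position
next_word_id = probs[-1].argmax()                     # greedy choice for the predicted next word
```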
59
# DL-08 - Transformers What is a stacked encoder/decoder?
Adding multiple layers of encoders/decoders to improve performance. (See image)
60
# DL-08 - Transformers What are some popular transformers mentioned in the lecture? (5)
- BERT
- OpenAI's GPT family
- Google Bard
- XLNet
- T5
61
# DL-08 - Transformers What is BERT short for?
Bidirectional Encoder Representations from Transformers
62
# DL-08 - Transformers What is GPT short for?
Generative Pretrained Transformer
63
# DL-08 - Transformers What is T5 short for? (TTTTT)
Text-To-Text Transfer Transformer
64
# DL-08 - Transformers When was BERT released?
2018
65
# DL-08 - Transformers When was the first GPT released?
2018
66
# DL-08 - Transformers When was XLNet released?
2019
67
# DL-08 - Transformers When was T5 released?
2020
68
# DL-08 - Transformers What are the two novel techniques used by BERT? (2)
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
69
# DL-08 - Transformers What is MLM short for?
Masked Language Model
70
# DL-08 - Transformers What is NSP short for?
Next Sentence Prediction
71
# DL-08 - Transformers What does BERT use to better determine context?
Bidirectional context (it attends to tokens on both sides).
72
# DL-08 - Transformers What are some tasks where BERT is useful? (3)
- Classification
- Fill in the blanks
- Question answering
73
# DL-08 - Transformers What variants of BERT are mentioned in the lecture slides? (4)
- RoBERTa
- ALBERT
- StructBERT
- DeBERTa
74
# DL-08 - Transformers What's special about RoBERTa?
A Robustly Optimized BERT Pretraining Approach
75
# DL-08 - Transformers What's special about ALBERT?
A Lite BERT for Self-supervised Learning of Language Representations
76
# DL-08 - Transformers What's special about StructBERT?
Incorporating Language Structures into Pre-training for Deep Language Understanding
77
# DL-08 - Transformers What objective was GPT trained with?
Predicting the next word in a sequence.
78
# DL-08 - Transformers How are GPTs trained?
Using RLHF (Reinforcement Learning from Human Feedback) on top of next-word pretraining.
79
# DL-08 - Transformers What is RLHF short for?
Reinforcement learning from human feedback
80
# DL-08 - Transformers How many layers does GPT-3 have?
96 layers
81
# DL-08 - Transformers How many attention heads per layer does GPT-3 have?
96 attention heads
82
# DL-08 - Transformers What is a vision transformer?
A transformer architecture applied to computer vision: images are split into patches and treated as token sequences.
83
# DL-08 - Transformers What is ViT short for?
Vision transformer
84
# DL-08 - Transformers Who first published vision transformers for ImageNet?
Dosovitskiy et al. from Google Brain
85
# DL-08 - Transformers When were vision transformers first published?
2020
86
# DL-08 - Transformers What is the architecture of vision transformers?
(See image)
87
# DL-08 - Transformers How are images preprocessed for use in a vision transformer?
The image is split into small patches (e.g. 16x16 pixels), which are flattened and treated as input tokens.
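
A minimal sketch of this patch preprocessing: split a (C, H, W) image into 16x16 patches and flatten each into a token vector. The 3x224x224 input size is illustrative:

```python
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split an image (C, H, W) into flattened patch tokens (num_patches, C*patch*patch)."""
    c, h, w = img.shape
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4)                     # (h/patch, w/patch, C, patch, patch)
    return x.reshape(-1, c * patch * patch)          # one row (token) per patch

tokens = patchify(torch.randn(3, 224, 224))          # -> (196, 768)
```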
88
# DL-08 - Transformers Label the masked parts of the image.
(See image)
89
# DL-08 - Transformers Label the masked parts of the image.
(See image)
90
# DL-08 - Transformers Label the masked parts of the image.
(See image)
91
# DL-08 - Transformers Label the masked parts of the image.
(See image)
92
# DL-08 - Transformers Label the masked parts of the image.
(See image)
93
# DL-08 - Transformers Label the masked parts of the image.
(See image)
94
# DL-08 - Transformers What's depicted in the image?
A ViT (Vision transformer).