DL-08 - Transformers Flashcards

1
Q

DL-08 - Transformers

What are some problems with RNNs/LSTMs? (4)

A
  • Difficult to train.
  • Very long gradient paths.
  • Transfer learning never really works.
  • Recurrence prevents parallel computation.
2
Q

DL-08 - Transformers

What is the name of the paper where transformers were introduced?

A

Attention is All You Need

3
Q

DL-08 - Transformers

When was the transformers paper (Attention is All You Need) published?

A

2017

4
Q

DL-08 - Transformers

Who were the authors of the transformers paper (Attention is All You Need)?

A

Vaswani et al.

5
Q

DL-08 - Transformers

What do transformers use instead of recurrence? (2)

A
  • Context windows (the whole input is fed in at once)
  • Self-attention
6
Q

DL-08 - Transformers

In what areas are transformers currently very good? (2)

A
  • NLP
  • Computer vision
7
Q

DL-08 - Transformers

What does a transformer do? (2)

A
  • Encodes the input into vector representations
  • Decodes those representations back into an output
8
Q

DL-08 - Transformers

Do transformers use recurrence?

A

No, they avoid it.

9
Q

DL-08 - Transformers

Why can encoders be so fast?

A

No recurrence -> parallel computation.

10
Q

DL-08 - Transformers

What are the main characteristics of transformers? (3)

A
  • non-sequential
  • self-attention
  • positional encoding
11
Q

DL-08 - Transformers

Describe what is meant when we say transformers are non-sequential.

A

Sentences are processed as a whole, rather than word by word.

12
Q

DL-08 - Transformers

“Sentences are processed as a whole, rather than word by word.”
What is this property called?

A

Non-sequential.

13
Q

DL-08 - Transformers

Describe self-attention.

A

A new unit used to compute similarity scores between words in a sentence.

14
Q

DL-08 - Transformers

Describe positional encoding.

A

Encodes information about the position of a token in a sentence.

15
Q

DL-08 - Transformers

“Encodes information about the position of a token in a sentence.”
What is this called?

A

positional encoding

16
Q

DL-08 - Transformers

“A new unit used to compute similarity scores between words in a sentence.”
What is this called?

A

Self-attention.

17
Q

DL-08 - Transformers

What mechanism do transformers use to focus on relevant words while processing the current word?

A

Self-attention.

18
Q

DL-08 - Transformers

Why don’t transformers suffer from short-term memory?

A

Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.

19
Q

DL-08 - Transformers

What parts does “encoder embedding” consist of?

A
  • Word/input embedding
  • Positional embedding
20
Q

DL-08 - Transformers

What does positional embedding do?

A

It injects positional information (distance between different words) into the input embeddings.

21
Q

DL-08 - Transformers

What functions does the “Attention is all you need” paper use for positional encoding?

A

Sin/cos

22
Q

DL-08 - Transformers

In the image, what is
- d_model
- i
- pos

(See image)

A
  • d_model: The embedding size
  • i: Index along the embedding dimension (each value of i gives one sine/cosine pair)
  • pos: Position index in the incoming sequence.
23
Q

DL-08 - Transformers

How is positional information added to the embeddings?

A

They’re added element-wise to the word/input embeddings.
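
A minimal PyTorch sketch of the sin/cos positional encoding from the cards above and its element-wise addition to the word embeddings. The sequence length (10) and d_model (512) are illustrative toy values, not from the lecture:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sin/cos positional encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # position index, (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # even embedding dimensions
    angle = pos / (10000.0 ** (two_i / d_model))                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cosine
    return pe

word_emb = torch.randn(10, 512)                                    # toy word/input embeddings
x = word_emb + sinusoidal_positional_encoding(10, 512)             # element-wise addition
```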

24
Q

DL-08 - Transformers

What are the sub-modules of the encoder? (2)

A
  • Multi-headed attention
  • Fully connected feed forward network
25
# DL-08 - Transformers What does each of the sub-modules have (both attention head and FC module)? (2)
- Residual connection
- Normalization layer
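
A minimal PyTorch sketch of one encoder block as described by the two cards above: multi-headed attention and a feed-forward network, each followed by a residual connection and a normalization layer. It leans on torch.nn.MultiheadAttention; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # multi-headed self-attention (Q = K = V = x)
        x = self.norm1(x + attn_out)         # residual connection + normalization
        x = self.norm2(x + self.ff(x))       # residual connection + normalization
        return x

out = EncoderBlock()(torch.randn(2, 10, 512))   # -> (2, 10, 512)
```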
26
# DL-08 - Transformers What are the vectors in self-attention called? (3)
- Query
- Key
- Value
27
# DL-08 - Transformers How are the query, key and value vectors created?
A separate weight matrix for each; each vector is simply the incoming embedding vector multiplied with the corresponding weight matrix. (See image)
28
# DL-08 - Transformers How is "score" calculated in self-attention?
By taking the dot product of the Q and K vectors.
29
# DL-08 - Transformers What do you get when you take the dot product of the Q and K vectors?
The "score".
30
# DL-08 - Transformers How do you ensure stable gradients?
Scale down the "scores" by dividing them by √d_k.
31
# DL-08 - Transformers What is the formula for scaling the scores?
score / √d_k, i.e. divide each score by the square root of the key dimension. (See image)
32
# DL-08 - Transformers How do you normalize the scores?
Use softmax to produce attention weights (probabilities between 0 and 1).
33
# DL-08 - Transformers How do you calculate the final attention weights?
Use softmax to normalize the scaled scores, producing attention weights between 0 and 1.
34
# DL-08 - Transformers How do you get the output vector of a self-attention unit?
Calculate the self-attention weights, then multiply them by the value vectors.
35
# DL-08 - Transformers How do you write the self-attention block as a single matrix operation?
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V (See image)
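
A sketch of the cards above as matrix operations in PyTorch: project the embeddings to Q, K, V, take the dot-product scores, scale by √d_k, apply softmax, then weight the value vectors. The weight matrices here are random stand-ins, not trained parameters:

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """One self-attention head. x: (seq_len, d_model); W_*: (d_model, d_k)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # query, key, value vectors
    scores = Q @ K.T / math.sqrt(K.shape[-1])      # dot-product scores, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)        # attention weights between 0 and 1
    return weights @ V                             # weighted sum of value vectors

d_model, d_k, seq_len = 512, 64, 10                # toy sizes
x = torch.randn(seq_len, d_model)                  # incoming embedding vectors
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
z = self_attention(x, W_q, W_k, W_v)               # -> (seq_len, d_k)
```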
36
# DL-08 - Transformers What is depicted in the image? (See image)
Most of the self-attention block as a matrix operation.
37
# DL-08 - Transformers What is Multi-headed attention?
A block that uses N different self-attentions (called heads) with different Q, K, V to produce outputs Z_1, Z_2, Z_3, ..., Z_n.
38
# DL-08 - Transformers What is it called when you use multiple self-attention blocks in the same layer?
Multi-headed attention
39
# DL-08 - Transformers What is a self-attention block called?
A head.
40
# DL-08 - Transformers What is a head?
One self-attention block.
41
# DL-08 - Transformers What does multi-headed attention do for the layer?
It allows the layer to have multiple representation subspaces.
42
# DL-08 - Transformers What is done to the outputs of the individual self-attention blocks, to make them outputs of a multi-headed attention block?
They're concatenated into a single matrix and multiplied with a weight matrix W^O.
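
A self-contained sketch of multi-headed attention as described above: N independent heads produce Z_1 ... Z_N, which are concatenated and multiplied with W^O. Head count and sizes are illustrative, and the weight matrices are random stand-ins:

```python
import math
import torch

def multi_head_attention(x, heads, W_o):
    """x: (seq_len, d_model); heads: list of (W_q, W_k, W_v); W_o: (n_heads*d_k, d_model)."""
    zs = []
    for W_q, W_k, W_v in heads:                                   # one self-attention per head
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        w = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
        zs.append(w @ V)                                          # Z_1 ... Z_N
    return torch.cat(zs, dim=-1) @ W_o                            # concatenate, project with W^O

d_model, d_k, n_heads = 512, 64, 8                                # toy sizes
x = torch.randn(10, d_model)
heads = [tuple(torch.randn(d_model, d_k) for _ in range(3)) for _ in range(n_heads)]
W_o = torch.randn(n_heads * d_k, d_model)
z = multi_head_attention(x, heads, W_o)                           # -> (10, d_model)
```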
43
# DL-08 - Transformers What is in the image? (See image)
A transformer encoder block.
44
# DL-08 - Transformers Label the masked parts of the image. (See image)
(See image)
45
# DL-08 - Transformers What is in the image? (See image)
A transformer decoder block
46
# DL-08 - Transformers Label the masked parts of the image. (See image)
(See image)
47
# DL-08 - Transformers What happens to the outputs of a transformer decoder block?
(See image)
48
# DL-08 - Transformers What sub-layers does the decoder have? (3)
- 2 multi-headed attention layers
- a feed-forward layer
- residual connections and normalization layers after each sub-layer
49
# DL-08 - Transformers What is the decoder embedding comprised of? (2)
- Output word embedding
- Positional embedding
50
# DL-08 - Transformers What is fed into the first multi-head attention layer in a Transformer decoder?
The output of the Transformer decoder embedding.
51
# DL-08 - Transformers In the transformer's decoder, how is the first attention head different than the encoder's attention head?
It uses a look-ahead mask.
52
# DL-08 - Transformers In sequence models, what is the purpose of a look-ahead mask used in a decoder with multi-head attention?
To prevent the decoder from attending to (conditioning on) future tokens.
53
# DL-08 - Transformers How do you create a look-ahead mask?
(See image)
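
Since the mask image isn't reproduced here, a minimal sketch of how a look-ahead (causal) mask can be built and applied to the attention scores; the 4x4 size is illustrative:

```python
import torch

def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """True above the diagonal marks future positions that must not be attended to."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(4, 4)                                   # toy attention scores
masked = scores.masked_fill(look_ahead_mask(4), float("-inf"))
weights = torch.softmax(masked, dim=-1)                      # future tokens get ~0 weight
```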
54
# DL-08 - Transformers Where is the mask applied in a decoder?
(See image)
55
# DL-08 - Transformers What are the inputs to the decoder's 2nd attention head? (2)
- Key and value from the encoder output
- Query from the output of the 1st attention head
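
A sketch of the encoder-decoder attention wiring from the card above: keys and values are projected from the encoder output, queries from the output of the decoder's first (masked) attention head. Shapes and weights are illustrative stand-ins:

```python
import math
import torch

def encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v):
    Q = dec_x @ W_q                              # queries from the decoder's 1st attention output
    K, V = enc_out @ W_k, enc_out @ W_v          # keys and values from the encoder output
    w = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
    return w @ V

d_model, d_k = 512, 64
dec_x, enc_out = torch.randn(7, d_model), torch.randn(10, d_model)   # toy tensors
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
z = encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v)          # -> (7, d_k)
```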
56
# DL-08 - Transformers What is another way to think of the 2nd attention head in the decoder?
Encoder-decoder attention
57
# DL-08 - Transformers What happens to the output of the decoder block?
It's sent through a linear classifier, then a softmax activation. (See image)
58
# DL-08 - Transformers How do we interpret the output of a transformer?
It's a probability distribution over the words in your vocabulary. (We try to predict the next word.)
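
A tiny sketch of the final step from the two cards above: a linear classifier maps the decoder output to vocabulary logits, and softmax turns them into a probability distribution over the next word. Vocabulary size and shapes are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000                      # toy sizes
decoder_out = torch.randn(7, d_model)                 # stand-in for the decoder block output
logits = nn.Linear(d_model, vocab_size)(decoder_out)  # linear classifier over the vocabulary
probs = torch.softmax(logits, dim=-1)                 # probability distribution per position
next_word_id = probs[-1].argmax()                     # greedy choice for the predicted next word
```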
59
# DL-08 - Transformers What is a stacked encoder/decoder?
Adding multiple layers of encoders/decoders to improve performance. (See image)
60
# DL-08 - Transformers What are some popular transformers mentioned in the lecture? (5)
- BERT
- OpenAI's GPT family
- Google Bard
- XLNet
- T5
61
# DL-08 - Transformers What is BERT short for?
Bidirectional Encoder Representations from Transformers
62
# DL-08 - Transformers What is GPT short for?
Generative Pretrained Transformer
63
# DL-08 - Transformers What is T5 short for? (TTTTT)
Text-To-Text Transfer Transformer
64
# DL-08 - Transformers When was BERT released?
2018
65
# DL-08 - Transformers When was the first GPT released?
2018
66
# DL-08 - Transformers When was XLNet released?
2019
67
# DL-08 - Transformers When was T5 released?
2020
68
# DL-08 - Transformers What are the two novel techniques used by BERT? (2)
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
69
# DL-08 - Transformers What is MLM short for?
Masked Language Model
70
# DL-08 - Transformers What is NSP short for?
Next Sentence Prediction
71
# DL-08 - Transformers What does BERT use to better determine context?
Bidirectional context (it attends to tokens on both sides).
72
# DL-08 - Transformers What are some tasks where BERT is useful? (3)
- Classification
- Fill in the blanks
- Question answering
73
# DL-08 - Transformers What variants of BERT are mentioned in the lecture slides? (4)
- RoBERTa
- ALBERT
- StructBERT
- DeBERTa
74
# DL-08 - Transformers What's special about RoBERTa?
A Robustly Optimized BERT Pretraining Approach
75
# DL-08 - Transformers What's special about ALBERT?
A Lite BERT for Self-supervised Learning of Language Representations
76
# DL-08 - Transformers What's special about StructBERT?
Incorporating Language Structures into Pre-training for Deep Language Understanding
77
# DL-08 - Transformers What objective was GPT trained with?
Predicting the next word in a sequence.
78
# DL-08 - Transformers How are GPTs trained?
Using RLHF (Reinforcement Learning from Human Feedback) on top of next-word pretraining.
79
# DL-08 - Transformers What is RLHF short for?
Reinforcement learning from human feedback
80
# DL-08 - Transformers How many layers does GPT-3 have?
96 layers
81
# DL-08 - Transformers How many attention heads per layer does GPT-3 have?
96 attention heads
82
# DL-08 - Transformers What is a vision transformer?
A transformer architecture applied to computer vision: images are split into patches and treated as token sequences.
83
# DL-08 - Transformers What is ViT short for?
Vision transformer
84
# DL-08 - Transformers Who first published vision transformers for ImageNet?
Dosovitskiy et al. from Google Brain
85
# DL-08 - Transformers When were vision transformers first published?
2020
86
# DL-08 - Transformers What is the architecture of vision transformers?
(See image)
87
# DL-08 - Transformers How are images preprocessed for use in a vision transformer?
The image is split into small patches (e.g. 16x16 pixels), which are flattened and treated as input tokens.
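
A minimal sketch of this patch preprocessing: split a (C, H, W) image into 16x16 patches and flatten each into a token vector. The 3x224x224 input size is illustrative:

```python
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split an image (C, H, W) into flattened patch tokens (num_patches, C*patch*patch)."""
    c, h, w = img.shape
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4)                     # (h/patch, w/patch, C, patch, patch)
    return x.reshape(-1, c * patch * patch)          # one row (token) per patch

tokens = patchify(torch.randn(3, 224, 224))          # -> (196, 768)
```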
88
# DL-08 - Transformers Label the masked parts of the image.
(See image)
89
# DL-08 - Transformers Label the masked parts of the image.
(See image)
90
# DL-08 - Transformers Label the masked parts of the image.
(See image)
91
# DL-08 - Transformers Label the masked parts of the image.
(See image)
92
# DL-08 - Transformers Label the masked parts of the image.
(See image)
93
# DL-08 - Transformers Label the masked parts of the image.
(See image)
94
# DL-08 - Transformers What's depicted in the image?
A ViT (Vision transformer).