DL-08 - Transformers Flashcards
DL-08 - Transformers
What are some problems with RNNs/LSTMs? (4)
- Difficult to train.
- Very long gradient paths.
- Transfer learning never really works.
- Recurrence prevents parallel computation.
DL-08 - Transformers
What is the name of the paper where transformers were introduced?
Attention is All You Need
DL-08 - Transformers
When was the transformers paper (Attention is All You Need) published?
2017
DL-08 - Transformers
Who were the authors of the transformers paper (Attention is All You Need)?
Vaswani et al.
DL-08 - Transformers
What do transformers use instead of recurrence? (2)
- Context windows (input more data at the same time)
- self-attention
DL-08 - Transformers
In what areas are transformers currently very good? (2)
- NLP
- Computer vision
DL-08 - Transformers
What does a transformer do? (2)
- Encodes the input into vector representations
- Decodes these representations back into an output
DL-08 - Transformers
Do transformers use recurrence?
No, they avoid it.
DL-08 - Transformers
Why can encoders be so fast?
No recurrence -> parallel computation.
DL-08 - Transformers
What are the main characteristics of transformers? (3)
- non-sequential
- self-attention
- positional encoding
DL-08 - Transformers
Describe what is meant when we say transformers are non-sequential.
Sentences are processed as a whole, rather than word by word.
DL-08 - Transformers
“Sentences are processed as a whole, rather than word by word.”
What is this property called?
Non-sequential.
DL-08 - Transformers
Describe self-attention.
A new unit used to compute similarity scores between words in a sentence.
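A minimal NumPy sketch of this idea, assuming scaled dot-product attention as in Attention is All You Need; the toy sizes, random weights, and the `self_attention` helper are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity of every word with every other word
    weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
    return weights @ V                         # each output is a weighted mix of all tokens

# Toy example: 4 "words", embedding size 8 (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```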
DL-08 - Transformers
Describe positional encoding.
Encodes information related to a position of a token in a sentence.
DL-08 - Transformers
“Encodes information related to a position of a token in a sentence.”
What is this called?
positional encoding
DL-08 - Transformers
“A new unit used to compute similarity scores between words in a sentence.”
What is this called?
Self-attention.
DL-08 - Transformers
What mechanism do transformers use to identify relevant words while processing the current word?
Self-attention.
DL-08 - Transformers
Why don’t transformers suffer from short-term memory?
Because they use self-attention mechanisms, allowing them to take the entire input sequence into account simultaneously.
DL-08 - Transformers
What parts does “encoder embedding” consist of? (2)
- Word/input embedding
- Positional embedding
DL-08 - Transformers
What does positional embedding do?
It injects positional information (distance between different words) into the input embeddings.
DL-08 - Transformers
What functions does the “Attention is all you need” paper use for positional encoding?
Sin/cos
DL-08 - Transformers
In the image, what is
- d_model
- i
- pos
(See image)
- d_model: Embedding size
- i: Index along the embedding dimension (selects the sin/cos frequency)
- pos: Position index in the input sequence.
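The image presumably shows the sinusoidal positional encoding from the paper, which uses these symbols:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```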
DL-08 - Transformers
How is positional information added to the embeddings?
They are added element-wise to the word/input embeddings.
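A minimal NumPy sketch of this step, assuming the sinusoidal encoding above; the sequence length, embedding size, and the `sinusoidal_positional_encoding` helper are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sin/cos positional encoding as in 'Attention is All You Need'."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1) position index
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension index
    angles = pos / np.power(10000, 2 * i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions get cos
    return pe

# Toy word/input embeddings for a 5-token sentence, d_model = 16 (illustrative).
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(5, 16))
enc_input = word_emb + sinusoidal_positional_encoding(5, 16)  # element-wise addition
print(enc_input.shape)  # (5, 16)
```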
DL-08 - Transformers
What are the sub-modules of the encoder? (2)
- Multi-headed attention
- Fully connected feed forward network
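A rough, self-contained sketch of how these two sub-modules could be chained; it uses a single attention head as a stand-in for multi-headed attention and omits the residual connections and layer normalization of the full encoder layer, so treat it as an illustration rather than a faithful implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_submodule(X, Wq, Wk, Wv):
    # Single-head stand-in for the multi-headed attention sub-module.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def feed_forward_submodule(X, W1, b1, W2, b2):
    # Fully connected feed-forward network: two linear layers with a ReLU in between.
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def encoder_layer(X, p):
    # Sub-module 1: attention; sub-module 2: feed-forward network.
    attn_out = attention_submodule(X, p["Wq"], p["Wk"], p["Wv"])
    return feed_forward_submodule(attn_out, p["W1"], p["b1"], p["W2"], p["b2"])

# Toy run: 4 tokens, d_model = 8, hidden size 32 (illustrative).
rng = np.random.default_rng(0)
d, h = 8, 32
params = {
    "Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
    "Wv": rng.normal(size=(d, d)),
    "W1": rng.normal(size=(d, h)), "b1": np.zeros(h),
    "W2": rng.normal(size=(h, d)), "b2": np.zeros(d),
}
print(encoder_layer(rng.normal(size=(4, d)), params).shape)  # (4, 8)
```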