Transformer Flashcards
(14 cards)
What limitations of RNNs and LSTMs motivated the Transformer architecture?
They are slow to train because their sequential computation cannot be parallelised across time steps, and they struggle to capture long-range dependencies because of vanishing gradients.
What are the key components of the full Transformer architecture mentioned?
Layer Normalisation, Multi-Head Attention, Masked Multi-Head Attention, and Positional Encoding.
Define self-attention in the context of Transformers.
Attention where queries, keys, and values all come from the same sequence, computing dependencies within the input.
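A minimal NumPy sketch of scaled dot-product self-attention to accompany this card; the function name, weight-matrix names (Wq, Wk, Wv), and toy shapes are illustrative assumptions, not from the cards.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: queries, keys, and values all come from X."""
    Q = X @ Wq                                        # queries (seq_len, d_k)
    K = X @ Wk                                        # keys    (seq_len, d_k)
    V = X @ Wv                                        # values  (seq_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ V                                # each output is a weighted sum of values

# Toy example: 4 tokens with model dimension 8 (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (4, 8)
```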
What is cross-attention?
Attention where queries come from one set (e.g., decoder inputs) and keys/values from another set (e.g., encoder outputs).
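A sketch of how cross-attention differs from the self-attention example above: only the source of the keys and values changes. Argument names (X_dec, X_enc) are assumptions chosen for clarity.

```python
import numpy as np

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Cross-attention: queries from the decoder sequence, keys/values from encoder outputs."""
    Q = X_dec @ Wq                                    # one query per decoder position
    K = X_enc @ Wk                                    # keys from the encoder sequence
    V = X_enc @ Wv                                    # values from the encoder sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # shape (dec_len, enc_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # one output row per decoder position
```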
How are query, key, and value vectors obtained in self-attention?
By multiplying each input vector x by three learned parameter matrices (W^Q, W^K, W^V), producing that token's query, key, and value vectors.
What advantage does multi-head attention provide?
Captures different types of relationships in parallel by running multiple self-attention operations (heads) simultaneously.
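A NumPy sketch of multi-head attention under the usual convention that the model dimension is split evenly across heads; the matrix names and the output projection Wo are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Run num_heads attention heads in parallel and concatenate their outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project once, then split the feature dimension into (num_heads, d_head).
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                           # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                    # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4)   # shape (6, 16)
```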
What is the purpose of layer normalization in Transformers?
It stabilises and accelerates training by normalising inputs across features, independent of batch size.
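A small sketch of layer normalisation, assuming the standard formulation with a learned scale (gamma) and shift (beta); note the statistics are taken over the feature dimension, not the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each position's feature vector to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)   # per-token mean over features
    var = x.var(axis=-1, keepdims=True)     # per-token variance over features
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])        # one token with 4 features
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
```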
Why is masked multi-head attention used in the decoder?
To enforce the autoregressive property by preventing each position from attending to future tokens during training.
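A sketch of the causal mask that implements this, assuming the same toy self-attention setup as above: positions above the diagonal of the score matrix are set to negative infinity so the softmax gives them zero weight.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal (masked) self-attention: position i may only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                 # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)   # row i depends only on tokens 0..i
```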
What does permutation equivariance mean for self-attention?
Self-attention treats all positions equally, so permuting inputs results in correspondingly permuted outputs.
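A quick check of this property, using a stripped-down attention function (a single shared projection, chosen only to keep the example short): permuting the inputs first and permuting the outputs afterwards give the same result.

```python
import numpy as np

def attention(X, W):
    """Simplified self-attention with one shared projection, just to illustrate equivariance."""
    Q = K = V = X @ W
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))
perm = rng.permutation(5)

out_then_perm = attention(X, W)[perm]      # permute the outputs
perm_then_out = attention(X[perm], W)      # permute the inputs first
print(np.allclose(out_then_perm, perm_then_out))   # True: same rows, reordered
```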
Why must we break permutation equivariance in Transformers for language tasks?
Because the meaning of language depends on word order, so positional information must be encoded explicitly.
How is positional information added to Transformer inputs?
By adding learned or fixed positional encodings to the input embeddings before attention layers.
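A sketch of the fixed sinusoidal variant of positional encoding (even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies); the function name and toy embedding shapes are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos positional encodings; assumes an even d_model."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 16))   # hypothetical token embeddings
x = embeddings + sinusoidal_positional_encoding(10, 16)        # added before the attention layers
```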
How do Transformers compare to RNNs in sequence tasks?
They allow parallel processing, avoid vanishing gradients, and often achieve higher accuracy with lower training time.
List two advantages of Transformers over RNNs.
Support for parallel training and the ability to capture long-range dependencies without recurrence.
Give one example of a successful application of Transformers.
Language translation, image captioning, or other large-scale sequence modeling tasks.