Transformer Flashcards
(14 cards)
What limitations of RNNs and LSTMs motivated the Transformer architecture?
They are slow to train because their sequential computation cannot be parallelised across time steps, and they struggle to capture long-range dependencies because of vanishing gradients.
What are the key components of the full Transformer architecture mentioned?
Layer Normalisation, Multi-Head Attention, Masked Multi-Head Attention, and Positional Encoding.
Define self-attention in the context of Transformers.
Attention where queries, keys, and values all come from the same sequence, computing dependencies within the input.
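A minimal NumPy sketch of scaled dot-product self-attention to accompany this card; the function name, weight-matrix names (Wq, Wk, Wv), and toy shapes are illustrative assumptions, not from the cards.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: queries, keys, and values all come from X."""
    Q = X @ Wq                                        # queries (seq_len, d_k)
    K = X @ Wk                                        # keys    (seq_len, d_k)
    V = X @ Wv                                        # values  (seq_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ V                                # each output is a weighted sum of values

# Toy example: 4 tokens with model dimension 8 (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (4, 8)
```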
What is cross-attention?
Attention where queries come from one set (e.g., decoder inputs) and keys/values from another set (e.g., encoder outputs).
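A sketch of how cross-attention differs from the self-attention example above: only the source of the keys and values changes. Argument names (X_dec, X_enc) are assumptions chosen for clarity.

```python
import numpy as np

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Cross-attention: queries from the decoder sequence, keys/values from encoder outputs."""
    Q = X_dec @ Wq                                    # one query per decoder position
    K = X_enc @ Wk                                    # keys from the encoder sequence
    V = X_enc @ Wv                                    # values from the encoder sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # shape (dec_len, enc_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # one output row per decoder position
```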
How are query, key, and value vectors obtained in self-attention?
By multiplying each input vector x by three learned parameter matrices (W^Q, W^K, W^V), producing that token's query, key, and value vectors.
What advantage does multi-head attention provide?
Captures different types of relationships in parallel by running multiple self-attention operations (heads) simultaneously.
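A NumPy sketch of multi-head attention under the usual convention that the model dimension is split evenly across heads; the matrix names and the output projection Wo are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Run num_heads attention heads in parallel and concatenate their outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project once, then split the feature dimension into (num_heads, d_head).
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                           # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                    # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4)   # shape (6, 16)
```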
What is the purpose of layer normalization in Transformers?
It stabilises and accelerates training by normalising inputs across features, independent of batch size.
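A small sketch of layer normalisation, assuming the standard formulation with a learned scale (gamma) and shift (beta); note the statistics are taken over the feature dimension, not the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each position's feature vector to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)   # per-token mean over features
    var = x.var(axis=-1, keepdims=True)     # per-token variance over features
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])        # one token with 4 features
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
```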
Why is masked multi-head attention used in the decoder?
To enforce the autoregressive property by preventing each position from attending to future tokens during training.
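A sketch of the causal mask that implements this, assuming the same toy self-attention setup as above: positions above the diagonal of the score matrix are set to negative infinity so the softmax gives them zero weight.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal (masked) self-attention: position i may only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                 # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)   # row i depends only on tokens 0..i
```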
What does permutation equivariance mean for self-attention?
Self-attention treats all positions equally, so permuting inputs results in correspondingly permuted outputs.
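A quick check of this property, using a stripped-down attention function (a single shared projection, chosen only to keep the example short): permuting the inputs first and permuting the outputs afterwards give the same result.

```python
import numpy as np

def attention(X, W):
    """Simplified self-attention with one shared projection, just to illustrate equivariance."""
    Q = K = V = X @ W
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))
perm = rng.permutation(5)

out_then_perm = attention(X, W)[perm]      # permute the outputs
perm_then_out = attention(X[perm], W)      # permute the inputs first
print(np.allclose(out_then_perm, perm_then_out))   # True: same rows, reordered
```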
Why must we break permutation equivariance in Transformers for language tasks?
Because the meaning of language depends on word order, so positional information must be encoded explicitly.
How is positional information added to Transformer inputs?
By adding learned or fixed positional encodings to the input embeddings before attention layers.
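A sketch of the fixed sinusoidal variant of positional encoding (even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies); the function name and toy embedding shapes are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos positional encodings; assumes an even d_model."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 16))   # hypothetical token embeddings
x = embeddings + sinusoidal_positional_encoding(10, 16)        # added before the attention layers
```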
How do Transformers compare to RNNs in sequence tasks?
They allow parallel processing, avoid vanishing gradients, and often achieve higher accuracy with lower training time.
List two advantages of Transformers over RNNs.
Support for parallel training and the ability to capture long-range dependencies without recurrence.
Give one example of a successful application of Transformers.
Language translation, image captioning, or other large-scale sequence modeling tasks.