Transformer Flashcards

(14 cards)

1
Q

What limitations of RNNs and LSTMs motivated the Transformer architecture?

A

They are slow to train, cannot be parallelised across time steps (each step depends on the previous hidden state), and struggle with long-range dependencies because of vanishing gradients.

2
Q

What are the key components of the full Transformer architecture mentioned?

A

Layer Normalisation, Multi-Head Attention, Masked Multi-Head Attention, and Positional Encoding.

3
Q

Define self-attention in the context of Transformers.

A

Attention where queries, keys, and values all come from the same sequence, computing dependencies within the input.
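
A minimal NumPy sketch of scaled dot-product self-attention; the names X, Wq, Wk, Wv and the shapes are illustrative assumptions, not from the cards:

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis (the key positions).
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) learned projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values from the SAME sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise compatibility between positions
    return softmax(scores) @ V                # each output is a weighted mix of all positions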

4
Q

What is cross-attention?

A

Attention where queries come from one set (e.g., decoder inputs) and keys/values from another set (e.g., encoder outputs).
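
For contrast, a hedged sketch of cross-attention in the same style as the self-attention sketch above; only the source of the queries versus the keys/values changes (decoder_X and encoder_X are assumed names):

import numpy as np

def cross_attention(decoder_X, encoder_X, Wq, Wk, Wv):
    # Queries come from the decoder sequence; keys and values come from the encoder outputs.
    Q = decoder_X @ Wq                             # (dec_len, d_k)
    K, V = encoder_X @ Wk, encoder_X @ Wv          # (enc_len, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (dec_len, enc_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # one output per decoder position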

5
Q

How are query, key, and value vectors obtained in self-attention?

A

By multiplying each input vector x by three learned projection matrices (often written W_Q, W_K, and W_V), producing that token's query, key, and value vectors.
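
A sketch with arbitrarily chosen dimensions (d_model, d_k and the random initialisation are illustrative assumptions):

import numpy as np

d_model, d_k = 8, 4
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_model, d_k))   # learned query projection
W_K = rng.normal(size=(d_model, d_k))   # learned key projection
W_V = rng.normal(size=(d_model, d_k))   # learned value projection

x = rng.normal(size=(d_model,))         # one input (token) embedding
q, k, v = x @ W_Q, x @ W_K, x @ W_V     # that token's query, key, and value vectors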

6
Q

What advantage does multi-head attention provide?

A

It captures different types of relationships in parallel by running several self-attention operations (heads) simultaneously, each with its own learned projections.
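
A rough sketch of the head splitting and merging that makes this possible, assuming d_model is divisible by num_heads (each head would then run its own attention on its slice):

import numpy as np

def split_heads(X, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_model // num_heads):
    # each head gets its own lower-dimensional slice and attends independently.
    seq_len, d_model = X.shape
    return X.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

def merge_heads(H):
    # Inverse of split_heads: concatenate the per-head outputs back to d_model.
    num_heads, seq_len, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)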

7
Q

What is the purpose of layer normalization in Transformers?

A

It stabilises and accelerates training by normalising each token's activations across the feature dimension, independently of batch size.
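
A minimal sketch of layer normalisation over the feature (last) axis; gamma and beta are the learned scale and shift:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise each token's vector across its features, not across the batch,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta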

8
Q

Why is masked multi-head attention used in the decoder?

A

To enforce the autoregressive property by preventing each position from attending to future tokens during training.
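
A hedged sketch of the causal mask: adding it to the attention scores before the softmax drives the weights on future positions to (approximately) zero.

import numpy as np

def causal_mask(seq_len):
    # Strictly upper-triangular -inf: position i may attend only to positions j <= i.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Illustrative usage inside attention:
#   scores = Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len)
#   weights = softmax(scores)   # masked (future) positions get weight ~0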

9
Q

What does permutation equivariance mean for self-attention?

A

Self-attention treats all positions equally, so permuting inputs results in correspondingly permuted outputs.
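
A small NumPy check of this property, using identity projections (Q = K = V = X) purely for brevity: permuting the rows of the input permutes the rows of the output.

import numpy as np

def attn(X):
    # Self-attention with identity projections, for illustration only.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 tokens, d_model = 8
P = np.eye(5)[rng.permutation(5)]      # a random permutation matrix

print(np.allclose(attn(P @ X), P @ attn(X)))   # True: permuted input -> permuted output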

10
Q

Why must we break permutation equivariance in Transformers for language tasks?

A

Because the meaning of language depends on word order, position information must be explicitly encoded.

11
Q

How is positional information added to Transformer inputs?

A

By adding learned or fixed positional encodings to the input embeddings before attention layers.
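
As one concrete fixed option, a sketch of the sinusoidal encodings from "Attention Is All You Need", assuming an even d_model:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Sine on even feature indices, cosine on odd ones; wavelengths form a geometric series.
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Illustrative usage: X = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)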

12
Q

How do Transformers compare to RNNs in sequence tasks?

A

They allow parallel processing, avoid vanishing gradients, and often achieve higher accuracy with lower training time.

13
Q

List two advantages of Transformers over RNNs.

A

Support for parallel training and the ability to capture long-range dependencies without recurrence.

14
Q

Give one example of a successful application of Transformers.

A

Language translation, image captioning, or other large-scale sequence modeling tasks.
