Topic 7: Advanced Sequence Learning with Transformers Flashcards
(26 cards)
What is attention?
We want to add a mechanism that preserves memory across the input-output link, because in sequence learning the memory of earlier inputs decays over time.
Attention mechanisms are inspired by the ability of humans (and other animals) to selectively pay more attention to salient details and ignore details that are less important in the moment. Having access to all information but focusing on only the most relevant information helps to ensure that no meaningful details are lost while enabling efficient use of limited memory and time.
What are some attention mechanisms/functions?
Self-Attention Mechanism:
- Commonly used in tasks involving sequences, such as natural language processing
- allows the model to weigh the importance of each element in the sequence with respect to all the other elements
- The Transformer model, for instance, relies heavily on self-attention.
Scaled Dot-Product Attention:
- key component of the Transformer architecture
- calculates attention scores by taking the dot product of a query vector and the keys, followed by scaling and applying a softmax function.
- This type of attention mechanism is highly efficient and has contributed to the success of Transformers in various applications (a minimal sketch follows this list).
Multi-Head Attention:
- extends the idea of attention by allowing the model to focus on different parts of the input simultaneously
- achieves this by using multiple sets of learnable parameters, each generating different attention scores
- this technique enhances the model’s ability to capture complex relationships within the data.
Location-Based Attention:
- often used in image-related tasks. It assigns attention scores based on the spatial location of elements in the input
- This can be particularly useful for tasks like object detection and image captioning.
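The scaled dot-product attention described above can be written in a few lines. A minimal NumPy sketch, assuming a single unbatched sequence and small illustrative dimensions (not an optimised implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of each query with each key
    weights = softmax(scores, axis=-1)     # attention weights sum to 1 per query
    return weights @ V, weights            # weighted sum of the values

# toy example: 3 tokens, d_k = d_v = 4; Q = K = V = X gives self-attention
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
print(w.round(2))                          # each row sums to 1
```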
What are transformers?
Instead of RNNs (which process sequences step-by-step), Transformers use self-attention and positional encoding to process the entire sequence in parallel.
Instead of using an RNN, we use a windowed input-output mapping.
Simplified transformer: add/concatenate the positional encoding with the attention output, then transform the result with a feed-forward activation.
The transformer block
- The signal is sequentially modified by transformer blocks
- Implication: focus on the computations done by the block components
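A minimal sketch of the computations inside one block, assuming a single attention head, small illustrative dimensions, and plain post-layer-norm; real blocks use multi-head attention and learned layer-norm parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """x: (seq_len, d_model); single-head self-attention for brevity."""
    # 1) self-attention sub-layer + residual connection + layer norm
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn @ Wo)
    # 2) position-wise feed-forward sub-layer + residual + layer norm
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU MLP applied to each position
    return layer_norm(x + ff)

# toy shapes: d_model = 8, d_ff = 16, seq_len = 5
rng = np.random.default_rng(1)
d, f = 8, 16
params = [rng.normal(scale=0.1, size=s) for s in
          [(d, d), (d, d), (d, d), (d, d), (d, f), (f,), (f, d), (d,)]]
y = transformer_block(rng.normal(size=(5, d)), *params)
print(y.shape)   # (5, 8): same shape in and out, so blocks can be stacked
```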
What is the “Attention is all you need” paper about?
It is the 2017 paper by Vaswani et al. that introduced the Transformer architecture, showing that self-attention alone (without recurrence or convolution) is sufficient for sequence-to-sequence tasks such as machine translation.
What is the alignment problem?
The alignment problem revolves around the challenge of aligning the objectives and decision-making capabilities of machine learning models with human values. While a machine learning model may be designed to optimize certain objectives, it can sometimes diverge from human values and produce undesirable outcomes.
In sequence learning, we also want the output to be aligned with the input (ideally the same amount of output as input), which is difficult when the two differ in length.
Describe Neural Machine Translation (NMT)
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
Unaligned sequence-to-sequence: the input and output are not of the same length. We also add a context vector, which works as a bottleneck (a middle layer between encoder and decoder).
- Classic seq-to-seq models use a single context vector (c) to represent the input; this creates a bottleneck and makes it hard to align input and output tokens.
- The attention mechanism solves this by letting the decoder focus on different parts of the input at each step, improving translation and alignment.
- The alignment matrix shows how attention links input words to output words.
- Even with attention, alignment is still a challenge: it is learned, not guaranteed to be accurate.
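A small sketch of the idea above, assuming simple dot-product alignment scores (Bahdanau-style attention uses a small learned network instead): at every decoder step a fresh context vector is computed as a weighted sum of all encoder states, and the weights form one row of the alignment matrix.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (n_inputs, d)."""
    scores = encoder_states @ decoder_state   # one alignment score per input token
    alpha = softmax(scores)                   # one row of the alignment matrix
    context = alpha @ encoder_states          # weighted sum, recomputed every step
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))    # 6 input tokens, hidden size 8
dec = rng.normal(size=8)         # current decoder hidden state
ctx, alpha = attention_context(dec, enc)
print(alpha.round(2))            # which input tokens this output step attends to
```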
What is an encoder and a decoder?
Encoder: a function or system that takes raw input data and transforms it into a compact, meaningful representation. Think of the encoder as summarizing or compressing the input.
- Input: Data in its original form (e.g., text, image, audio, etc.)
- Output: A latent or compressed representation that captures essential information
- Often used to reduce dimensionality, extract features, or remove noise
Decoder: the inverse of the encoder: it takes that encoded representation and tries to reconstruct the original data or generate new data based on it. Think of the decoder as rebuilding or generating from a compressed summary.
- Input: The latent code or embedding from the encoder
- Output: The predicted or reconstructed version of the original data
- Used to recover, translate, or generate new forms of the original input
Encoders and decoders force a model to learn useful internal representations:
This is key for tasks like translation, summarization, generation, compression, denoising, and embedding learning.
The encoder-decoder framework is the backbone of models like Autoencoders, Transformers, Seq2Seq, BERT, GPT, and more.
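A minimal autoencoder-style sketch of the encoder/decoder pairing in PyTorch; the layer sizes and reconstruction objective are illustrative assumptions, not a specific model from the lecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
    def forward(self, x):              # input -> compact latent representation
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, d_latent=32, d_out=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_out))
    def forward(self, z):              # latent code -> reconstruction
        return self.net(z)

enc, dec = Encoder(), Decoder()
x = torch.randn(16, 784)               # a batch of flattened inputs
z = enc(x)                             # (16, 32) latent codes
x_hat = dec(z)                         # (16, 784) reconstructions
loss = nn.functional.mse_loss(x_hat, x)   # train the pair to reconstruct the input
```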
Describe multi-head attention
In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters N-ways and passes each split independently through a separate Head. All of these similar Attention calculations are then combined together to produce a final Attention score. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.
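A NumPy sketch of the split-heads idea described above, assuming a single unbatched sequence and plain slicing of Q, K, V into heads; real implementations use reshapes and batched matrix multiplies, but the computation is the same:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); d_model must be divisible by n_heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):                        # each head attends independently
        s = slice(h * d_head, (h + 1) * d_head)     # its own slice of Q/K/V
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo      # combine heads, project back

rng = np.random.default_rng(0)
d = 8
W = [rng.normal(scale=0.3, size=(d, d)) for _ in range(4)]
out = multi_head_attention(rng.normal(size=(5, d)), *W, n_heads=2)
print(out.shape)   # (5, 8)
```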
Describe the training of a transformer
Encoder: processes the input sentence (e.g., “Thinking Machines”) in multiple layers:
- Adds positional encoding so the model knows word order.
- Each layer has:
- Self-Attention: lets each word focus on other words.
- Feed Forward: processes and transforms features.
- Add & Normalize: stabilizes training.
Decoder: generates the output sentence one word at a time:
- Each layer has:
- Masked Self-Attention: only looks at earlier words in the output (to preserve causality).
- Encoder-Decoder Attention: looks at encoder output to “translate” from input to output.
- Feed Forward and Add & Normalize like in encoder.
Transformers are trained to map input sequences to output sequences using an encoder-decoder architecture. The encoder processes the input to produce context-rich representations, and the decoder generates outputs one token at a time, attending to both the previously generated tokens and the encoder’s output. During training, teacher forcing is used, feeding the correct previous token rather than the model’s own output. The model learns by minimizing the cross-entropy loss between its predictions and the true output tokens. Transformers rely on self-attention, positional encoding, and feed-forward layers, and thanks to their fully parallel structure, they can be trained efficiently on large datasets using gradient descent.
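A hedged PyTorch sketch of one training step with teacher forcing, using torch.nn.Transformer as a stand-in for the encoder-decoder described above; the vocabulary size, model size, and random token batches are illustrative, and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn

vocab, d_model = 1000, 64                      # illustrative sizes
emb = nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab)
opt = torch.optim.Adam(list(emb.parameters()) + list(model.parameters())
                       + list(to_vocab.parameters()), lr=1e-4)

src = torch.randint(0, vocab, (8, 12))         # batch of source sequences
tgt = torch.randint(0, vocab, (8, 10))         # batch of gold target sequences

# teacher forcing: the decoder sees the gold tokens shifted right,
# and is trained to predict the next gold token at every position
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal = model.generate_square_subsequent_mask(tgt_in.size(1))  # masked self-attention

hidden = model(emb(src), emb(tgt_in), tgt_mask=causal)   # (8, 9, d_model)
logits = to_vocab(hidden)                                # (8, 9, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tgt_out.reshape(-1))
loss.backward()
opt.step()
```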
Explain how we move towards LLMs
LLMs capture rich, contextual representations of language, generalize across tasks, and perform well even with minimal task-specific data — all thanks to the Transformer’s scalability and unsupervised learning paradigm.
What are the limitations of shallow embeddings?
word2vec and GloVe are shallow and context-free: they assign the same representation to a word, e.g. “bank”, whether it is used in “bank account” or “bank of a river”.
- shallow embeddings are static (they don’t change depending on the context)
- embedding for a word doesn’t reflect how its meaning changes in context
What is attention contributing compared to context in RNNs?
Attention replaces the fixed context bottleneck with a flexible, token-wise focus mechanism. Instead of summarizing the input into one vector, it attends differently for each output step, enabling better performance, especially for longer sequences.
| Feature | Classic RNN context | Attention mechanism |
|---|---|---|
| Context representation | Single fixed-size vector (c) from the final encoder state | Dynamic, weighted sum of all encoder states |
| Memory capacity | Limited, bottlenecked | Larger, the full input sequence is available |
| Focus on input | Uniform or final hidden state | Learns where to focus via alignment scores |
| Long sequence handling | Struggles with long dependencies | Better, attends to relevant tokens regardless of distance |
| Interpretability | Hard to interpret | Alignment matrix gives interpretable attention weights |
What are contextual word embeddings?
contextual word embeddings
- Intuition: the representation of the meaning of a word should be different in different contexts!
- A different vector for each occurrence of a word, expressing a different meaning depending on the surrounding words
- How do we compute contextual embeddings?
- Attention RNN: deep contextualised word representations over the whole sentence, not just neighbouring words, e.g. “I accessed the bank account” – “bank” depends on “accessed” and “account”
- Embeddings from Language Models (ELMo):
- ELMo models complex characteristics of word use (syntax & semantics) …
- …and how these uses vary across linguistic contexts (polysemy/ambiguity)
- (Unsupervised)/self-supervised training (with $y_t = x_{t-1}$)
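To make the “bank” example concrete, a sketch using the Hugging Face transformers library with BERT as a stand-in for a contextual model (this assumes the library and the bert-base-uncased weights are available, and that “bank” stays a single word-piece): the same surface word gets different vectors in different sentences.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tok(sentence, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[0, idx]

v1 = bank_vector("I accessed the bank account.")
v2 = bank_vector("We sat on the bank of the river.")
print(F.cosine_similarity(v1, v2, dim=0))   # < 1: same word, different vectors
```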
What is self-attention?
Non-seq2seq: non-sequential models handle unordered data, where the relationships or patterns within the data do not depend on sequence or time.
Some non-seq2seq tasks: summarisation, description generation (e.g. image captioning).
Learn which self-activation yields the highest correlation between the current word and the previous representations; this results in better representations!
- Self-attention allows each word to focus on other words in the same input.
- Learns context-aware representations (e.g. “run” depends on “criminal”).
- Helpful beyond seq2seq: also used in summarisation, captioning, etc.
- Enables parallel computation and better global understanding.
Self-attention resolves ambiguity, because it allows each word to focus on other words
Example: the self-attention for “it” depends on the earlier representation of “animal” vs. “street”.
Compare the architectures for sequence processing
Recurrence:
- a logarithmic decay of context
- expensive sequential computation
Convolution:
- context as a fixed-width window
- computationally efficient
Self-attention:
- direct connection of the output with all inputs
- complexity depends on the sequence length (quadratic in N)
What are residual connections?
These are shortcut connections that perform an identity mapping.
Residual blocks:
- a shortcut (identity) mapping placed over the feature transformation
- same number of parameters (the shortcut adds no extra weights)
residual connections can span arbitrary depth:
- it can skip layers due to its shortcut connections
- it maintains reference to un-transformed information
- it’s more robust against perturbations in inputs
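A minimal PyTorch sketch of a residual block: the shortcut carries the un-transformed input past the feature transformation and adds it back; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        # the feature transformation F(x); the shortcut adds no extra parameters
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.f(x)     # identity shortcut: output = x + F(x)

x = torch.randn(10, 64)
block = ResidualBlock()
print(block(x).shape)            # (10, 64): same shape, so blocks can be stacked or skipped
```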
Why is multi-head attention effective?
- no recurrence is needed
- it has self-attention, so it connects the embeddings and positional information
- multiple heads:
- learn different types of relations: structural, semantic
- example: next-word, verb, subject etc.
- transformer: provides filter kernels of arbitrary content AND shape!
- compared to CNN: transformer generalises representation transformation
What is positional encoding?
In classic attention, the words are distributed over an input window and the order of the words is ignored.
If we want to add a kind of word-order counter to the embeddings, we can use positional encoding.
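A NumPy sketch of the sinusoidal positional encoding from the original Transformer, one common way to implement this word-order counter; it is simply added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]           # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# embeddings = token_embeddings + pe   # adds a fixed (not learned) word-order signal
print(pe.shape)   # (50, 16)
```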
What is the residual stream in transformers?
The residual stream
- An information-processing perspective: block components read from and write to the stream
- Implications: Provides a view focusing on the representational space
Explain the semi-supervised Training with Transfer Learning
often a two stage process:
- Unsupervised pre-training
- Related to word2vec: Learn embedding into (lower) transformer blocks
- Typical tasks: language modelling or sentence prediction; for an unsupervised corpus $U$, maximise the likelihood $L_1(U) = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)$
- Supervised fine-tuning
- Continue learning on downstream task (possibly fix k lower blocks)
- Crucial modification: adapt token representation
- Often the only feasible step for normal labs (with no massive TPU cluster)
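A hedged PyTorch sketch of the “fix k lower blocks” idea during supervised fine-tuning; the block stack and the classification head are stand-ins for a pre-trained model, not a specific library checkpoint:

```python
import torch
import torch.nn as nn

# a stand-in for a pre-trained stack of transformer blocks plus a new task head
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                        for _ in range(6)])
head = nn.Linear(64, 2)              # new, randomly initialised downstream classifier

k = 4                                 # fix (freeze) the k lower blocks
for block in blocks[:k]:
    for p in block.parameters():
        p.requires_grad = False       # frozen: no gradient updates during fine-tuning

params = [p for p in blocks.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=2e-5)   # fine-tune only the upper blocks + head
```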
Explain the self-supervised Training of Transformers (for NLP)
the training approach for NLP:
- self-supervised, e.g. next token prediction
- teacher forcing
- loss: $L_{CE} = -\sum_{w \in V} y_t[w] \log \hat{y}_t[w]$
- cross-entropy between the predicted and the correct word distribution
Reminder: RNN training
- inherently serial/sequential; nothing can be done in parallel
Transformer training
- inherently parallel
- allows for large input/context windows, 1024–4096 tokens (GPT-4)
What is the efficiency of the transformer architecture?
Reminder: Self-attention (& Transformer)
- Direct connection of output with all inputs
- Complexity depends on the sequence length N (quadratic, O(N²))
Efficient transformer variants exist:
- Fixed/factorised localised patterns & learnable sparse attention patterns: attend only to a subset of input tokens (see the sketch after this list)
- Low rank & kernel methods: Approximate attention matrix with low-rank matrices
- Random Gaussian projections
- Radial basis function kernel as unbiased approximation
- Memory & Recurrence methods: Access several tokens from global memory or locally via recurrence
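A small NumPy sketch of the fixed localised pattern mentioned above: a banded mask so each token attends only to a window of nearby tokens; the window size is an illustrative assumption.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """mask[i, j] is True if token i is allowed to attend to token j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=2)
print(mask.astype(int))
# In attention: set scores[~mask] = -inf before the softmax, so only about
# (2 * window + 1) * N scores matter instead of the full N * N matrix.
```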
How do we use Transformers for Computer Vision?
- Image chopped into 16×16 patches, instead of filtering the whole image (as a CNN does)
- Position embedding and different levels of relationships learned with multi-head attention
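A NumPy sketch of the patching step: a square image is cut into 16×16 patches, each flattened into a token vector that plays the role of a word; the image size and channel count are illustrative.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """img: (H, W, C) with H and W divisible by `patch`.
    Returns (num_patches, patch*patch*C) token vectors."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)           # group pixels by patch grid position
    return img.reshape(-1, patch * patch * C)

img = np.random.rand(224, 224, 3)                # e.g. a 224x224 RGB image
tokens = image_to_patches(img)
print(tokens.shape)   # (196, 768): 14x14 patches, each a 768-dim "word"
# a learned linear projection + position embeddings turn these into transformer inputs
```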
How do we use Transformers for Natural Language Processing: BERT and GPT
Contextual models based on the [Vaswani et al.] Transformer architecture: GPT is autoregressive (decoder blocks), while BERT is a bidirectional autoencoder (encoder blocks).
Bidirectional Encoder Representations from Transformers (BERT)
- Focus on transformer encoder blocks
- Contextual model (like other Transformer-based)
- Like ELMo: a deeply bidirectional (autoencoder) model with attention (here: transformer blocks)
- Unsupervised pre-trained on massive plain text
- English Wikipedia, ~3 billion tokens
- BooksCorpus, >11k books, ~800 million tokens
- Pre-training: masked language modelling plus next-sentence prediction (masked multi-task learning)
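A hedged NumPy sketch of how masked-LM training examples can be built; the ~15% masking rate follows BERT, but the token ids and MASK id are illustrative, and the real procedure also sometimes keeps or randomly replaces the chosen tokens instead of always masking them.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, rate=0.15, rng=None):
    """Return (masked input, labels); labels are -100 where no prediction is needed."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)        # positions the loss ignores
    picked = rng.random(token_ids.shape) < rate   # choose ~15% of positions
    labels[picked] = token_ids[picked]            # the model must recover these
    masked = token_ids.copy()
    masked[picked] = mask_id                      # replace chosen tokens with [MASK]
    return masked, labels

inp, lab = mask_tokens([12, 87, 5, 990, 43, 7], mask_id=103,
                       rng=np.random.default_rng(0))
print(inp, lab)
```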