Topic 7: Advanced Sequence Learning with Transformers Flashcards
(26 cards)
What is attention?
We want to add a mechanism that preserves memory across the input-output link, because in sequence learning the memory of earlier inputs decays over time.
Attention mechanisms are inspired by the ability of humans (and other animals) to selectively pay more attention to salient details and ignore details that are less important in the moment. Having access to all information but focusing on only the most relevant information helps to ensure that no meaningful details are lost while enabling efficient use of limited memory and time.
What are some attention mechanisms/functions?
Self-Attention Mechanism:
- Commonly used in tasks involving sequences, such as natural language processing
- allows the model to weigh the importance of each element in the sequence with respect to all the other elements
- The Transformer model, for instance, relies heavily on self-attention.
Scaled Dot-Product Attention:
- key component of the Transformer architecture
- calculates attention scores by taking the dot product of a query vector and the keys, followed by scaling and applying a softmax function.
- This type of attention mechanism is highly efficient and has contributed to the success of Transformers in various applications (a minimal sketch follows this list).
Multi-Head Attention:
- extends the idea of attention by allowing the model to focus on different parts of the input simultaneously
- achieves this by using multiple sets of learnable parameters, each generating different attention scores
- this technique enhances the model’s ability to capture complex relationships within the data.
Location-Based Attention:
- often used in image-related tasks. It assigns attention scores based on the spatial location of elements in the input
- This can be particularly useful for tasks like object detection and image captioning.
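The scaled dot-product attention described above can be written in a few lines. A minimal NumPy sketch, assuming a single unbatched sequence and small illustrative dimensions (not an optimised implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of each query with each key
    weights = softmax(scores, axis=-1)     # attention weights sum to 1 per query
    return weights @ V, weights            # weighted sum of the values

# toy example: 3 tokens, d_k = d_v = 4; Q = K = V = X gives self-attention
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
print(w.round(2))                          # each row sums to 1
```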
What are transformers?
Instead of RNNs (which process sequences step-by-step), Transformers use self-attention and positional encoding to process the entire sequence in parallel.
Instead of using an RNN, we use a windowed input-output mapping.
Simplified transformer: add/concatenate the positional encoding with the attention output, then transform the result with a feed-forward activation.
The transformer block
- The signal is sequentially modified by transformer blocks
- Implication: focus on the computations done by the block components
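A minimal sketch of the computations inside one block, assuming a single attention head, small illustrative dimensions, and plain post-layer-norm; real blocks use multi-head attention and learned layer-norm parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """x: (seq_len, d_model); single-head self-attention for brevity."""
    # 1) self-attention sub-layer + residual connection + layer norm
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn @ Wo)
    # 2) position-wise feed-forward sub-layer + residual + layer norm
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU MLP applied to each position
    return layer_norm(x + ff)

# toy shapes: d_model = 8, d_ff = 16, seq_len = 5
rng = np.random.default_rng(1)
d, f = 8, 16
params = [rng.normal(scale=0.1, size=s) for s in
          [(d, d), (d, d), (d, d), (d, d), (d, f), (f,), (f, d), (d,)]]
y = transformer_block(rng.normal(size=(5, d)), *params)
print(y.shape)   # (5, 8): same shape in and out, so blocks can be stacked
```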
What is the “Attention is all you need” paper about?
It is the 2017 paper by Vaswani et al. that introduced the Transformer architecture, showing that self-attention alone (without recurrence or convolution) is sufficient for sequence-to-sequence tasks such as machine translation.
What is the alignment problem?
The alignment problem revolves around the challenge of aligning the objectives and decision-making capabilities of machine learning models with human values. While a machine learning model may be designed to optimize certain objectives, it can sometimes diverge from human values and produce undesirable outcomes.
In sequence learning, we also want the output to be aligned with the input (ideally the same amount of output as input), which is difficult when the two differ in length.
Describe Neural Machine Translation (NMT)
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
Unaligned sequence-to-sequence: the input and output are not of the same length. We also add a context vector, which works as a bottleneck (a middle layer between encoder and decoder).
- Classic seq-to-seq models use a single context vector (c) to represent the input; this creates a bottleneck and makes it hard to align input and output tokens.
- The attention mechanism solves this by letting the decoder focus on different parts of the input at each step, improving translation and alignment.
- The alignment matrix shows how attention links input words to output words.
- Even with attention, alignment is still a challenge: it is learned, not guaranteed to be accurate.
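A small sketch of the idea above, assuming simple dot-product alignment scores (Bahdanau-style attention uses a small learned network instead): at every decoder step a fresh context vector is computed as a weighted sum of all encoder states, and the weights form one row of the alignment matrix.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (n_inputs, d)."""
    scores = encoder_states @ decoder_state   # one alignment score per input token
    alpha = softmax(scores)                   # one row of the alignment matrix
    context = alpha @ encoder_states          # weighted sum, recomputed every step
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))    # 6 input tokens, hidden size 8
dec = rng.normal(size=8)         # current decoder hidden state
ctx, alpha = attention_context(dec, enc)
print(alpha.round(2))            # which input tokens this output step attends to
```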
What is an encoder and a decoder?
Encoder: a function or system that takes raw input data and transforms it into a compact, meaningful representation. Think of the encoder as summarizing or compressing the input.
- Input: Data in its original form (e.g., text, image, audio, etc.)
- Output: A latent or compressed representation that captures essential information
- Often used to reduce dimensionality, extract features, or remove noise
Decoder: the inverse of the encoder: it takes that encoded representation and tries to reconstruct the original data or generate new data based on it. Think of the decoder as rebuilding or generating from a compressed summary.
- Input: The latent code or embedding from the encoder
- Output: The predicted or reconstructed version of the original data
- Used to recover, translate, or generate new forms of the original input
Encoders and decoders force a model to learn useful internal representations:
This is key for tasks like translation, summarization, generation, compression, denoising, and embedding learning.
The encoder-decoder framework is the backbone of models like Autoencoders, Transformers, Seq2Seq, BERT, GPT, and more.
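A minimal autoencoder-style sketch of the encoder/decoder pairing in PyTorch; the layer sizes and reconstruction objective are illustrative assumptions, not a specific model from the lecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
    def forward(self, x):              # input -> compact latent representation
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, d_latent=32, d_out=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_out))
    def forward(self, z):              # latent code -> reconstruction
        return self.net(z)

enc, dec = Encoder(), Decoder()
x = torch.randn(16, 784)               # a batch of flattened inputs
z = enc(x)                             # (16, 32) latent codes
x_hat = dec(z)                         # (16, 784) reconstructions
loss = nn.functional.mse_loss(x_hat, x)   # train the pair to reconstruct the input
```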
Describe multi-head attention
In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters N-ways and passes each split independently through a separate Head. All of these similar Attention calculations are then combined together to produce a final Attention score. This is called Multi-head attention and gives the Transformer greater power to encode multiple relationships and nuances for each word.
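A NumPy sketch of the split-heads idea described above, assuming a single unbatched sequence and plain slicing of Q, K, V into heads; real implementations use reshapes and batched matrix multiplies, but the computation is the same:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); d_model must be divisible by n_heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):                        # each head attends independently
        s = slice(h * d_head, (h + 1) * d_head)     # its own slice of Q/K/V
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo      # combine heads, project back

rng = np.random.default_rng(0)
d = 8
W = [rng.normal(scale=0.3, size=(d, d)) for _ in range(4)]
out = multi_head_attention(rng.normal(size=(5, d)), *W, n_heads=2)
print(out.shape)   # (5, 8)
```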
Describe the training of a transformer
Encoder: processes the input sentence (e.g., “Thinking Machines”) in multiple layers:
- Adds positional encoding so the model knows word order.
- Each layer has:
- Self-Attention: lets each word focus on other words.
- Feed Forward: processes and transforms features.
- Add & Normalize: stabilizes training.
Decoder: generates the output sentence one word at a time:
- Each layer has:
- Masked Self-Attention: only looks at earlier words in the output (to preserve causality).
- Encoder-Decoder Attention: looks at encoder output to “translate” from input to output.
- Feed Forward and Add & Normalize like in encoder.
Transformers are trained to map input sequences to output sequences using an encoder-decoder architecture. The encoder processes the input to produce context-rich representations, and the decoder generates outputs one token at a time, attending to both the previously generated tokens and the encoder’s output. During training, teacher forcing is used, feeding the correct previous token rather than the model’s own output. The model learns by minimizing the cross-entropy loss between its predictions and the true output tokens. Transformers rely on self-attention, positional encoding, and feed-forward layers, and thanks to their fully parallel structure, they can be trained efficiently on large datasets using gradient descent.
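A hedged PyTorch sketch of one training step with teacher forcing, using torch.nn.Transformer as a stand-in for the encoder-decoder described above; the vocabulary size, model size, and random token batches are illustrative, and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn

vocab, d_model = 1000, 64                      # illustrative sizes
emb = nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
to_vocab = nn.Linear(d_model, vocab)
opt = torch.optim.Adam(list(emb.parameters()) + list(model.parameters())
                       + list(to_vocab.parameters()), lr=1e-4)

src = torch.randint(0, vocab, (8, 12))         # batch of source sequences
tgt = torch.randint(0, vocab, (8, 10))         # batch of gold target sequences

# teacher forcing: the decoder sees the gold tokens shifted right,
# and is trained to predict the next gold token at every position
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal = model.generate_square_subsequent_mask(tgt_in.size(1))  # masked self-attention

hidden = model(emb(src), emb(tgt_in), tgt_mask=causal)   # (8, 9, d_model)
logits = to_vocab(hidden)                                # (8, 9, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tgt_out.reshape(-1))
loss.backward()
opt.step()
```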
Explain how we move towards LLMs
LLMs capture rich, contextual representations of language, generalize across tasks, and perform well even with minimal task-specific data — all thanks to the Transformer’s scalability and unsupervised learning paradigm.
What are the limitations of shallow embeddings?
word2vec and GloVe are shallow and context-free: they assign the same representation to a word, e.g. “bank”, whether it is used in “bank account” or “bank of a river”.
- shallow embeddings are static (they don’t change depending on the context)
- embedding for a word doesn’t reflect how its meaning changes in context
What is attention contributing compared to context in RNNs?
Attention replaces the fixed context bottleneck with a flexible, token-wise focus mechanism. Instead of summarizing the input into one vector, it attends differently for each output step, enabling better performance, especially for longer sequences.
| Feature | Classic RNN context | Attention mechanism |
|---|---|---|
| Context representation | Single fixed-size vector (c) from the final encoder state | Dynamic, weighted sum of all encoder states |
| Memory capacity | Limited, bottlenecked | Larger, the full input sequence is available |
| Focus on input | Uniform or final hidden state | Learns where to focus via alignment scores |
| Long sequence handling | Struggles with long dependencies | Better, attends to relevant tokens regardless of distance |
| Interpretability | Hard to interpret | Alignment matrix gives interpretable attention weights |
What are contextual word embeddings?
contextual word embeddings
- Intuition: the representation of the meaning of a word should be different in different contexts!
- A different vector for each occurrence of a word, expressing a different meaning depending on the surrounding words
- How do we compute contextual embeddings?
- Attention RNN: deep contextualised word representations over the whole sentence, not just neighbouring words, e.g. “I accessed the bank account” – “bank” depends on “accessed” and “account”
- Embeddings from Language Models (ELMo):
- ELMo models complex characteristics of word use (syntax & semantics) …
- …and how these uses vary across linguistic contexts (polysemy/ambiguity)
- (Unsupervised)/self-supervised training (with $y_t = x_{t-1}$)
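To make the “bank” example concrete, a sketch using the Hugging Face transformers library with BERT as a stand-in for a contextual model (this assumes the library and the bert-base-uncased weights are available, and that “bank” stays a single word-piece): the same surface word gets different vectors in different sentences.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tok(sentence, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[0, idx]

v1 = bank_vector("I accessed the bank account.")
v2 = bank_vector("We sat on the bank of the river.")
print(F.cosine_similarity(v1, v2, dim=0))   # < 1: same word, different vectors
```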
What is self-attention?
Non-seq2seq: non-sequential models handle unordered data, where the relationships or patterns within the data do not depend on sequence or time.
Some non-seq2seq tasks: summarisation, description generation (e.g. image captioning).
Learn which self-activation yields the highest correlation between the current word and the previous representations; this results in better representations!
- Self-attention allows each word to focus on other words in the same input.
- Learns context-aware representations (e.g. “run” depends on “criminal”).
- Helpful beyond seq2seq: also used in summarisation, captioning, etc.
- Enables parallel computation and better global understanding.
Self-attention resolves ambiguity, because it allows each word to focus on other words
Example: the self-attention for “it” depends on the earlier representation of “animal” vs. “street”.
Compare the architectures for sequence processing
Recurrence:
- a logarithmic decay of context
- expensive sequential computation
Convolution:
- context as a fixed-width window
- computationally efficient
Self-attention:
- direct connection of the output with all inputs
- complexity depends on the sequence length (quadratic in N)
What are residual connections?
These are shortcut connections that perform an identity mapping.
Residual blocks:
- a shortcut (identity) mapping placed over the feature transformation
- same number of parameters (the shortcut adds no extra weights)
residual connections can span arbitrary depth:
- it can skip layers due to its shortcut connections
- it maintains reference to un-transformed information
- it’s more robust against perturbations in inputs
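A minimal PyTorch sketch of a residual block: the shortcut carries the un-transformed input past the feature transformation and adds it back; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        # the feature transformation F(x); the shortcut adds no extra parameters
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.f(x)     # identity shortcut: output = x + F(x)

x = torch.randn(10, 64)
block = ResidualBlock()
print(block(x).shape)            # (10, 64): same shape, so blocks can be stacked or skipped
```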
Why is multi-head attention effective?
- no recurrence is needed
- it has self-attention, so it connects the embeddings and positional information
- multiple heads:
- learn different types of relations: structural, semantic
- example: next-word, verb, subject etc.
- transformer: provides filter kernels of arbitrary content AND shape!
- compared to CNN: transformer generalises representation transformation
What is positional encoding?
In classic attention, the words are distributed over an input window and the order of the words is ignored.
If we want to add a kind of word-order counter to the embeddings, we can use positional encoding.
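A NumPy sketch of the sinusoidal positional encoding from the original Transformer, one common way to implement this word-order counter; it is simply added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]           # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# embeddings = token_embeddings + pe   # adds a fixed (not learned) word-order signal
print(pe.shape)   # (50, 16)
```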
What is the residual stream in transformers?
The residual stream
- An information-processing perspective: block components read from and write to the stream
- Implications: Provides a view focusing on the representational space
Explain the semi-supervised Training with Transfer Learning
often a two stage process:
- Unsupervised pre-training
- Related to word2vec: Learn embedding into (lower) transformer blocks
- Typical tasks: language modelling or sentence prediction; for an unsupervised corpus $U$, maximise the likelihood $L_1(U) = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)$
- Supervised fine-tuning
- Continue learning on downstream task (possibly fix k lower blocks)
- Crucial modification: adapt token representation
- Often the only feasible step for normal labs (with no massive TPU cluster)
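A hedged PyTorch sketch of the “fix k lower blocks” idea during supervised fine-tuning; the block stack and the classification head are stand-ins for a pre-trained model, not a specific library checkpoint:

```python
import torch
import torch.nn as nn

# a stand-in for a pre-trained stack of transformer blocks plus a new task head
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                        for _ in range(6)])
head = nn.Linear(64, 2)              # new, randomly initialised downstream classifier

k = 4                                 # fix (freeze) the k lower blocks
for block in blocks[:k]:
    for p in block.parameters():
        p.requires_grad = False       # frozen: no gradient updates during fine-tuning

params = [p for p in blocks.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=2e-5)   # fine-tune only the upper blocks + head
```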
Explain the self-supervised Training of Transformers (for NLP)
the training approach for NLP:
- self-supervised, e.g. next token prediction
- teacher forcing
- loss: $L_{CE} = -\sum_{w \in V} y_t[w] \log \hat{y}_t[w]$
- cross-entropy between the predicted and the correct word distribution
Reminder: RNN training
- inherently serial/sequential; nothing can be done in parallel
Transformer training
- inherently parallel
- allows for large input/context windows, 1024–4096 tokens (GPT-4)
What is the efficiency of the transformer architecture?
Reminder: Self-attention (& Transformer)
- Direct connection of output with all inputs
- Complexity depends on the sequence length N (quadratic, O(N²))
Efficient transformer variants exist:
- Fixed/factorised localised patterns & learnable sparse attention patterns: attend only to a subset of input tokens (see the sketch after this list)
- Low rank & kernel methods: Approximate attention matrix with low-rank matrices
- Random Gaussian projections
- Radial basis function kernel as unbiased approximation
- Memory & Recurrence methods: Access several tokens from global memory or locally via recurrence
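A small NumPy sketch of the fixed localised pattern mentioned above: a banded mask so each token attends only to a window of nearby tokens; the window size is an illustrative assumption.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """mask[i, j] is True if token i is allowed to attend to token j."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(seq_len=8, window=2)
print(mask.astype(int))
# In attention: set scores[~mask] = -inf before the softmax, so only about
# (2 * window + 1) * N scores matter instead of the full N * N matrix.
```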
How do we use Transformers for Computer Vision?
- Image chopped into 16×16 patches, instead of filtering the whole image (as a CNN does)
- Position embedding and different levels of relationships learned with multi-head attention
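A NumPy sketch of the patching step: a square image is cut into 16×16 patches, each flattened into a token vector that plays the role of a word; the image size and channel count are illustrative.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """img: (H, W, C) with H and W divisible by `patch`.
    Returns (num_patches, patch*patch*C) token vectors."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)           # group pixels by patch grid position
    return img.reshape(-1, patch * patch * C)

img = np.random.rand(224, 224, 3)                # e.g. a 224x224 RGB image
tokens = image_to_patches(img)
print(tokens.shape)   # (196, 768): 14x14 patches, each a 768-dim "word"
# a learned linear projection + position embeddings turn these into transformer inputs
```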
How do we use Transformers for Natural Language Processing: BERT and GPT
Contextual models based on the [Vaswani et al.] Transformer architecture: GPT is autoregressive (decoder blocks), while BERT is a bidirectional autoencoder (encoder blocks).
Bidirectional Encoder Representations from Transformers (BERT)
- Focus on transformer encoder blocks
- Contextual model (like other Transformer-based)
- Like ELMo: a deeply bidirectional (autoencoder) model with attention (here: transformer blocks)
- Unsupervised pre-trained on massive plain text
- English Wikipedia, ~3 billion tokens
- BooksCorpus, >11k books, ~800 million tokens
- Pre-training: masked language modelling plus next-sentence prediction (masked multi-task learning)
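A hedged NumPy sketch of how masked-LM training examples can be built; the ~15% masking rate follows BERT, but the token ids and MASK id are illustrative, and the real procedure also sometimes keeps or randomly replaces the chosen tokens instead of always masking them.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, rate=0.15, rng=None):
    """Return (masked input, labels); labels are -100 where no prediction is needed."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)        # positions the loss ignores
    picked = rng.random(token_ids.shape) < rate   # choose ~15% of positions
    labels[picked] = token_ids[picked]            # the model must recover these
    masked = token_ids.copy()
    masked[picked] = mask_id                      # replace chosen tokens with [MASK]
    return masked, labels

inp, lab = mask_tokens([12, 87, 5, 990, 43, 7], mask_id=103,
                       rng=np.random.default_rng(0))
print(inp, lab)
```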