Week 7: Advanced Sequence Learning with Transformers Flashcards
(17 cards)
How do 1D CNNs work for sequence learning and what are their key characteristics?
1D CNNs apply convolution and pooling operations along the time dimension of sequences, treating sequences as “sliding windows” with fixed width.
How It Works:
Input: Sequences represented as multi-channel data (e.g., words as 6D embeddings)
Convolution: 1D kernels slide over time dimension to detect local patterns
Pooling: Max pooling over time reduces sequence length while preserving important features
Output: Many-to-one classification (e.g., sentiment analysis)
Key Properties:
Local patterns: Detects n-gram-like features (e.g., “not good”, “very bad”)
Translation invariant: Same pattern detected regardless of position in sequence
Parallel processing: Much faster than RNNs (no sequential dependency)
Fixed receptive field: Limited context window unlike RNNs
For NLP:
Words: Represented as D-dimensional embeddings (could use Word2Vec)
Channels: Each embedding dimension becomes a channel
Filters: Learn to detect meaningful phrase patterns
Advantage vs RNNs:
Speed - can process entire sequence in parallel rather than sequentially
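Code Sketch (illustrative):
A minimal PyTorch sketch of a 1D CNN sentiment classifier; the embedding size, filter count, and kernel width below are assumptions, not values from the lecture.

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, num_filters=64, kernel_size=3, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Conv1d expects (batch, channels, time): each embedding dimension is a channel
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
            self.fc = nn.Linear(num_filters, num_classes)

        def forward(self, token_ids):                      # (batch, seq_len) of word ids
            x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
            x = torch.relu(self.conv(x))                   # detect local, n-gram-like patterns
            x = torch.max(x, dim=2).values                 # max pooling over time
            return self.fc(x)                              # many-to-one classification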
What is a problem with many-to-many sequence learning with RNNs?
Problem: In sequence-to-sequence tasks, it’s hard to learn which input parts correspond to which output parts
RNN limitation: Information gets compressed into a single hidden state, causing information loss
Solution: Attention allows the model to “look back” at all input positions when generating each output
What is the attention mechanism and how does it solve the alignment problem in sequence-to-sequence learning?
Alignment Problem - In sequence-to-sequence tasks, it’s difficult to learn which input parts correspond to which output parts. Traditional RNNs compress all information into a single hidden state, causing information loss.
The Solution - Attention:
Instead of using only the final encoder state, attention allows the decoder to look back at ALL encoder hidden states when generating each output.
How Attention Works:
Query (q): Current decoder state (what we’re trying to generate)
Keys (k): All encoder hidden states
Values (v): The actual information to extract
Attention weights (α): Learned scores showing how much to focus on each input position
Context vector: Weighted sum of all input information
Mathematical Formula:
Attention weights: α(q,k) = softmax(score(q,k))
Context: Σ αᵢ × vᵢ (weighted combination of all values)
Key Benefits:
No information bottleneck: Access to all input information
Alignment: Can see which input words influence each output word
Long-range dependencies: Direct connections across sequence
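Code Sketch (illustrative):
A minimal NumPy sketch of the formula above: softmax over query-key scores gives the attention weights, and the context vector is the weighted sum of the values. Dot-product scoring is an assumption here; other score functions exist.

    import numpy as np

    def attention_context(q, K, V):
        scores = K @ q                                   # score(q, k_i) for every encoder state
        alphas = np.exp(scores - scores.max())
        alphas = alphas / alphas.sum()                   # softmax: attention weights sum to 1
        context = alphas @ V                             # sum_i alpha_i * v_i
        return context, alphas

    # toy example: 5 encoder hidden states with 4-dimensional keys and values
    K = np.random.randn(5, 4); V = np.random.randn(5, 4); q = np.random.randn(4)
    context, alphas = attention_context(q, K, V)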
What are Query, Key, and Value in the attention mechanism and how do they work together?
The Three Components:
Query (Q): “What am I looking for?”
Current decoder state - represents what the decoder is trying to generate right now
Example: “I’m trying to translate ‘cat’ - what input information do I need?”
Key (K): “What information is available?”
All encoder hidden states (h₁, h₂, h₃, h₄, h₅)
Like having labels for each piece of available information
Example: “This is about ‘dog’, this is about ‘runs’, this is about ‘fast’”
Value (V): “What is the actual information?”
The actual content/features from each encoder hidden state
The information you want to extract once you decide what’s relevant
How They Work Together:
Compare: Query matches against all Keys to find relevance
Score: Calculate attention weights (how much to focus on each Key)
Combine: Use weights to create weighted sum of corresponding Values
Example:
Query: “Generating ‘cat’”
Keys: [“the”, “big”, “cat”, “runs”, “fast”]
Attention scores: [0.1, 0.2, 0.8, 0.05, 0.05] ← focuses on “cat”
Result: Weighted combination emphasizing “cat” information
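Code Sketch (illustrative):
A tiny NumPy sketch of the toy example above; the attention weights are hard-coded here (and renormalized so they sum to 1), whereas a real model computes them from query-key scores.

    import numpy as np

    keys = ["the", "big", "cat", "runs", "fast"]
    alphas = np.array([0.1, 0.2, 0.8, 0.05, 0.05])
    alphas = alphas / alphas.sum()           # softmax outputs would already sum to 1
    V = np.random.randn(5, 8)                # one value vector per input word (8-dim, assumed)
    result = alphas @ V                      # weighted combination emphasizing "cat"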
What is self-attention and how does it improve word representations?
Definition:
Self-attention allows each word in a sentence to attend to (focus on) other words in the SAME sentence to build better context-aware representations.
How It Works:
Query, Keys, Values: All come from the same input sequence
Each word asks: “Which other words in this sentence help me understand my meaning?”
Similarity scores: Computed between each word and all other words
Result: Context-aware word representations
Example:
In “The FBI is chasing a criminal”:
“chasing” attends strongly to “FBI” (the agent) and “criminal” (the target)
“it” in later sentences can attend to “animal” vs “street” to resolve ambiguity
Benefits:
Resolves ambiguity: Words get meaning from context
Better representations: Each word embedding includes relevant context
Parallel processing: Unlike RNNs, all words can be processed simultaneously
Long-range dependencies: Direct connections between any two words
Key Insight:
Instead of fixed word embeddings, each word gets a dynamic representation based on its current context.
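Code Sketch (illustrative):
A minimal NumPy sketch of self-attention over one sentence: queries, keys, and values all come from the same sequence via separate projection matrices (random here, learned in a real model).

    import numpy as np

    def softmax_rows(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d = 16
    X = np.random.randn(6, d)                      # embeddings of one sentence, e.g. "The FBI is chasing a criminal"
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # all three come from the SAME input sequence
    weights = softmax_rows(Q @ K.T / np.sqrt(d))   # every word scored against every other word
    Z = weights @ V                                # context-aware representation for every word, in parallel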
Can you reflect on the components of scaled dot-product attention and their effects?
The scaled dot-product attention has three key components:
1. Dot Product (Q·K):
Effect: Measures similarity between query and keys
How: Higher dot product = more similar = more relevant
Purpose: Finds which keys are most relevant to the current query
2. Scaling Factor (√d_k):
Effect: Prevents attention scores from becoming too large
Why needed: Without scaling, large dot products cause extreme softmax outputs
Result: Avoids vanishing gradients and maintains stable training
3. Softmax Normalization:
Effect: Converts similarity scores into a probability distribution
Result: All attention weights sum to 1
Purpose: Creates weighted combination where model focuses most on relevant parts
Overall Effect:
The combination creates a stable, efficient attention mechanism that can determine how much to focus on each input position when processing the current position.
Formula:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
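Code Sketch (illustrative):
A direct NumPy translation of the formula (a sketch; batching and masking are omitted).

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                          # dot-product similarity, scaled
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
        return weights @ V                                       # weighted combination of values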
Compare the three main architectures for sequence processing: Recurrence, Convolution, and Self-attention.
Recurrence (RNNs/LSTMs):
Connectivity: Sequential processing, each layer depends on previous
Context: Decays with distance - information about distant inputs must survive every intermediate step
Computation: Expensive sequential computation (can’t parallelize)
Memory: Hidden state carries information forward
Convolution (1D CNNs):
Connectivity: Local connections via fixed-width windows
Context: Fixed-width context window (limited range)
Computation: Computationally efficient (parallel processing)
Limitation: Can’t see beyond kernel size without stacking many layers
Self-attention (Transformers):
Connectivity: Direct connection between ALL input positions
Context: Direct access to entire sequence
Computation: Complexity depends on sequence length (quadratic)
Advantage: Parallel processing + long-range dependencies
Trade-offs:
RNN: Good for long sequences, but slow training
CNN: Fast but limited context
Self-attention: Best of both worlds, but expensive for very long sequences
What are the key components of the Transformer architecture and how do they work together?
Core Innovation:
“Attention is All You Need” - Transformers use ONLY attention mechanisms and feed-forward networks, no RNNs or CNNs.
Encoder (Left Side):
Input: Word embeddings + positional encoding
Layers: Stack of identical layers, each with:
Self-Attention: Each word attends to all other input words
Feed Forward: Neural network processing
Add & Normalize: Residual connections + layer normalization
Decoder (Right Side):
Similar structure, but each layer has three sub-layers:
Masked Self-Attention: Attends only to previous words (prevents looking ahead)
Encoder-Decoder Attention: Attends to encoder output (translation mechanism)
Feed Forward: Same as encoder
Key Components:
Positional Encoding: Injects position information (since no sequential processing)
Multi-layer: Multiple encoder and decoder layers stacked
Parallel Processing: All positions processed simultaneously
Why Revolutionary:
Faster training: Parallel vs sequential processing
Better long-range dependencies: Direct attention connections
State-of-the-art results: Outperformed RNN/CNN models
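Code Sketch (illustrative):
A hedged PyTorch sketch of the encoder side using the built-in modules; the sizes match the original base configuration (6 layers, d_model=512, 8 heads), but this is a sketch, not the full model.

    import torch
    import torch.nn as nn

    # one encoder layer = self-attention + feed-forward, each wrapped in Add & Normalize
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=6)   # stack of identical layers

    x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model): embeddings + positional encoding
    out = encoder(x)                # all positions processed in parallel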
Why do Transformers need positional encoding and how does it work?
Positional Encoding - Add position information to word embeddings so the model knows where each word is in the sequence.
How It Works:
Mathematical encoding: Uses sine and cosine functions with different frequencies
Unique fingerprint: Each position i gets a unique positional vector p_i
Added to embeddings: final_embedding = word_embedding + positional_encoding
Key Properties:
Deterministic: Same position always gets same encoding
Relative positions: Model can learn relationships between positions
No learned parameters: Uses fixed mathematical functions
Why Necessary:
Without positional encoding, “The cat sat on the mat” would be processed the same as “mat the on sat cat The” - just a bag of words with attention!
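Code Sketch (illustrative):
A NumPy sketch of the sinusoidal encoding; max_len and d_model are arbitrary choices for the example.

    import numpy as np

    def positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
        i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)  # a different frequency per dimension
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
        return pe

    pe = positional_encoding(max_len=50, d_model=64)
    # final_embedding = word_embedding + pe[:seq_len]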
What is multi-head attention and why is it better than single-head attention?
Instead of using one attention mechanism, use multiple “heads” in parallel - each learning different types of relationships in the data.
How It Works:
Multiple heads: Each head has its own Q, K, V transformations
Parallel processing: All heads compute attention simultaneously
Different focus: Each head learns different linguistic relationships
Combine results: Concatenate all head outputs and apply final linear transformation
What Each Head Learns:
Head 1: Subject-verb relationships (“dog” ↔ “barks”)
Head 2: Adjective-noun relationships (“big” ↔ “dog”)
Head 3: Long-distance dependencies
Head 4: Syntactic patterns
etc.
Benefits:
Richer representations: Captures multiple types of relationships simultaneously
Specialization: Each head focuses on different aspects of language
Robustness: If one head fails, others still provide useful information
Same computational cost: Heads run in parallel
Formula:
MultiHead(Q,K,V) = Concat(head₁, head₂, …, headₕ)W^O
where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
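Code Sketch (illustrative):
A minimal PyTorch sketch using the built-in multi-head module; embed_dim and num_heads are assumptions.

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    x = torch.randn(2, 10, 512)          # (batch, seq_len, embed_dim)
    out, attn = mha(x, x, x)             # self-attention: Q, K, V all come from x
    # out:  (2, 10, 512) = Concat(head_1 .. head_8) W^O
    # attn: (2, 10, 10) attention weights, averaged over heads by default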
What are residual connections in Transformers and why are they important?
Residual connections (also called skip connections) are shortcuts that add the input directly to the output of a layer: output = F(x) + x
The Problem They Solve:
Vanishing gradient problem - In deep networks, gradients become weaker as they flow backward through many layers, making training difficult.
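Code Sketch (illustrative):
A PyTorch sketch of the "Add & Normalize" pattern around a feed-forward sub-layer (post-norm variant; the layer sizes are assumptions).

    import torch
    import torch.nn as nn

    class ResidualFeedForward(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                    nn.Linear(4 * d_model, d_model))
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.ff(x))   # output = F(x) + x, then layer normalization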
How are Transformers trained for NLP using self-supervised learning, and what advantage do they have over RNNs?
Self-Supervised Training:
Next token prediction - Given a sequence of words, predict the next word in the sequence.
Training Process:
Input: “So long and thanks for…”
Target: “long and thanks for all”
Loss: Cross-entropy between predicted and correct word distributions
Self-supervised: No external labels needed - text provides its own supervision
Key Advantage Over RNNs:
RNN Training:
Inherently serial: Must process “So” → “long” → “and” → “thanks” sequentially
Slow: Each step depends on the previous one
Transformer Training:
Inherently parallel: Can predict all positions simultaneously
Fast: All predictions computed at once using attention
Scalable: Allows large context windows (e.g., 1024-4096 tokens in GPT-2/GPT-3-era models, and far more in GPT-4)
Why This Works:
Teacher forcing: During training, use actual previous words (not predictions)
Masking: Prevent model from “cheating” by looking at future tokens
Large scale: Can process much longer sequences efficiently
Memory Aid:
Transformers = “Parallel prediction machine” vs RNNs = “Sequential word-by-word processing”!
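Code Sketch (illustrative):
A hedged sketch of one training step for next-token prediction: shift the tokens by one to get targets, hide the future with a causal mask, and apply cross-entropy. A plain encoder layer with a causal mask stands in for a decoder-only stack; the token ids and sizes are made up.

    import torch
    import torch.nn as nn

    tokens = torch.tensor([[12, 47, 9, 301, 56, 88]])        # toy ids for "So long and thanks for all"
    inputs, targets = tokens[:, :-1], tokens[:, 1:]          # teacher forcing: targets = inputs shifted by one

    vocab_size, d_model = 1000, 64
    embed = nn.Embedding(vocab_size, d_model)
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    head = nn.Linear(d_model, vocab_size)

    seq_len = inputs.shape[1]
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True = masked future token

    hidden = layer(embed(inputs), src_mask=causal_mask)      # all positions predicted in parallel
    logits = head(hidden)                                    # (batch, seq_len, vocab_size)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))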
How do Vision Transformers apply the Transformer architecture to computer vision?
How It Works:
Image → Patches: Divide image into 16x16 pixel patches
Patch Embedding: Each patch becomes a vector (like word embeddings)
Position Embedding: Add positional information to patches
Transformer Processing: Apply standard Transformer encoder blocks
Classification: Use output for image classification
Key Innovation:
“An image is worth 16x16 words” - instead of words in a sentence, we have patches in an image.
Advantages:
Attention across patches: Can focus on relevant image regions
Long-range dependencies: Direct connections between distant patches
Transfer learning: Can use pre-trained language model techniques
Scalability: Benefits from large datasets like language models
Emerging Attention Maps:
Different transformer layers learn to attend to different visual relationships and levels of detail, similar to how CNNs learn hierarchical features.
vs CNNs:
CNNs: Local receptive fields, hierarchical feature learning
ViTs: Global attention, direct patch relationships
Memory Aid:
ViT = “Treating image patches like words in a sentence” - apply text transformers to vision!
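Code Sketch (illustrative):
A PyTorch sketch of the patching step; the image size, patch size, and embedding width follow the standard ViT-Base setup, but the projection here is random rather than learned.

    import torch
    import torch.nn as nn

    patch, d_model = 16, 768
    img = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)

    # a Conv2d with kernel = stride = patch size cuts and embeds the patches in one step
    to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
    x = to_patches(img)                                 # (1, 768, 14, 14): one vector per 16x16 patch
    x = x.flatten(2).transpose(1, 2)                    # (1, 196, 768): a "sentence" of 196 patch tokens

    cls = torch.zeros(1, 1, d_model)                    # learnable [CLS] token in the real model
    pos = torch.zeros(1, 197, d_model)                  # learnable position embeddings in the real model
    tokens = torch.cat([cls, x], dim=1) + pos           # ready for standard Transformer encoder blocks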
What are the key differences between BERT and GPT transformer architectures?
BERT (Bidirectional Encoder Representations from Transformers):
Architecture: Uses encoder blocks from Transformer
Bidirectional: Can see context from both left AND right sides
Training: Masked Language Modeling - predict missing words in sentences
Use case: Understanding tasks (classification, question answering, sentiment analysis)
Example training: “The [MASK] is chasing the mouse” → predict “cat”
GPT (Generative Pre-Training):
Architecture: Uses decoder blocks from Transformer
Autoregressive: Can only see previous context (left-to-right)
Training: Next token prediction - predict what comes next
Use case: Generation tasks (text completion, creative writing, dialogue)
Example training: “The cat is chasing” → predict “the”
Key Technical Differences:
BERT: Encoder-only, bidirectional attention, masked training
GPT: Decoder-only, causal (masked) attention, autoregressive training
Analogy:
BERT: Like a student who can see the whole sentence with some words blanked out
GPT: Like a student writing a story one word at a time, only seeing what came before
Training Data:
BERT: Books, Wikipedia (~3.3B tokens)
GPT: Larger web corpora (GPT-3: ~410B tokens)
Memory Aid:
BERT = “Fill in the blank expert”, GPT = “Story continuation expert”!
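Code Sketch (illustrative):
A small PyTorch sketch contrasting the two attention masks; True marks positions a token is NOT allowed to attend to.

    import torch

    seq_len = 5

    # BERT-style encoder: bidirectional, every token may attend to every other token
    bert_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # GPT-style decoder: causal, each token sees only itself and earlier tokens
    gpt_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # first rows: [False, True, True, True, True], [False, False, True, True, True], ...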
What are the key factors that determine Large Language Model performance and how do they scale?
Three Key Scaling Factors:
1. Model Size (Parameters):
Small models: Millions of parameters (GPT-1: 117M)
Large models: Hundreds of billions (GPT-3: 175B, Megatron-Turing NLG: 530B)
Trend: Exponential growth in model size over time
2. Dataset Size (Training Data):
Evolution: From small corpora to massive web crawls
Examples:
Shakespeare: 884K tokens
Books: ~11K books
CommonCrawl: 410B+ tokens (filtered web-scale crawl)
Key insight: More data generally leads to better performance
3. Compute Budget:
Training cost: Measured in GPU-days or compute hours
Trade-off: Larger models need more compute but perform better
Scaling laws: Performance improves predictably with more compute
Scaling Laws (Kaplan et al.):
Performance scales as power laws with:
N^α (number of parameters)
C^β (compute budget)
D^γ (dataset size)
Key Insight:
“Scale is all you need” - Simply making models bigger with more data and compute consistently improves performance, leading to emergent capabilities
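Code Sketch (illustrative):
A worked example of the power-law form for model size; the constant and exponent below are placeholders, not the fitted values from Kaplan et al.

    # Kaplan-style scaling law: test loss falls as a power law in parameter count N
    # L(N) ~ (N_c / N) ** alpha_N   (analogous forms hold for dataset size D and compute C)

    def loss_from_params(n_params, n_c=1e13, alpha_n=0.08):   # placeholder constants
        return (n_c / n_params) ** alpha_n

    for n in [1e8, 1e9, 1e10, 1e11]:
        print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")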
Summarize the key innovations and limitations of the Transformer architecture for sequence learning.
Key Innovation - Attention Mechanism:
Core idea: Learn to attend to specific items in the history at any time step
Solves alignment problem: Direct connections between any input/output positions
Multiple variants: Different attention types for different complexities
Related to: Bayesian & intractable latent alignment (probabilistic foundations)
Major Benefits:
Mitigates vanishing gradient problem: Direct connections preserve gradients
Solves information decay: Access to all positions, not just compressed final state
Most effective improvement: Revolutionary advancement for sequence learning in the last decade
Multi-level relationships: Learn arbitrary relations in data on multiple levels
Transformer Success Factors:
Parallel processing: Unlike RNNs, can process entire sequences simultaneously
Scalability: Benefits from massive models and datasets
Transfer learning: Pre-train then fine-tune paradigm
Data engineering: Success depends on vast data selection & engineering
Key Limitations:
Finite sequence length: Structural shortcoming for very long sequences
Computational requirements: Need massive computation for huge models
Quadratic complexity: Attention scales O(n²) with sequence length
Overall Impact:
Transformers became the dominant architecture for sequence learning, enabling modern LLMs and revolutionizing NLP through the combination of attention, scale, and transfer learning.