Week 7: Advanced Sequence Learning with Transformers Flashcards
(17 cards)
How do 1D CNNs work for sequence learning and what are their key characteristics?
1D CNNs apply convolution and pooling operations along the time dimension of sequences, treating sequences as “sliding windows” with fixed width.
How It Works:
Input: Sequences represented as multi-channel data (e.g., words as 6D embeddings)
Convolution: 1D kernels slide over time dimension to detect local patterns
Pooling: Max pooling over time reduces sequence length while preserving important features
Output: Many-to-one classification (e.g., sentiment analysis)
Key Properties:
Local patterns: Detects n-gram-like features (e.g., “not good”, “very bad”)
Translation invariant: Same pattern detected regardless of position in sequence
Parallel processing: Much faster than RNNs (no sequential dependency)
Fixed receptive field: Limited context window unlike RNNs
For NLP:
Words: Represented as D-dimensional embeddings (could use Word2Vec)
Channels: Each embedding dimension becomes a channel
Filters: Learn to detect meaningful phrase patterns
Advantage vs RNNs:
Speed - can process entire sequence in parallel rather than sequentially
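Code Sketch (illustrative):
A minimal PyTorch sketch of a 1D CNN sentiment classifier; the embedding size, filter count, and kernel width below are assumptions, not values from the lecture.

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, num_filters=64, kernel_size=3, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Conv1d expects (batch, channels, time): each embedding dimension is a channel
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
            self.fc = nn.Linear(num_filters, num_classes)

        def forward(self, token_ids):                      # (batch, seq_len) of word ids
            x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
            x = torch.relu(self.conv(x))                   # detect local, n-gram-like patterns
            x = torch.max(x, dim=2).values                 # max pooling over time
            return self.fc(x)                              # many-to-one classification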
What is a problem with many-to-many sequence learning with RNNs?
Problem: In sequence-to-sequence tasks, it’s hard to learn which input parts correspond to which output parts
RNN limitation: Information gets compressed into a single hidden state, causing information loss
Solution: Attention allows the model to “look back” at all input positions when generating each output
What is the attention mechanism and how does it solve the alignment problem in sequence-to-sequence learning?
Alignment Problem - In sequence-to-sequence tasks, it’s difficult to learn which input parts correspond to which output parts. Traditional RNNs compress all information into a single hidden state, causing information loss.
The Solution - Attention:
Instead of using only the final encoder state, attention allows the decoder to look back at ALL encoder hidden states when generating each output.
How Attention Works:
Query (q): Current decoder state (what we’re trying to generate)
Keys (k): All encoder hidden states
Values (v): The actual information to extract
Attention weights (α): Learned scores showing how much to focus on each input position
Context vector: Weighted sum of all input information
Mathematical Formula:
Attention weights: α(q,k) = softmax(score(q,k))
Context: Σ αᵢ × vᵢ (weighted combination of all values)
Key Benefits:
No information bottleneck: Access to all input information
Alignment: Can see which input words influence each output word
Long-range dependencies: Direct connections across sequence
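Code Sketch (illustrative):
A minimal NumPy sketch of the formula above: softmax over query-key scores gives the attention weights, and the context vector is the weighted sum of the values. Dot-product scoring is an assumption here; other score functions exist.

    import numpy as np

    def attention_context(q, K, V):
        scores = K @ q                                   # score(q, k_i) for every encoder state
        alphas = np.exp(scores - scores.max())
        alphas = alphas / alphas.sum()                   # softmax: attention weights sum to 1
        context = alphas @ V                             # sum_i alpha_i * v_i
        return context, alphas

    # toy example: 5 encoder hidden states with 4-dimensional keys and values
    K = np.random.randn(5, 4); V = np.random.randn(5, 4); q = np.random.randn(4)
    context, alphas = attention_context(q, K, V)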
What are Query, Key, and Value in the attention mechanism and how do they work together?
The Three Components:
Query (Q): “What am I looking for?”
Current decoder state - represents what the decoder is trying to generate right now
Example: “I’m trying to translate ‘cat’ - what input information do I need?”
Key (K): “What information is available?”
All encoder hidden states (h₁, h₂, h₃, h₄, h₅)
Like having labels for each piece of available information
Example: “This is about ‘dog’, this is about ‘runs’, this is about ‘fast’”
Value (V): “What is the actual information?”
The actual content/features from each encoder hidden state
The information you want to extract once you decide what’s relevant
How They Work Together:
Compare: Query matches against all Keys to find relevance
Score: Calculate attention weights (how much to focus on each Key)
Combine: Use weights to create weighted sum of corresponding Values
Example:
Query: “Generating ‘cat’”
Keys: [“the”, “big”, “cat”, “runs”, “fast”]
Attention scores: [0.1, 0.2, 0.8, 0.05, 0.05] ← focuses on “cat”
Result: Weighted combination emphasizing “cat” information
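Code Sketch (illustrative):
A tiny NumPy sketch of the toy example above; the attention weights are hard-coded here (and renormalized so they sum to 1), whereas a real model computes them from query-key scores.

    import numpy as np

    keys = ["the", "big", "cat", "runs", "fast"]
    alphas = np.array([0.1, 0.2, 0.8, 0.05, 0.05])
    alphas = alphas / alphas.sum()           # softmax outputs would already sum to 1
    V = np.random.randn(5, 8)                # one value vector per input word (8-dim, assumed)
    result = alphas @ V                      # weighted combination emphasizing "cat"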
What is self-attention and how does it improve word representations?
Definition:
Self-attention allows each word in a sentence to attend to (focus on) other words in the SAME sentence to build better context-aware representations.
How It Works:
Query, Keys, Values: All come from the same input sequence
Each word asks: “Which other words in this sentence help me understand my meaning?”
Similarity scores: Computed between each word and all other words
Result: Context-aware word representations
Example:
In “The FBI is chasing a criminal”:
“chasing” attends strongly to “FBI” (the agent) and “criminal” (the target)
“it” in later sentences can attend to “animal” vs “street” to resolve ambiguity
Benefits:
Resolves ambiguity: Words get meaning from context
Better representations: Each word embedding includes relevant context
Parallel processing: Unlike RNNs, all words can be processed simultaneously
Long-range dependencies: Direct connections between any two words
Key Insight:
Instead of fixed word embeddings, each word gets a dynamic representation based on its current context.
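Code Sketch (illustrative):
A minimal NumPy sketch of self-attention over one sentence: queries, keys, and values all come from the same sequence via separate projection matrices (random here, learned in a real model).

    import numpy as np

    def softmax_rows(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d = 16
    X = np.random.randn(6, d)                      # embeddings of one sentence, e.g. "The FBI is chasing a criminal"
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # all three come from the SAME input sequence
    weights = softmax_rows(Q @ K.T / np.sqrt(d))   # every word scored against every other word
    Z = weights @ V                                # context-aware representation for every word, in parallel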
Can you reflect on the components of scaled dot-product attention and their effects?
The scaled dot-product attention has three key components:
1. Dot Product (Q·K):
Effect: Measures similarity between query and keys
How: Higher dot product = more similar = more relevant
Purpose: Finds which keys are most relevant to the current query
2. Scaling Factor (√d_k):
Effect: Prevents attention scores from becoming too large
Why needed: Without scaling, large dot products cause extreme softmax outputs
Result: Avoids vanishing gradients and maintains stable training
3. Softmax Normalization:
Effect: Converts similarity scores into a probability distribution
Result: All attention weights sum to 1
Purpose: Creates weighted combination where model focuses most on relevant parts
Overall Effect:
The combination creates a stable, efficient attention mechanism that can determine how much to focus on each input position when processing the current position.
Formula:
Attention(Q,K,V) = softmax(QK^T / √d_k)V
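Code Sketch (illustrative):
A direct NumPy translation of the formula (a sketch; batching and masking are omitted).

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                          # dot-product similarity, scaled
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
        return weights @ V                                       # weighted combination of values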
Compare the three main architectures for sequence processing: Recurrence, Convolution, and Self-attention.
Recurrence (RNNs/LSTMs):
Connectivity: Sequential processing, each layer depends on previous
Context: Decays with distance - information about distant inputs must survive every intermediate step
Computation: Expensive sequential computation (can’t parallelize)
Memory: Hidden state carries information forward
Convolution (1D CNNs):
Connectivity: Local connections via fixed-width windows
Context: Fixed-width context window (limited range)
Computation: Computationally efficient (parallel processing)
Limitation: Can’t see beyond kernel size without stacking many layers
Self-attention (Transformers):
Connectivity: Direct connection between ALL input positions
Context: Direct access to entire sequence
Computation: Complexity depends on sequence length (quadratic)
Advantage: Parallel processing + long-range dependencies
Trade-offs:
RNN: Good for long sequences, but slow training
CNN: Fast but limited context
Self-attention: Best of both worlds, but expensive for very long sequences
What are the key components of the Transformer architecture and how do they work together?
Core Innovation:
“Attention is All You Need” - Transformers use ONLY attention mechanisms and feed-forward networks, no RNNs or CNNs.
Encoder (Left Side):
Input: Word embeddings + positional encoding
Layers: Stack of identical layers, each with:
Self-Attention: Each word attends to all other input words
Feed Forward: Neural network processing
Add & Normalize: Residual connections + layer normalization
Decoder (Right Side):
Similar structure, but each layer has three sub-layers:
Masked Self-Attention: Attends only to previous words (prevents looking ahead)
Encoder-Decoder Attention: Attends to encoder output (translation mechanism)
Feed Forward: Same as encoder
Key Components:
Positional Encoding: Injects position information (since no sequential processing)
Multi-layer: Multiple encoder and decoder layers stacked
Parallel Processing: All positions processed simultaneously
Why Revolutionary:
Faster training: Parallel vs sequential processing
Better long-range dependencies: Direct attention connections
State-of-the-art results: Outperformed RNN/CNN models
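Code Sketch (illustrative):
A hedged PyTorch sketch of the encoder side using the built-in modules; the sizes match the original base configuration (6 layers, d_model=512, 8 heads), but this is a sketch, not the full model.

    import torch
    import torch.nn as nn

    # one encoder layer = self-attention + feed-forward, each wrapped in Add & Normalize
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=6)   # stack of identical layers

    x = torch.randn(2, 10, 512)     # (batch, seq_len, d_model): embeddings + positional encoding
    out = encoder(x)                # all positions processed in parallel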
Why do Transformers need positional encoding and how does it work?
Positional Encoding - Add position information to word embeddings so the model knows where each word is in the sequence.
How It Works:
Mathematical encoding: Uses sine and cosine functions with different frequencies
Unique fingerprint: Each position i gets a unique positional vector p_i
Added to embeddings: final_embedding = word_embedding + positional_encoding
Key Properties:
Deterministic: Same position always gets same encoding
Relative positions: Model can learn relationships between positions
No learned parameters: Uses fixed mathematical functions
Why Necessary:
Without positional encoding, “The cat sat on the mat” would be processed the same as “mat the on sat cat The” - just a bag of words with attention!
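Code Sketch (illustrative):
A NumPy sketch of the sinusoidal encoding; max_len and d_model are arbitrary choices for the example.

    import numpy as np

    def positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
        i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)  # a different frequency per dimension
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                   # cosine on odd dimensions
        return pe

    pe = positional_encoding(max_len=50, d_model=64)
    # final_embedding = word_embedding + pe[:seq_len]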
What is multi-head attention and why is it better than single-head attention?
Instead of using one attention mechanism, use multiple “heads” in parallel - each learning different types of relationships in the data.
How It Works:
Multiple heads: Each head has its own Q, K, V transformations
Parallel processing: All heads compute attention simultaneously
Different focus: Each head learns different linguistic relationships
Combine results: Concatenate all head outputs and apply final linear transformation
What Each Head Learns:
Head 1: Subject-verb relationships (“dog” ↔ “barks”)
Head 2: Adjective-noun relationships (“big” ↔ “dog”)
Head 3: Long-distance dependencies
Head 4: Syntactic patterns
etc.
Benefits:
Richer representations: Captures multiple types of relationships simultaneously
Specialization: Each head focuses on different aspects of language
Robustness: If one head fails, others still provide useful information
Same computational cost: Heads run in parallel
Formula:
MultiHead(Q,K,V) = Concat(head₁, head₂, …, headₕ)W^O
where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
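Code Sketch (illustrative):
A minimal PyTorch sketch using the built-in multi-head module; embed_dim and num_heads are assumptions.

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    x = torch.randn(2, 10, 512)          # (batch, seq_len, embed_dim)
    out, attn = mha(x, x, x)             # self-attention: Q, K, V all come from x
    # out:  (2, 10, 512) = Concat(head_1 .. head_8) W^O
    # attn: (2, 10, 10) attention weights, averaged over heads by default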
What are residual connections in Transformers and why are they important?
Residual connections (also called skip connections) are shortcuts that add the input directly to the output of a layer: output = F(x) + x
The Problem They Solve:
Vanishing gradient problem - In deep networks, gradients become weaker as they flow backward through many layers, making training difficult.
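Code Sketch (illustrative):
A PyTorch sketch of the "Add & Normalize" pattern around a feed-forward sub-layer (post-norm variant; the layer sizes are assumptions).

    import torch
    import torch.nn as nn

    class ResidualFeedForward(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                    nn.Linear(4 * d_model, d_model))
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.ff(x))   # output = F(x) + x, then layer normalization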
How are Transformers trained for NLP using self-supervised learning, and what advantage do they have over RNNs?
Self-Supervised Training:
Next token prediction - Given a sequence of words, predict the next word in the sequence.
Training Process:
Input: “So long and thanks for…”
Target: “long and thanks for all”
Loss: Cross-entropy between predicted and correct word distributions
Self-supervised: No external labels needed - text provides its own supervision
Key Advantage Over RNNs:
RNN Training:
Inherently serial: Must process “So” → “long” → “and” → “thanks” sequentially
Slow: Each step depends on the previous one
Transformer Training:
Inherently parallel: Can predict all positions simultaneously
Fast: All predictions computed at once using attention
Scalable: Allows large context windows (e.g., 1024-4096 tokens in GPT-2/GPT-3-era models, and far more in GPT-4)
Why This Works:
Teacher forcing: During training, use actual previous words (not predictions)
Masking: Prevent model from “cheating” by looking at future tokens
Large scale: Can process much longer sequences efficiently
Memory Aid:
Transformers = “Parallel prediction machine” vs RNNs = “Sequential word-by-word processing”!
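Code Sketch (illustrative):
A hedged sketch of one training step for next-token prediction: shift the tokens by one to get targets, hide the future with a causal mask, and apply cross-entropy. A plain encoder layer with a causal mask stands in for a decoder-only stack; the token ids and sizes are made up.

    import torch
    import torch.nn as nn

    tokens = torch.tensor([[12, 47, 9, 301, 56, 88]])        # toy ids for "So long and thanks for all"
    inputs, targets = tokens[:, :-1], tokens[:, 1:]          # teacher forcing: targets = inputs shifted by one

    vocab_size, d_model = 1000, 64
    embed = nn.Embedding(vocab_size, d_model)
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    head = nn.Linear(d_model, vocab_size)

    seq_len = inputs.shape[1]
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # True = masked future token

    hidden = layer(embed(inputs), src_mask=causal_mask)      # all positions predicted in parallel
    logits = head(hidden)                                    # (batch, seq_len, vocab_size)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))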
How do Vision Transformers apply the Transformer architecture to computer vision?
How It Works:
Image → Patches: Divide image into 16x16 pixel patches
Patch Embedding: Each patch becomes a vector (like word embeddings)
Position Embedding: Add positional information to patches
Transformer Processing: Apply standard Transformer encoder blocks
Classification: Use output for image classification
Key Innovation:
“An image is worth 16x16 words” - instead of words in a sentence, we have patches in an image.
Advantages:
Attention across patches: Can focus on relevant image regions
Long-range dependencies: Direct connections between distant patches
Transfer learning: Can use pre-trained language model techniques
Scalability: Benefits from large datasets like language models
Emerging Attention Maps:
Different transformer layers learn to attend to different visual relationships and levels of detail, similar to how CNNs learn hierarchical features.
vs CNNs:
CNNs: Local receptive fields, hierarchical feature learning
ViTs: Global attention, direct patch relationships
Memory Aid:
ViT = “Treating image patches like words in a sentence” - apply text transformers to vision!
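Code Sketch (illustrative):
A PyTorch sketch of the patching step; the image size, patch size, and embedding width follow the standard ViT-Base setup, but the projection here is random rather than learned.

    import torch
    import torch.nn as nn

    patch, d_model = 16, 768
    img = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)

    # a Conv2d with kernel = stride = patch size cuts and embeds the patches in one step
    to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
    x = to_patches(img)                                 # (1, 768, 14, 14): one vector per 16x16 patch
    x = x.flatten(2).transpose(1, 2)                    # (1, 196, 768): a "sentence" of 196 patch tokens

    cls = torch.zeros(1, 1, d_model)                    # learnable [CLS] token in the real model
    pos = torch.zeros(1, 197, d_model)                  # learnable position embeddings in the real model
    tokens = torch.cat([cls, x], dim=1) + pos           # ready for standard Transformer encoder blocks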
What are the key differences between BERT and GPT transformer architectures?
BERT (Bidirectional Encoder Representations from Transformers):
Architecture: Uses encoder blocks from Transformer
Bidirectional: Can see context from both left AND right sides
Training: Masked Language Modeling - predict missing words in sentences
Use case: Understanding tasks (classification, question answering, sentiment analysis)
Example training: “The [MASK] is chasing the mouse” → predict “cat”
GPT (Generative Pre-Training):
Architecture: Uses decoder blocks from Transformer
Autoregressive: Can only see previous context (left-to-right)
Training: Next token prediction - predict what comes next
Use case: Generation tasks (text completion, creative writing, dialogue)
Example training: “The cat is chasing” → predict “the”
Key Technical Differences:
BERT: Encoder-only, bidirectional attention, masked training
GPT: Decoder-only, causal (masked) attention, autoregressive training
Analogy:
BERT: Like a student who can see the whole sentence with some words blanked out
GPT: Like a student writing a story one word at a time, only seeing what came before
Training Data:
BERT: Books, Wikipedia (~3.3B tokens)
GPT: Larger web corpora (GPT-3: ~410B tokens)
Memory Aid:
BERT = “Fill in the blank expert”, GPT = “Story continuation expert”!
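Code Sketch (illustrative):
A small PyTorch sketch contrasting the two attention masks; True marks positions a token is NOT allowed to attend to.

    import torch

    seq_len = 5

    # BERT-style encoder: bidirectional, every token may attend to every other token
    bert_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # GPT-style decoder: causal, each token sees only itself and earlier tokens
    gpt_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # first rows: [False, True, True, True, True], [False, False, True, True, True], ...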
What are the key factors that determine Large Language Model performance and how do they scale?
Three Key Scaling Factors:
1. Model Size (Parameters):
Small models: Millions of parameters (GPT-1: 117M)
Large models: Hundreds of billions (GPT-3: 175B, Megatron-Turing NLG: 530B)
Trend: Exponential growth in model size over time
2. Dataset Size (Training Data):
Evolution: From small corpora to massive web crawls
Examples:
Shakespeare: 884K tokens
Books: ~11K books
CommonCrawl: 410B+ tokens (filtered web-scale crawl)
Key insight: More data generally leads to better performance
3. Compute Budget:
Training cost: Measured in GPU-days or compute hours
Trade-off: Larger models need more compute but perform better
Scaling laws: Performance improves predictably with more compute
Scaling Laws (Kaplan et al.):
Performance scales as power laws with:
N^α (number of parameters)
C^β (compute budget)
D^γ (dataset size)
Key Insight:
“Scale is all you need” - Simply making models bigger with more data and compute consistently improves performance, leading to emergent capabilities
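Code Sketch (illustrative):
A worked example of the power-law form for model size; the constant and exponent below are placeholders, not the fitted values from Kaplan et al.

    # Kaplan-style scaling law: test loss falls as a power law in parameter count N
    # L(N) ~ (N_c / N) ** alpha_N   (analogous forms hold for dataset size D and compute C)

    def loss_from_params(n_params, n_c=1e13, alpha_n=0.08):   # placeholder constants
        return (n_c / n_params) ** alpha_n

    for n in [1e8, 1e9, 1e10, 1e11]:
        print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")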
Summarize the key innovations and limitations of the Transformer architecture for sequence learning.
Key Innovation - Attention Mechanism:
Core idea: Learn to attend to specific items in the history at any time step
Solves alignment problem: Direct connections between any input/output positions
Multiple variants: Different attention types for different complexities
Related to: Bayesian & intractable latent alignment (probabilistic foundations)
Major Benefits:
Mitigates vanishing gradient problem: Direct connections preserve gradients
Solves information decay: Access to all positions, not just compressed final state
Most effective improvement: Revolutionary advancement for sequence learning in the last decade
Multi-level relationships: Learn arbitrary relations in data on multiple levels
Transformer Success Factors:
Parallel processing: Unlike RNNs, can process entire sequences simultaneously
Scalability: Benefits from massive models and datasets
Transfer learning: Pre-train then fine-tune paradigm
Data engineering: Success depends on vast data selection & engineering
Key Limitations:
Finite sequence length: Structural shortcoming for very long sequences
Computational requirements: Need massive computation for huge models
Quadratic complexity: Attention scales O(n²) with sequence length
Overall Impact:
Transformers became the dominant architecture for sequence learning, enabling modern LLMs and revolutionizing NLP through the combination of attention, scale, and transfer learning.