Week 4: Sequence Learning with RNNs Flashcards
(14 cards)
What is the difference between a one-to-one mapping and sequence learning?
In sequence learning we don't have the typical one-to-one setup between a single input and a single output.
There are several mapping types:
one-to-many: image captioning (what is in this image?)
many-to-one: sentiment analysis
many-to-many (aligned): part-of-speech tagging
many-to-many (unaligned): machine translation
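A tiny shape sketch of these mapping types (all names and sizes below are invented for illustration, not taken from the course):

```python
import numpy as np

# Illustrative input/output shapes for each mapping type.
image   = np.zeros(512)              # one-to-many: one image feature vector in ...
caption = ["a", "cat", "sleeps"]     # ... a sequence of caption words out

review    = np.zeros((30, 100))      # many-to-one: 30 word embeddings in ...
sentiment = "positive"               # ... a single label out

words = np.zeros((7, 100))           # aligned many-to-many: 7 words in ...
tags  = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "NOUN"]  # ... exactly 7 tags out

source = np.zeros((6, 100))          # unaligned many-to-many: 6 source words in ...
target = np.zeros((9, 100))          # ... 9 target words out (lengths differ)

print(len(words), "inputs ->", len(tags), "tags;",
      len(source), "source words ->", len(target), "target words")
```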
Explain step-by-step how a Recurrent Neural Network processes a sequence
Step-by-Step RNN Processing:
- Initial State
Start with initial hidden state h₀ (often zeros)
Have input sequence: x₁, x₂, x₃, …, xₜ
- At Each Time Step t:
Input: Current input xₜ + previous hidden state hₜ₋₁
Computation:
Combine inputs: hₜ = f(Wₓₕxₜ + Wₕₕhₜ₋₁ + b)
f is an activation function (tanh, ReLU, etc.)
Outputs:
New hidden state hₜ (carries “memory” forward)
Optional output yₜ (if needed at this step)
- Key Insight - The Recurrence:
Same weights (Wₓₕ, Wₕₕ) used at every time step
Hidden state hₜ becomes input for next time step
Information flows: h₀ → h₁ → h₂ → h₃ → …
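A minimal NumPy sketch of this loop (weights, sizes, and the random data are made up for illustration; not the course's implementation):

```python
import numpy as np

# One RNN forward pass over a short sequence.
input_dim, hidden_dim, seq_len = 4, 8, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights (shared across steps)
b    = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # input sequence x_1 ... x_T
h = np.zeros(hidden_dim)                   # initial hidden state h_0 (zeros)

for t in range(seq_len):
    # h_t = f(W_xh x_t + W_hh h_{t-1} + b), with f = tanh
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b)
    print(f"h_{t+1}:", np.round(h, 3))
```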
What is One-to-Many architecture and what does the hidden state track in this setup?
Purpose of one-to-many:
Single input (e.g., image) → Multiple outputs (e.g., caption words)
Hidden State Role:
NOT tracking previous inputs
Tracking what outputs have been generated so far
Maintains memory of: original input + generation progress
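A rough sketch of one-to-many generation, assuming a hypothetical captioning setup where the image seeds h₀ and the previously generated token is fed back each step (all weights and sizes invented):

```python
import numpy as np

vocab = ["<start>", "a", "cat", "sleeps", "<end>"]
hidden_dim, embed_dim = 8, 4
rng = np.random.default_rng(0)

W_init = rng.normal(size=(hidden_dim, 16)) * 0.1        # maps image features to h_0
W_eh   = rng.normal(size=(hidden_dim, embed_dim)) * 0.1 # previous-token embedding -> hidden
W_hh   = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy   = rng.normal(size=(len(vocab), hidden_dim)) * 0.1
embed  = rng.normal(size=(len(vocab), embed_dim)) * 0.1

image = rng.normal(size=16)
h = np.tanh(W_init @ image)   # hidden state starts from the single input, not from zeros
token = 0                     # <start>

for _ in range(6):
    h = np.tanh(W_eh @ embed[token] + W_hh @ h)   # h tracks what has been generated so far
    token = int(np.argmax(W_hy @ h))              # greedy pick of the next word
    print(vocab[token])
    if vocab[token] == "<end>":
        break
```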
Explain Many-to-One sequence architecture and how the hidden state accumulates information for classification.
Process the entire sequence → produce a single prediction (e.g., classify the sentiment of a review, or predict the next word).
Hidden State Role:
At each step the hidden state folds the current input into its running summary of everything seen so far.
After the last step, the final hidden state h_T summarizes the whole sequence and is fed to a classifier/output layer to make the single prediction.
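A minimal many-to-one sketch, assuming a hypothetical sentiment-style setup where the final hidden state feeds a classifier (weights and sizes invented):

```python
import numpy as np

input_dim, hidden_dim, num_classes, seq_len = 100, 16, 2, 30
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.05
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.05
W_hy = rng.normal(size=(num_classes, hidden_dim)) * 0.05

x = rng.normal(size=(seq_len, input_dim))  # e.g. 30 word embeddings of a review
h = np.zeros(hidden_dim)
for t in range(seq_len):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)    # h accumulates a summary of x_1..x_t

logits = W_hy @ h                          # single prediction from the final state h_T
print("predicted class:", int(np.argmax(logits)))
```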
Aligned vs Unaligned Sequence-to-Sequence
Aligned Seq-to-Seq:
Definition: Input and output sequences have the same length and a direct correspondence
Each input position maps to exactly one output position (part-of-speech tagging)
Unaligned Seq-to-Seq:
Definition: Input and output lengths can differ and there is no position-by-position correspondence (machine translation)
Typically handled with an encoder-decoder architecture (next card)
Explain the encoder-decoder architecture and how it handles different input/output sequence lengths.
Encoder: Reads entire input sequence sequentially
Processes x₁ → x₂ → … → x_T
Updates hidden states: h₁ → h₂ → … → h_T
Final state h_T becomes context variable C
Context Variable C:
Fixed-size vector summarizing entire input
“Bridge” between encoder and decoder
Contains compressed information about input sequence
Decoder: Generates output sequence
Starts with context C as initial state
Generates the output sequence on its own, step by step: y₁ → y₂ → … → y_S
Output length S determined by decoder (not input length T)
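A minimal sketch of the encoder-decoder idea, assuming illustrative random weights and a hypothetical output length of 3 (not a trained model):

```python
import numpy as np

# Encoder compresses the input into a context vector C; decoder generates from C.
in_dim, hid, out_vocab = 4, 8, 5
rng = np.random.default_rng(0)

W_xh  = rng.normal(size=(hid, in_dim)) * 0.1
W_hh  = rng.normal(size=(hid, hid)) * 0.1
W_dec = rng.normal(size=(hid, hid)) * 0.1
W_hy  = rng.normal(size=(out_vocab, hid)) * 0.1

# Encoder: read the whole input sequence (length T = 7)
x = rng.normal(size=(7, in_dim))
h = np.zeros(hid)
for t in range(len(x)):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)
C = h                                   # fixed-size context summarizing the input

# Decoder: start from C and generate S = 3 outputs (S need not equal T)
s = C
for _ in range(3):
    s = np.tanh(W_dec @ s)
    y = np.argmax(W_hy @ s)
    print("output token id:", int(y))
```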
What is Backpropagation Through Time (BPTT) and what is the vanishing/exploding gradient problem in RNNs?
- Training method for RNNs: unfold (unroll) the network over time
- Unrolling turns the recurrent structure into a deep feedforward network with weights shared across time steps
- Standard backpropagation can then be applied through the unrolled network
- Vanishing/exploding gradients: backpropagating through T time steps multiplies T per-step factors (involving Wₕₕ and the activation derivative); factors below 1 make the gradient shrink toward zero (vanishing), factors above 1 make it blow up (exploding)
- In short: time unfolding makes deep networks, and deep networks have gradient problems
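A toy numeric illustration of why this happens, using a scalar stand-in for the per-step factor (values are made up):

```python
import numpy as np

# BPTT multiplies one factor per unrolled time step, so the product of many
# factors either shrinks or grows exponentially with sequence length.
for factor, label in [(0.8, "vanishing"), (1.2, "exploding")]:
    grad = 1.0
    for t in range(50):
        grad *= factor          # one multiplication per unrolled time step
    print(f"{label}: after 50 steps the gradient scale is {grad:.2e}")
```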
What is teacher forcing and why is it used in RNN training?
Definition:
Training technique where you use correct target outputs instead of model predictions as input for the next time step
The Problem it Solves:
Normal RNNs use their own (often wrong) predictions during training
This causes error accumulation and poor learning
How it Works:
Training: Feed correct previous output y_{t-1} to next step
Testing: Use model’s own predictions o_{t-1}
Benefits:
Faster and more stable training
Better gradient flow
Implements maximum likelihood learning
Trade-off:
Training/test mismatch: Model trains on perfect inputs but must handle imperfect ones during inference
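A tiny sketch of the input-selection logic behind teacher forcing (the token strings and helper function are invented for illustration):

```python
# Teacher forcing: during training feed the correct previous target y_{t-1};
# at inference feed the model's own previous prediction o_{t-1}.
def next_input(prev_target, prev_prediction, training, teacher_forcing=True):
    if training and teacher_forcing:
        return prev_target        # training: correct y_{t-1}
    return prev_prediction        # inference (or no forcing): own o_{t-1}

targets = ["<start>", "the", "cat", "sleeps"]
predictions = ["<start>", "a", "dog", "runs"]   # imagined (wrong) model outputs

for t in range(1, len(targets)):
    train_in = next_input(targets[t - 1], predictions[t - 1], training=True)
    test_in  = next_input(targets[t - 1], predictions[t - 1], training=False)
    print(f"step {t}: train input = {train_in!r}, test input = {test_in!r}")
```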
What is the LSTM cell state and how does it solve the vanishing gradient problem?
The cell state Cₜ is the LSTM's long-term memory: a vector carried along the sequence on a mostly linear path, modified only by the forget gate (scaling old content) and the input gate (adding new content).
Because the core information is not squashed through a tanh at every step, gradients flowing backward along the cell state do not shrink exponentially, which mitigates the vanishing gradient problem.
Explain the three LSTM gates and their roles in controlling information flow
Forget Gate:
Role: Gatekeeper - decides what to remember/forget from previous memory
Input: Previous hidden state + current input
Output: Values 0-1 for each cell state element
Function: Controls what old information gets discarded
Input Gate:
Role: Decides what new information is relevant and should be added
Input: Previous hidden state + current input
Output: Values 0-1 determining how much new info to incorporate
Function: Updates the cell state with new information
Output Gate:
Role: Determines how much current memory to share as output
Input: Previous hidden state + current input + updated cell state
Output: Values 0-1 controlling cell state output
Function: Controls what information gets passed to next time step
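A sketch of one LSTM step with the three gates written out (random illustrative weights; the gate equations follow the standard LSTM formulation described on this card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])            # previous hidden state + current input
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate: what old memory to keep (0-1)
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate: how much new info to add (0-1)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate: how much memory to expose (0-1)
    C_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate new information
    C_t = f_t * C_prev + i_t * C_tilde           # cell state: mostly linear update
    h_t = o_t * np.tanh(C_t)                     # hidden state: gated view of the cell state
    return h_t, C_t

hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) * 0.1 for k in "fioc"}
b = {k: np.zeros(hid) for k in "fioc"}

h, C = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):            # run over a 5-step sequence
    h, C = lstm_step(x_t, h, C, W, b)
print("final h:", np.round(h, 3))
print("final C:", np.round(C, 3))
```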
How does information flow in LSTM differ from standard RNN, and why does this solve vanishing gradients?
Standard RNN Flow:
h_{t-1} → [multiply by W] → [apply tanh] → h_t
Every step passes through tanh activation
Gradient path: Goes through tanh derivative (≤1) at every step
Result: Gradients shrink exponentially over time
LSTM Flow - Two Parallel Paths:
Path 1: Cell State (Linear Highway)
C_{t-1} → [forget gate] → [add new info] → C_t
No tanh applied to main information flow
Mostly linear operations preserve gradient magnitude
Path 2: Hidden State (Controlled Output)
C_t → [tanh] → [output gate] → h_t
Just a “view” of cell state for external use
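A toy numeric comparison of the two gradient paths (scalar stand-ins, values invented):

```python
# Path through a standard RNN multiplies a tanh-derivative-like factor (< 1)
# every step; the LSTM cell-state path multiplies the forget gate, which the
# network can learn to keep close to 1.
steps = 50
rnn_factor = 0.8        # stand-in for the per-step tanh/weight factor
forget_gate = 0.98      # a forget gate kept near 1

print(f"RNN path:  {rnn_factor ** steps:.2e}")   # shrinks to roughly 1e-5
print(f"LSTM path: {forget_gate ** steps:.2f}")  # stays around 0.36
```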
What is the effect of the gating mechanism in LSTM cells?
Primary Effect: Solves the vanishing gradient problem by enabling selective, controlled information flow
Key Effects:
1. Enables Long-Term Dependencies:
Can learn relationships across many time steps
Maintains relevant information over long sequences
2. Creates Selective Memory:
Forget gate: Discards irrelevant old information
Input gate: Selectively incorporates new information
Output gate: Controls what information is shared
3. Preserves Gradient Flow:
Linear pathways through cell state avoid repeated nonlinearities
Gates control rather than transform core information
Prevents gradient shrinkage that plagued standard RNNs
4. Improves Learning Performance:
Better sequence modeling capabilities
More stable training
Can handle longer sequences effectively
Implementation: Three gates (forget, input, output) work together to create controlled information highways
Bottom Line: The gating mechanism transforms RNNs from networks that forget quickly into networks with selective, persistent memory capable of learning complex temporal patterns.
Memory Tip: “Gates create smart memory: remember what matters, forget what doesn’t, share what’s needed”
What are the key differences between LSTM and GRU, and what are the trade-offs?
LSTM:
3 gates: Input, Output, Forget gates
Separate cell state (C_t) and hidden state (H_t)
More complex architecture with more parameters
GRU:
2 gates: Update gate, Reset gate (simpler)
Single hidden state (no separate cell state)
Fewer parameters than LSTM
Performance Trade-offs:
LSTM Advantages:
More accurate on longer sequences
Better long-term memory due to separate cell state
More expressive due to additional complexity
GRU Advantages:
Fewer training parameters → faster training
Simpler architecture → easier to implement
Often comparable performance on shorter sequences
Common Challenge:
Both: Many hyperparameters, difficult to master!
Both: Solve vanishing gradient problem through gating
When to Use:
LSTM: When you need maximum accuracy on long sequences
GRU: When you want simpler, faster training with good performance
Memory Tip: “LSTM = More gates, more accuracy; GRU = Fewer gates, faster training”
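For contrast with the LSTM sketch above, a minimal GRU step with only two gates and a single hidden state (random illustrative weights; exact gate conventions vary slightly between texts):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, b):
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W["z"] @ zx + b["z"])                  # update gate
    r_t = sigmoid(W["r"] @ zx + b["r"])                  # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r_t * h_prev, x_t]) + b["h"])
    return (1 - z_t) * h_prev + z_t * h_tilde            # single state, no separate C_t

hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) * 0.1 for k in "zrh"}
b = {k: np.zeros(hid) for k in "zrh"}

h = np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):                    # run over a 5-step sequence
    h = gru_step(x_t, h, W, b)
print("final h:", np.round(h, 3))
```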