Week 4: Sequence Learning with RNNs Flashcards
(14 cards)
What is the difference between a one-to-one mapping and sequence learning?
In sequence learning we don't have the typical one-to-one setup between a single input and a single output.
There are several mapping types:
one-to-many: image captioning (what is in this image?)
many-to-one: sentiment analysis
many-to-many (aligned): part-of-speech tagging
many-to-many (unaligned): machine translation
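A tiny shape sketch of these mapping types (all names and sizes below are invented for illustration, not taken from the course):

```python
import numpy as np

# Illustrative input/output shapes for each mapping type.
image   = np.zeros(512)              # one-to-many: one image feature vector in ...
caption = ["a", "cat", "sleeps"]     # ... a sequence of caption words out

review    = np.zeros((30, 100))      # many-to-one: 30 word embeddings in ...
sentiment = "positive"               # ... a single label out

words = np.zeros((7, 100))           # aligned many-to-many: 7 words in ...
tags  = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "NOUN"]  # ... exactly 7 tags out

source = np.zeros((6, 100))          # unaligned many-to-many: 6 source words in ...
target = np.zeros((9, 100))          # ... 9 target words out (lengths differ)

print(len(words), "inputs ->", len(tags), "tags;",
      len(source), "source words ->", len(target), "target words")
```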
Explain step-by-step how a Recurrent Neural Network processes a sequence
Step-by-Step RNN Processing:
- Initial State
Start with initial hidden state h₀ (often zeros)
Have input sequence: x₁, x₂, x₃, …, xₜ
- At Each Time Step t:
Input: Current input xₜ + previous hidden state hₜ₋₁
Computation:
Combine inputs: hₜ = f(Wₓₕxₜ + Wₕₕhₜ₋₁ + b)
f is an activation function (tanh, ReLU, etc.)
Outputs:
New hidden state hₜ (carries “memory” forward)
Optional output yₜ (if needed at this step)
- Key Insight - The Recurrence:
Same weights (Wₓₕ, Wₕₕ) used at every time step
Hidden state hₜ becomes input for next time step
Information flows: h₀ → h₁ → h₂ → h₃ → …
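A minimal NumPy sketch of this loop (weights, sizes, and the random data are made up for illustration; not the course's implementation):

```python
import numpy as np

# One RNN forward pass over a short sequence.
input_dim, hidden_dim, seq_len = 4, 8, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights (shared across steps)
b    = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # input sequence x_1 ... x_T
h = np.zeros(hidden_dim)                   # initial hidden state h_0 (zeros)

for t in range(seq_len):
    # h_t = f(W_xh x_t + W_hh h_{t-1} + b), with f = tanh
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b)
    print(f"h_{t+1}:", np.round(h, 3))
```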
What is One-to-Many architecture and what does the hidden state track in this setup?
Purpose of one-to-many:
Single input (e.g., image) → Multiple outputs (e.g., caption words)
Hidden State Role:
NOT tracking previous inputs
Tracking what outputs have been generated so far
Maintains memory of: original input + generation progress
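A rough sketch of one-to-many generation, assuming a hypothetical captioning setup where the image seeds h₀ and the previously generated token is fed back each step (all weights and sizes invented):

```python
import numpy as np

vocab = ["<start>", "a", "cat", "sleeps", "<end>"]
hidden_dim, embed_dim = 8, 4
rng = np.random.default_rng(0)

W_init = rng.normal(size=(hidden_dim, 16)) * 0.1        # maps image features to h_0
W_eh   = rng.normal(size=(hidden_dim, embed_dim)) * 0.1 # previous-token embedding -> hidden
W_hh   = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy   = rng.normal(size=(len(vocab), hidden_dim)) * 0.1
embed  = rng.normal(size=(len(vocab), embed_dim)) * 0.1

image = rng.normal(size=16)
h = np.tanh(W_init @ image)   # hidden state starts from the single input, not from zeros
token = 0                     # <start>

for _ in range(6):
    h = np.tanh(W_eh @ embed[token] + W_hh @ h)   # h tracks what has been generated so far
    token = int(np.argmax(W_hy @ h))              # greedy pick of the next word
    print(vocab[token])
    if vocab[token] == "<end>":
        break
```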
Explain Many-to-One sequence architecture and how the hidden state accumulates information for classification.
Process the entire sequence → produce a single prediction (e.g., classify the sentiment of a review, or predict the next word).
Hidden State Role:
At each step the hidden state folds the current input into its running summary of everything seen so far.
After the last step, the final hidden state h_T summarizes the whole sequence and is fed to a classifier/output layer to make the single prediction.
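A minimal many-to-one sketch, assuming a hypothetical sentiment-style setup where the final hidden state feeds a classifier (weights and sizes invented):

```python
import numpy as np

input_dim, hidden_dim, num_classes, seq_len = 100, 16, 2, 30
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.05
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.05
W_hy = rng.normal(size=(num_classes, hidden_dim)) * 0.05

x = rng.normal(size=(seq_len, input_dim))  # e.g. 30 word embeddings of a review
h = np.zeros(hidden_dim)
for t in range(seq_len):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)    # h accumulates a summary of x_1..x_t

logits = W_hy @ h                          # single prediction from the final state h_T
print("predicted class:", int(np.argmax(logits)))
```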
Aligned vs Unaligned Sequence-to-Sequence
Aligned Seq-to-Seq:
Definition: Input and output sequences have the same length and a direct correspondence
Each input position maps to exactly one output position (part-of-speech tagging)
Unaligned Seq-to-Seq:
Definition: Input and output lengths can differ and there is no position-by-position correspondence (machine translation)
Typically handled with an encoder-decoder architecture (next card)
Explain the encoder-decoder architecture and how it handles different input/output sequence lengths.
Encoder: Reads entire input sequence sequentially
Processes x₁ → x₂ → … → x_T
Updates hidden states: h₁ → h₂ → … → h_T
Final state h_T becomes context variable C
Context Variable C:
Fixed-size vector summarizing entire input
“Bridge” between encoder and decoder
Contains compressed information about input sequence
Decoder: Generates output sequence
Starts with context C as initial state
Generates the output sequence on its own, step by step: y₁ → y₂ → … → y_S
Output length S determined by decoder (not input length T)
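A minimal sketch of the encoder-decoder idea, assuming illustrative random weights and a hypothetical output length of 3 (not a trained model):

```python
import numpy as np

# Encoder compresses the input into a context vector C; decoder generates from C.
in_dim, hid, out_vocab = 4, 8, 5
rng = np.random.default_rng(0)

W_xh  = rng.normal(size=(hid, in_dim)) * 0.1
W_hh  = rng.normal(size=(hid, hid)) * 0.1
W_dec = rng.normal(size=(hid, hid)) * 0.1
W_hy  = rng.normal(size=(out_vocab, hid)) * 0.1

# Encoder: read the whole input sequence (length T = 7)
x = rng.normal(size=(7, in_dim))
h = np.zeros(hid)
for t in range(len(x)):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)
C = h                                   # fixed-size context summarizing the input

# Decoder: start from C and generate S = 3 outputs (S need not equal T)
s = C
for _ in range(3):
    s = np.tanh(W_dec @ s)
    y = np.argmax(W_hy @ s)
    print("output token id:", int(y))
```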
What is Backpropagation Through Time (BPTT) and what is the vanishing/exploding gradient problem in RNNs?
- Training method for RNNs: unfold (unroll) the network over time
- Unrolling turns the recurrent structure into a deep feedforward network with weights shared across time steps
- Standard backpropagation can then be applied through the unrolled network
- Vanishing/exploding gradients: backpropagating through T time steps multiplies T per-step factors (involving Wₕₕ and the activation derivative); factors below 1 make the gradient shrink toward zero (vanishing), factors above 1 make it blow up (exploding)
- In short: time unfolding makes deep networks, and deep networks have gradient problems
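A toy numeric illustration of why this happens, using a scalar stand-in for the per-step factor (values are made up):

```python
import numpy as np

# BPTT multiplies one factor per unrolled time step, so the product of many
# factors either shrinks or grows exponentially with sequence length.
for factor, label in [(0.8, "vanishing"), (1.2, "exploding")]:
    grad = 1.0
    for t in range(50):
        grad *= factor          # one multiplication per unrolled time step
    print(f"{label}: after 50 steps the gradient scale is {grad:.2e}")
```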
What is teacher forcing and why is it used in RNN training?
Definition:
Training technique where you use correct target outputs instead of model predictions as input for the next time step
The Problem it Solves:
Normal RNNs use their own (often wrong) predictions during training
This causes error accumulation and poor learning
How it Works:
Training: Feed correct previous output y_{t-1} to next step
Testing: Use model’s own predictions o_{t-1}
Benefits:
Faster and more stable training
Better gradient flow
Implements maximum likelihood learning
Trade-off:
Training/test mismatch: Model trains on perfect inputs but must handle imperfect ones during inference
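A tiny sketch of the input-selection logic behind teacher forcing (the token strings and helper function are invented for illustration):

```python
# Teacher forcing: during training feed the correct previous target y_{t-1};
# at inference feed the model's own previous prediction o_{t-1}.
def next_input(prev_target, prev_prediction, training, teacher_forcing=True):
    if training and teacher_forcing:
        return prev_target        # training: correct y_{t-1}
    return prev_prediction        # inference (or no forcing): own o_{t-1}

targets = ["<start>", "the", "cat", "sleeps"]
predictions = ["<start>", "a", "dog", "runs"]   # imagined (wrong) model outputs

for t in range(1, len(targets)):
    train_in = next_input(targets[t - 1], predictions[t - 1], training=True)
    test_in  = next_input(targets[t - 1], predictions[t - 1], training=False)
    print(f"step {t}: train input = {train_in!r}, test input = {test_in!r}")
```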
What is the LSTM cell state and how does it solve the vanishing gradient problem?
The cell state Cₜ is the LSTM's long-term memory: a vector carried along the sequence on a mostly linear path, modified only by the forget gate (scaling old content) and the input gate (adding new content).
Because the core information is not squashed through a tanh at every step, gradients flowing backward along the cell state do not shrink exponentially, which mitigates the vanishing gradient problem.
Explain the three LSTM gates and their roles in controlling information flow
Forget Gate:
Role: Gatekeeper - decides what to remember/forget from previous memory
Input: Previous hidden state + current input
Output: Values 0-1 for each cell state element
Function: Controls what old information gets discarded
Input Gate:
Role: Decides what new information is relevant and should be added
Input: Previous hidden state + current input
Output: Values 0-1 determining how much new info to incorporate
Function: Updates the cell state with new information
Output Gate:
Role: Determines how much current memory to share as output
Input: Previous hidden state + current input + updated cell state
Output: Values 0-1 controlling cell state output
Function: Controls what information gets passed to next time step
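A sketch of one LSTM step with the three gates written out (random illustrative weights; the gate equations follow the standard LSTM formulation described on this card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])            # previous hidden state + current input
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate: what old memory to keep (0-1)
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate: how much new info to add (0-1)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate: how much memory to expose (0-1)
    C_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate new information
    C_t = f_t * C_prev + i_t * C_tilde           # cell state: mostly linear update
    h_t = o_t * np.tanh(C_t)                     # hidden state: gated view of the cell state
    return h_t, C_t

hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) * 0.1 for k in "fioc"}
b = {k: np.zeros(hid) for k in "fioc"}

h, C = np.zeros(hid), np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):            # run over a 5-step sequence
    h, C = lstm_step(x_t, h, C, W, b)
print("final h:", np.round(h, 3))
print("final C:", np.round(C, 3))
```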
How does information flow in LSTM differ from standard RNN, and why does this solve vanishing gradients?
Standard RNN Flow:
h_{t-1} → [multiply by W] → [apply tanh] → h_t
Every step passes through tanh activation
Gradient path: Goes through tanh derivative (≤1) at every step
Result: Gradients shrink exponentially over time
LSTM Flow - Two Parallel Paths:
Path 1: Cell State (Linear Highway)
C_{t-1} → [forget gate] → [add new info] → C_t
No tanh applied to main information flow
Mostly linear operations preserve gradient magnitude
Path 2: Hidden State (Controlled Output)
C_t → [tanh] → [output gate] → h_t
Just a “view” of cell state for external use
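A toy numeric comparison of the two gradient paths (scalar stand-ins, values invented):

```python
# Path through a standard RNN multiplies a tanh-derivative-like factor (< 1)
# every step; the LSTM cell-state path multiplies the forget gate, which the
# network can learn to keep close to 1.
steps = 50
rnn_factor = 0.8        # stand-in for the per-step tanh/weight factor
forget_gate = 0.98      # a forget gate kept near 1

print(f"RNN path:  {rnn_factor ** steps:.2e}")   # shrinks to roughly 1e-5
print(f"LSTM path: {forget_gate ** steps:.2f}")  # stays around 0.36
```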
What is the effect of the gating mechanism in LSTM cells?
Primary Effect: Solves the vanishing gradient problem by enabling selective, controlled information flow
Key Effects:
1. Enables Long-Term Dependencies:
Can learn relationships across many time steps
Maintains relevant information over long sequences
2. Creates Selective Memory:
Forget gate: Discards irrelevant old information
Input gate: Selectively incorporates new information
Output gate: Controls what information is shared
3. Preserves Gradient Flow:
Linear pathways through cell state avoid repeated nonlinearities
Gates control rather than transform core information
Prevents gradient shrinkage that plagued standard RNNs
4. Improves Learning Performance:
Better sequence modeling capabilities
More stable training
Can handle longer sequences effectively
Implementation: Three gates (forget, input, output) work together to create controlled information highways
Bottom Line: The gating mechanism transforms RNNs from networks that forget quickly into networks with selective, persistent memory capable of learning complex temporal patterns.
Memory Tip: “Gates create smart memory: remember what matters, forget what doesn’t, share what’s needed”
What are the key differences between LSTM and GRU, and what are the trade-offs?
LSTM:
3 gates: Input, Output, Forget gates
Separate cell state (C_t) and hidden state (H_t)
More complex architecture with more parameters
GRU:
2 gates: Update gate, Reset gate (simpler)
Single hidden state (no separate cell state)
Fewer parameters than LSTM
Performance Trade-offs:
LSTM Advantages:
More accurate on longer sequences
Better long-term memory due to separate cell state
More expressive due to additional complexity
GRU Advantages:
Fewer training parameters → faster training
Simpler architecture → easier to implement
Often comparable performance on shorter sequences
Common Challenge:
Both: Many hyperparameters, difficult to master!
Both: Solve vanishing gradient problem through gating
When to Use:
LSTM: When you need maximum accuracy on long sequences
GRU: When you want simpler, faster training with good performance
Memory Tip: “LSTM = More gates, more accuracy; GRU = Fewer gates, faster training”
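For contrast with the LSTM sketch above, a minimal GRU step with only two gates and a single hidden state (random illustrative weights; exact gate conventions vary slightly between texts):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, b):
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W["z"] @ zx + b["z"])                  # update gate
    r_t = sigmoid(W["r"] @ zx + b["r"])                  # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r_t * h_prev, x_t]) + b["h"])
    return (1 - z_t) * h_prev + z_t * h_tilde            # single state, no separate C_t

hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) * 0.1 for k in "zrh"}
b = {k: np.zeros(hid) for k in "zrh"}

h = np.zeros(hid)
for x_t in rng.normal(size=(5, inp)):                    # run over a 5-step sequence
    h = gru_step(x_t, h, W, b)
print("final h:", np.round(h, 3))
```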