Week 3: Deep Learning Basics & Optimisation Flashcards
(10 cards)
What is backpropagation and why is it more efficient than forward mode differentiation?
Backpropagation = Reverse mode automatic differentiation
Problem with Forward Mode:
Must carry derivatives for every input/parameter forward through the whole chain (full Jacobians, or one pass per input direction)
For deep networks: computationally expensive
Chain rule: ∂o/∂x = J_f₄(x₄) J_f₃(x₃) J_f₂(x₂) J_f₁(x₁), where x₁ = x and xᵢ₊₁ = fᵢ(xᵢ)
Backpropagation Solution:
Start from output and work backwards
u^T J_f(x) = (u · ∂f/∂x₁, …, u · ∂f/∂xₙ)
Key advantage: More efficient when output dimension < input dimension
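A minimal NumPy sketch of why reverse mode wins when the output is a scalar: forward mode pushes a full n × n Jacobian through the chain, while backpropagation only pulls the single row vector uᵀ backwards. The four random matrices are illustrative stand-ins for the layer Jacobians J_f₁ … J_f₄.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                                                 # input/hidden dimension (illustrative)
Ws = [rng.standard_normal((n, n)) for _ in range(4)]    # stand-ins for the layer Jacobians

# Forward mode: push a full n x n Jacobian through every layer (matrix-matrix products).
J = np.eye(n)
for W in Ws:
    J = W @ J                                           # ~n^3 work per layer

# Reverse mode (backprop): for a scalar output o = u . f(x), pull the single row
# vector u^T back through the layers (matrix-vector products only).
u = rng.standard_normal(n)
vjp = u
for W in reversed(Ws):
    vjp = vjp @ W                                       # u^T J_f4 J_f3 J_f2 J_f1, ~n^2 per layer

# Same gradient of o with respect to x, but without any n x n matrix products.
assert np.allclose(vjp, u @ J)
```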
What are vanishing and exploding gradients, and why do they occur in deep networks?
Problem: Gradients become exponentially small (vanishing) or large (exploding) as they propagate through deep networks.
Why it happens:
Chain rule multiplication: Gradients multiply at each layer during backpropagation
Small weights (< 1): Gradients shrink exponentially → vanishing gradients
Large weights (> 1): Gradients grow exponentially → exploding gradients
Consequences:
Vanishing: Early layers learn very slowly or not at all
Exploding: Training becomes unstable, weights blow up
Deep networks (>10 layers): Early layers lose influence on output
Impact: Makes training deep networks very difficult - either early layers don’t learn or training becomes unstable.
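A toy demonstration of the chain-rule multiplication (the uniform weight scales are an illustrative assumption): the same upstream gradient is backpropagated through 50 layers, and its norm shrinks or grows exponentially depending on whether the layer Jacobians are smaller or larger than 1.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n = 50, 64
grad = rng.standard_normal(n)                # upstream gradient at the output

for scale, label in [(0.5, "small weights"), (1.5, "large weights")]:
    g = grad.copy()
    for _ in range(depth):
        W = scale * np.eye(n)                # toy layer Jacobian, chosen for illustration
        g = g @ W                            # chain-rule multiplication at each layer
    print(f"{label}: gradient norm after {depth} layers = {np.linalg.norm(g):.2e}")
# small weights -> ~1e-15 (vanishing), large weights -> ~1e+09 (exploding)
```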
Is deep learning success just about having bigger networks?
No! Size alone isn’t the answer.
Key insights:
Network parameter count has grown exponentially (driven by hardware)
But analogies to brain size are often ill-posed
More parameters ≠ automatically better performance
Architecture, training methods, and data quality matter more
How does convolution work in CNNs?
Convolution = Sliding filter/kernel over input to create feature map
Process:
Shift filter/kernel over input pattern (image)
Calculate element-wise multiplication + sum
Result: Feature map (smaller projection)
Goal: Find highest values (best template matches)
Formula: (W * X)_{i,j} = ∑_u ∑_v w_{u,v} x_{i+u,j+v}
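A rough Python sketch of the formula above (conv2d_valid is a hypothetical helper name): the kernel is slid over the image and each output entry is the element-wise multiply-and-sum over the overlapping window, i.e. the no-padding, stride-1 case.

```python
import numpy as np

def conv2d_valid(x, w):
    """Slide kernel w over image x (no padding, stride 1):
    out[i, j] = sum_u sum_v w[u, v] * x[i + u, j + v]."""
    kh, kw = w.shape
    fh, fw = x.shape
    out = np.zeros((fh - kh + 1, fw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw])   # element-wise multiply + sum
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel (illustrative)
print(conv2d_valid(x, w).shape)                # (3, 3): a smaller feature map
```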
What are padding and striding in CNNs?
Padding: Add border (usually zeros) around input
Purpose: Handle boundaries, maintain feature map size
Formula (stride 1): (f_h + 2p_h - k_h + 1) × (f_w + 2p_w - k_w + 1), where f = input size, k = kernel size, p = padding
Striding: Skip pixels when sliding filter
Stride > 1: More efficient, smaller feature maps
Purpose: Reduce computational cost and output size
Trade-off: Less detailed feature maps
Effect: Both control output feature map dimensions and computational efficiency.
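A small helper that evaluates this output-size formula per axis (output_size is an illustrative name, not a library function) and shows how padding preserves resolution while striding trades detail for efficiency.

```python
def output_size(f, k, p=0, s=1):
    """Feature-map size along one axis: input f, kernel k, padding p, stride s.
    With s = 1 this reduces to the formula above, f + 2p - k + 1."""
    return (f + 2 * p - k) // s + 1

# 32x32 input, 5x5 kernel:
print(output_size(32, 5))             # 28 -> no padding, stride 1 shrinks the map
print(output_size(32, 5, p=2))        # 32 -> "same" padding keeps the original size
print(output_size(32, 5, p=2, s=2))   # 16 -> stride 2 halves it: cheaper, less detailed
```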
What is pooling in CNNs and why is it used?
Pooling = Downsampling feature maps into smaller, higher-level, more location-invariant representations
Purpose:
Translation invariance: Object detection regardless of exact location
Dimensionality reduction: Fewer parameters, less computation
Higher-level features: Focus on “what” not “where”
Types:
Max pooling: Take maximum value in region
Average pooling: Take mean value in region
Global pooling: Pool entire feature map
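A minimal NumPy sketch of non-overlapping pooling (pool2d is an assumed helper name; frameworks also offer strided/overlapping variants): the feature map is split into size × size windows and each window is reduced to its maximum or mean.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Downsample a 2D feature map with non-overlapping size x size windows."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]                   # crop to a multiple of the window
    blocks = x.reshape(h // size, size, w // size, size)  # (row win, in-win, col win, in-win)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))   # 2x2 map of window maxima
print(pool2d(fmap, mode="avg"))   # 2x2 map of window means
```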
What are residual connections and why are they important?
Residual connections = Shortcut connections that perform identity mapping
Structure: F(x) + x (output = learned transformation + original input)
Key benefits:
Mitigates vanishing gradient problem: Gradients flow through network more effectively
Identity mapping: Allows network to learn “what to change” rather than “what to output”
Same parameter count: No additional weights needed
Skip layers: Maintains reference to untransformed information
Result: Enables training of much deeper networks (100+ layers) successfully.
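A bare-bones sketch of one residual block in NumPy (residual_block and the two-layer form of F are illustrative choices, not a specific ResNet variant): the block computes F(x), then adds the untransformed input back, so the shortcut itself adds no weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Output = F(x) + x: the block only has to learn what to change about x."""
    fx = relu(x @ W1) @ W2        # F(x), a small two-layer transformation
    return relu(fx + x)           # identity shortcut adds the original input back

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
W1, W2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
print(residual_block(x, W1, W2).shape)   # same shape as x: the shortcut needs matching dims
```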
How do residual connections solve the vanishing gradient problem?
Residual connections provide gradient “highways” through the network
Problem: In deep networks, gradients become exponentially small through multiplication
Solution:
Shortcut paths: Gradients can flow directly through skip connections
Additive structure: F(x) + x means gradients have a direct path backward
Identity preservation: Network can easily learn identity function if needed
Practical impact: Networks an order of magnitude deeper become trainable (152-layer ResNet vs. 8-layer AlexNet)
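A toy comparison of how the same gradient survives 50 layers with and without skip connections, under the illustrative assumption that each residual branch has a deliberately small random Jacobian; with the shortcut, each layer's effective Jacobian becomes J_F + I, and the identity term is the direct path backward.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n = 50, 32
grad = rng.standard_normal(n)

g_plain, g_res = grad.copy(), grad.copy()
for _ in range(depth):
    J_F = 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)   # small residual-branch Jacobian
    g_plain = g_plain @ J_F                  # plain stack: product of small factors -> vanishes
    g_res = g_res @ (J_F + np.eye(n))        # residual stack: identity keeps a gradient highway

print(f"plain stack:    {np.linalg.norm(g_plain):.2e}")    # ~1e-50
print(f"residual stack: {np.linalg.norm(g_res):.2e}")      # stays around its original size
```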
What makes CNN filters powerful for feature extraction?
Filter hierarchy provides automatic feature extraction with built-in invariances
Power comes from:
Filter hierarchy: Edge detectors → shapes → objects → concepts
Translation invariance: Features detected regardless of location
Representation transformation: Raw pixels → meaningful semantic features
Result: CNNs automatically learn the right features for the task, eliminating the need for manual feature engineering.
Impact: Revolutionized computer vision by learning better features than human-designed ones.
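A small illustration of the first rung of that hierarchy: a hand-written vertical-edge filter (Sobel-like, used here only as an example; a trained CNN learns filters of this kind rather than having them hand-designed) responds only where its window covers an edge.

```python
import numpy as np

# Vertical-edge filter: negative weights on the left, positive on the right.
edge_filter = np.array([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

image = np.zeros((5, 5))
image[:, 3:] = 1.0                 # dark left columns, bright right columns -> vertical edge

# Valid cross-correlation, the same operation as on the convolution card.
out = np.array([[np.sum(edge_filter * image[i:i + 3, j:j + 3])
                 for j in range(3)] for i in range(3)])
print(out)                         # zero away from the edge, strong response where the window covers it
```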
How does a CNN work? Please describe the effect of convolutional layers.
A CNN stacks convolutional layers, usually interleaved with pooling, so that raw pixels are progressively transformed into higher-level feature maps.
Effect of convolutional layers:
Each filter is slid over the input and responds to its pattern wherever it appears (translation invariance)
Stacked layers form a filter hierarchy: edge detectors → shapes → objects → concepts
Padding and stride control the size of the resulting feature maps and the computational cost
Result: Meaningful, task-relevant features are learned automatically instead of being hand-engineered.