Week 3: Deep Learning Basics & Optimisation Flashcards

(11 cards)

1
Q

What is backpropagation and why is it more efficient than forward mode differentiation?

A

Backpropagation = Reverse mode automatic differentiation

Problem with Forward Mode:

Forward mode pushes derivatives through the chain from the input side, one input direction at a time (or carries full Jacobians forward)
For deep networks with high-dimensional inputs/parameters this is computationally expensive
Chain rule: ∂o/∂x = J_f₄(x₄) J_f₃(x₃) J_f₂(x₂) J_f₁(x), where xᵢ is the input to layer i; accumulating this product from the input side means multiplying large Jacobian matrices together

Backpropagation Solution:

Start from the output and work backwards, accumulating the same product from the output side as vector-Jacobian products
u^T J_f(x) = (u · ∂f/∂x₁, …, u · ∂f/∂xₙ), i.e. only ever a vector times a matrix, never a full Jacobian-Jacobian product
Key advantage: More efficient when the output dimension is smaller than the input dimension (e.g. a scalar loss with millions of parameters); see the sketch below
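
A minimal NumPy sketch of the idea, assuming a chain of purely linear layers (so each Jacobian J_fᵢ is just the weight matrix Wᵢ); the dimensions are illustrative:

```python
import numpy as np

# Chain of 4 linear layers f_i(x) = W_i x composed as f4(f3(f2(f1(x)))), with a
# 1000-dimensional input and a scalar output. Reverse mode only ever multiplies
# a row vector by one layer Jacobian at a time, so no large Jacobian-Jacobian
# products are formed.
rng = np.random.default_rng(0)
dims = [1000, 500, 200, 50, 1]                      # input -> hidden -> scalar output
Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.01 for i in range(4)]

x = rng.standard_normal(dims[0])

# Forward pass: store the input of every layer (in general each J_f_i is
# evaluated at these stored values; for a linear layer it is simply W_i).
activations = [x]
for W in Ws:
    activations.append(W @ activations[-1])

# Backward pass: start from u = d(output)/d(output) = 1 and repeatedly form
# the vector-Jacobian product u^T J_f_i.
u = np.ones(dims[-1])
for W in reversed(Ws):
    u = u @ W                                       # cheap vector-matrix product

grad = u                                            # d(output)/d(input), shape (1000,)
print(grad.shape)
```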

2
Q

What are vanishing and exploding gradients, and why do they occur in deep networks?

A

Problem: Gradients become exponentially small (vanishing) or large (exploding) as they propagate through deep networks.
Why it happens:

Chain rule multiplication: Gradients are multiplied by one factor per layer during backpropagation
Per-layer factors with magnitude < 1 (small weights, saturating activations): gradients shrink exponentially → vanishing gradients
Per-layer factors with magnitude > 1 (large weights): gradients grow exponentially → exploding gradients

Consequences:

Vanishing: Early layers learn very slowly or not at all
Exploding: Training becomes unstable, weights blow up
The deeper the network (e.g. >10 layers), the worse the effect: early layers lose their influence on the output

Impact: Makes training deep networks very difficult, since either the early layers don't learn or training becomes unstable (see the toy example below).
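
A toy illustration with made-up per-layer factors; only the arithmetic matters here:

```python
import numpy as np

# Treat each layer's local derivative as a single factor: the gradient reaching
# the first layer is the product of all of them.
depth = 50
print(0.9 ** depth)   # ~0.005 -> vanishing: early layers barely receive a signal
print(1.1 ** depth)   # ~117   -> exploding: updates blow up and training diverges

# The same happens with matrix-valued layer Jacobians: with "small" weights the
# gradient norm shrinks roughly exponentially with depth.
rng = np.random.default_rng(0)
u = np.ones(64)
for _ in range(depth):
    J = rng.standard_normal((64, 64)) * 0.05        # singular values well below 1
    u = u @ J
print(np.linalg.norm(u))                            # astronomically small
```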

3
Q

Is deep learning success just about having bigger networks?

A

No! Size alone isn’t the answer.
Key insights:

Network parameter count has grown exponentially (driven by hardware)
But analogies to brain size are often ill-posed

More parameters ≠ automatically better performance
Architecture, training methods, and data quality matter more

4
Q

How does convolution work in CNNs?

A

Convolution = Sliding filter/kernel over input to create feature map

Process:

Shift the filter/kernel over the input pattern (image)
At each position, multiply element-wise and sum
Result: Feature map (a smaller projection of the input)
Goal: High values in the feature map mark where the filter's template matches the input best

Formula: (W * X)_{i,j} = ∑_u ∑_v w_{u,v} x_{i+u,j+v} (see the naive implementation below)
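
A naive NumPy implementation of that formula (a "valid" cross-correlation, which is what deep-learning frameworks call convolution); the toy image and filter are illustrative:

```python
import numpy as np

def conv2d_valid(X, W):
    """Slide W over X, multiply element-wise and sum:
    (W * X)_{i,j} = sum_u sum_v w_{u,v} * x_{i+u, j+v}."""
    kh, kw = W.shape
    out = np.zeros((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + kh, j:j + kw])
    return out

# Toy example: a tiny edge filter responds most strongly where the image jumps
# from dark (0) to bright (1), i.e. where the template matches best.
X = np.zeros((5, 5)); X[:, 3:] = 1.0
W = np.array([[-1.0, 1.0]])
print(conv2d_valid(X, W))       # feature map: 1s along the edge, 0s elsewhere
```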

5
Q

What are padding and striding in CNNs?

A

Padding: Add border (usually zeros) around input

Purpose: Handle boundaries, maintain feature map size
Output size (stride 1): (f_h + 2p_h - k_h + 1) × (f_w + 2p_w - k_w + 1), for an f_h × f_w input, padding p and kernel size k

Striding: Skip pixels when sliding filter

Stride > 1: More efficient, smaller feature maps
Purpose: Reduce computational cost and output size
Trade-off: Less detailed feature maps

Effect: Both control the output feature map dimensions and the computational cost (see the size helper below).
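
A one-line helper for the size arithmetic. The general form with stride s is floor((f + 2p - k) / s) + 1, which reduces to the formula above for s = 1; the example numbers are illustrative:

```python
def conv_output_size(f, k, p=0, s=1):
    """Output length along one dimension for input size f, kernel k, padding p, stride s."""
    return (f + 2 * p - k) // s + 1

# 32-pixel input, 5-pixel kernel:
print(conv_output_size(32, 5))              # 28 -> no padding shrinks the map
print(conv_output_size(32, 5, p=2))         # 32 -> "same" padding keeps the size
print(conv_output_size(32, 5, p=2, s=2))    # 16 -> stride 2 halves it (cheaper, coarser)
```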

6
Q

What is pooling in CNNs and why is it used?

A

Pooling = Downsample feature maps to higher-level, location-invariant information
Purpose:

Translation invariance: Object detection regardless of exact location
Dimensionality reduction: Fewer parameters, less computation
Higher-level features: Focus on “what” not “where”

Types:

Max pooling: Take maximum value in region
Average pooling: Take mean value in region
Global pooling: Pool the entire feature map down to a single value per channel (see the sketch below)
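
A minimal NumPy sketch of max pooling; the window size and input are illustrative:

```python
import numpy as np

def max_pool2d(X, size=2, stride=2):
    """Downsample by taking the maximum of each (size x size) window."""
    h = (X.shape[0] - size) // stride + 1
    w = (X.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = X[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

X = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(X))          # [[ 5.  7.] [13. 15.]]: keeps "what", discards the exact "where"
# Average pooling would use .mean() per window; global pooling collapses the
# whole feature map to a single value (e.g. X.max() or X.mean()).
```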

7
Q

What are residual connections and why are they important?

A

Residual connections = Shortcut connections that perform identity mapping

Structure: F(x) + x (output = learned transformation + original input)
Key benefits:

Mitigates vanishing gradient problem: Gradients flow through network more effectively

Identity mapping: Allows network to learn “what to change” rather than “what to output”

Same parameter count: No additional weights needed
Skips layers: The shortcut carries the untransformed input forward unchanged

Result: Enables successful training of much deeper networks (100+ layers); a minimal block is sketched below.
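
A minimal sketch of such a block in PyTorch, assuming the transformation F preserves channel count and spatial size so the identity shortcut can be added directly; the layer choices are illustrative:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """output = F(x) + x, where F is a small stack of convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        fx = self.conv2(self.relu(self.conv1(x)))   # F(x): the learned "change"
        return self.relu(fx + x)                    # F(x) + x: add the identity shortcut

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)                   # torch.Size([1, 16, 8, 8]), same as input
```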

8
Q

How do residual connections solve the vanishing gradient problem?

A

Residual connections provide gradient “highways” through the network
Problem: In deep networks, gradients become exponentially small through multiplication
Solution:

Shortcut paths: Gradients can flow directly through skip connections
Additive structure: F(x) + x means gradients have direct path backward
Identity preservation: Network can easily learn identity function if needed

Practical impact: Networks roughly an order of magnitude deeper can be trained (ResNet's 152 layers vs. AlexNet's 8); see the toy comparison below
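
A toy scalar comparison with made-up local derivatives: for y = F(x) + x the local derivative is F'(x) + 1, so the backward product contains factors close to 1 instead of factors close to 0:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
local_derivs = rng.uniform(0.0, 0.2, size=depth)   # each layer's F'(x) is small

plain_grad = np.prod(local_derivs)            # product of F'_i        -> vanishes
residual_grad = np.prod(1.0 + local_derivs)   # product of (1 + F'_i)  -> stays >= 1

print(plain_grad)      # effectively zero: no signal reaches the early layers
print(residual_grad)   # comfortably above 1: the identity path keeps the gradient alive
```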

9
Q

What makes CNN filters powerful for feature extraction?

A

Filter hierarchy provides automatic feature extraction with built-in invariances
Power comes from:

Filter hierarchy: Edge detectors → shapes → objects → concepts
Translation invariance: Features are detected regardless of their location in the input (see the check below)
Representation transformation: Raw pixels → meaningful semantic features

Result: CNNs automatically learn the right features for the task, eliminating the need for manual feature engineering.
Impact: Revolutionized computer vision by learning better features than human-designed ones.
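
A small NumPy check of the location claim: convolution is translation-equivariant, meaning a filter's response simply moves with the pattern (pooling on top of this is what buys invariance). The filter and pattern below are random placeholders:

```python
import numpy as np

def conv2d_valid(X, W):
    # same naive sliding-window convolution as on the earlier card
    kh, kw = W.shape
    out = np.zeros((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(W * X[i:i + kh, j:j + kw])
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))                    # any filter, learned or hand-crafted
X = np.zeros((8, 8)); X[1:4, 1:4] = rng.standard_normal((3, 3))   # a small "pattern"
X_shifted = np.roll(X, shift=2, axis=1)            # same pattern, moved 2 pixels right

A, B = conv2d_valid(X, W), conv2d_valid(X_shifted, W)
print(np.allclose(A[:, :4], B[:, 2:6]))            # True: same response, shifted location
```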

10
Q

How does a CNN work?

A

A CNN stacks convolutional layers, each sliding learned filters over its input to produce feature maps (with padding and striding controlling the map size), interleaved with pooling layers that downsample and keep the "what" rather than the "where". Repeating this builds a feature hierarchy from edges to shapes to objects, residual connections let very deep stacks train, and the resulting features are used for the final prediction.
11
Q

Please describe the effect
of convolutional layers.

A

Convolutional layers transform the raw input into feature maps: each learned filter slides over the input and responds wherever its pattern occurs, so the same feature is detected regardless of location. Stacking such layers turns raw pixels into increasingly abstract representations (edge detectors → shapes → objects), removing the need for manual feature engineering.