Week 5: Regularisation & Hyperparameter Optimisation Flashcards
(7 cards)
Describe L1 regularization and explain its key effects on neural network training.
- L1: ||w||₁ = Σ|wᵢ| → Diamond → Sparsity
- L2: ||w||₂ = √(Σwᵢ²) → Circle → Shrinkage only
Adds absolute penalty to the loss function
Formula: ŵ_MAP = argmin_w NLL(w) + λ||w||₁
L1 norm: ||w||₁ = Σ|w_d| (sum of absolute values)
Mathematical Properties:
Tightest convex relaxation of L0 norm
L0 norm: ||w||₀ = number of non-zero weights
Geometric shape: Diamond/rhombus constraint region
Key Effects:
1. Feature Selection:
Built-in feature selection: Can set w_k,l = 0 (discards input x_k)
Automatic relevance determination: Irrelevant features get zero weights
2. Sparsity:
Leads to many zeros → sparse weight vectors
Weights clamped to 0 or active values (sharp transitions)
3. Noise Robustness:
Ignores noisy inputs by setting their weights to zero
Focuses on important features
Analogy: Similar to Lasso regression applied to neural network parameters
Memory Tip: “L1 = Lasso = Lots of zeros = feature selection”
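A minimal NumPy sketch (an illustration, not from the lecture) of the L1-regularized objective and the soft-thresholding proximal step that produces the exact zeros described above:

    import numpy as np

    def l1_objective(nll, w, lam):
        """Regularized loss: NLL(w) + λ‖w‖₁."""
        return nll + lam * np.sum(np.abs(w))

    def soft_threshold(w, step, lam):
        """Proximal step for the L1 penalty: shrinks every weight toward 0
        and clamps weights below the threshold exactly to 0 (sparsity)."""
        return np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

    w = np.array([0.8, -0.05, 0.02, -1.3])
    print(soft_threshold(w, step=0.1, lam=1.0))  # → [ 0.7  0.  0. -1.2]

Running it shows the two small weights clamped to exactly 0 while the large ones merely shrink by the threshold: built-in feature selection.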
Describe L2 regularization and explain how it differs from L1 regularization in its effects.
Tikhonov regularization: Adds squared penalty to loss function
Formula: ŵ_MAP = argmin_w NLL(w) + λ||w||₂²
L2 norm: ||w||₂ = √(Σw_d²) = √(w^T w)
Key Effects:
1. Weight Decay:
Penalizes “peaky” weight vectors (weights far from zero)
Makes parameters smaller → drives weights closer to origin
Smooth shrinkage rather than setting to zero
2. Geometric Constraint:
Circular constraint region (vs L1’s diamond shape)
Weights driven closer to origin uniformly
3. Optimization Properties:
The regularizer alone would be minimized with all w_k,l = 0, but that gives a poor loss fit
Trade-off: Regularization (R) vs Loss (L) are antagonistic
Smooth solutions preferred over sparse ones
Practical Usage:
Global cross-validated λ (typically around 0.01)
Most common regularization in deep learning
Analogy: Similar to ridge regression applied to neural network parameters
Key Difference from L1: Shrinks weights smoothly rather than selecting features sparsely
Memory Tip: “L2 = Ridge = Smooth shrinkage toward zero”
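A matching NumPy sketch (same illustrative assumptions as the L1 example) showing why L2 acts as smooth weight decay rather than a zeroing operator:

    import numpy as np

    def l2_objective(nll, w, lam):
        """Regularized loss: NLL(w) + λ‖w‖₂²."""
        return nll + lam * np.dot(w, w)

    def sgd_step_with_weight_decay(w, grad_nll, lr=0.1, lam=0.01):
        """The gradient of λ‖w‖₂² is 2λw, so the decay term alone
        multiplies w by (1 - 2·lr·λ) each step: smooth shrinkage
        toward the origin, never exact zeros."""
        return w - lr * (grad_nll + 2.0 * lam * w)

    w = np.array([0.8, -0.05, 0.02, -1.3])
    print(sgd_step_with_weight_decay(w, grad_nll=np.zeros_like(w)))
    # → every weight scaled by 0.998; compare with L1's exact zeros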
How does regularization improve generalization and what is the relationship between complexity penalty and overfitting?
MAP Estimation Framework:
Goal: Obtain point estimate of unobserved quantity based on data
Without regularization: Pure likelihood maximization can overfit
With regularization: Balance data fit with model complexity
Regularization Formula:
L(θ; λ) = [1/N Σ_n ℓ(y_n, θ; x_n)] + λC(θ)
First term: Data likelihood (how well model fits training data)
Second term: Complexity penalty C(θ) (prior belief about θ)
λ: Controls trade-off between fit and complexity
Bayesian Interpretation:
Regularization = Prior on parameters
L(θ; λ) = -[log p(D|θ) + log p(θ)]
Incorporates prior beliefs about reasonable parameter values
Generalization Benefit:
Prevents overfitting by penalizing complex models
Reduces gradient variance → more stable minimum
Smoother loss landscape → better generalization
Trade-off: Less perfect fit on training data, better performance on test data
Key Insight: Regularization yields a smoother, more stable loss surface with lower variance across different data batches
Memory Tip: “Regularization = Smooth complexity penalty = Better generalization”
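One step the card leaves implicit: how a specific prior turns into a specific penalty. The standard MAP derivation, written in the card's own notation:

    θ_MAP = argmax_θ p(θ|D) = argmax_θ [log p(D|θ) + log p(θ)]
          = argmin_θ [NLL(θ) - log p(θ)]
    Gaussian prior p(θ) = N(0, σ²I)  ⇒  -log p(θ) = ||θ||₂²/(2σ²) + const  ⇒  L2 penalty with λ = 1/(2σ²)
    Laplace prior                    ⇒  -log p(θ) ∝ ||θ||₁                 ⇒  L1 penalty

So a stronger prior (smaller σ²) means a larger λ: the more you believe weights should stay small, the harder you penalize them.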
What is normalization in neural networks and how does it solve the internal covariate shift problem?
Problem: Internal Covariate Shift
Input distributions change after dense/conv layers
Gradients become dependent on weight initialization and scale
Training becomes unstable and slow
Solution: Normalize Activations
Zero mean and unit variance: ẑ_n = (z_n - μ_B)/√(σ²_B + ε)
Scale and shift: z̃_n = γ ⊙ ẑ_n + β (learnable parameters)
Types of Normalization:
Batch Norm: Normalize across batch dimension
Layer Norm: Normalize across feature dimension
Instance Norm: Normalize each instance separately
Group Norm: Normalize within feature groups
Key Advantages:
Reduced gradient dependence on weight initialization & scale
More stable training with higher learning rates
Keeps the input distribution to each layer consistent across training steps
Regularization effect (slight noise from batch statistics)
When Applied: Typically after linear transformations, before activation functions
Memory Tip: “Normalization = Stabilize distributions = Stable training”
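A minimal NumPy sketch of the batch-norm formula above, using training-time batch statistics only (a real layer additionally tracks running averages for test time):

    import numpy as np

    def batch_norm(z, gamma, beta, eps=1e-5):
        """z: (batch, features). Normalize each feature over the batch,
        then scale/shift with the learnable parameters gamma, beta."""
        mu = z.mean(axis=0)                    # μ_B, per feature
        var = z.var(axis=0)                    # σ²_B, per feature
        z_hat = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
        return gamma * z_hat + beta            # z̃ = γ ⊙ ẑ + β

    z = np.random.randn(32, 4) * 10 + 5        # badly scaled activations
    out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0), out.std(axis=0))   # ≈ 0 and ≈ 1 per feature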
Explain dropout regularization and how it prevents overfitting through co-adaptation.
Core Mechanism:
Stochastically remove hidden units during training
Set connections to 0 with probability p
Prevents co-adaptation between neurons
How it Works:
Training: Randomly drop units (typically p ∈ [0.3, 0.5] for CNNs)
Testing: Use all units with scaled weights, or keep dropout active for Monte Carlo (MC) dropout
Why it Prevents Overfitting:
Prevents units from co-adaptation: Forces individual neurons to be useful
Behaves like Bernoulli noise: Adds stochasticity to training
Network ensemble effect: Each training step uses different sub-network
Bayesian Interpretation:
Approximates the Bayesian posterior predictive: p(y|x, D) ≈ 1/S Σ_{s=1}^{S} p(y|x, W_s, b_s), averaging over S dropout-sampled weights
Gaussian process approximation of Bayesian predictive distribution
Uncertainty quantification through multiple forward passes
Variants:
Standard dropout: Remove hidden units
Zoneout: Stochastically preserve previous hidden states across timesteps (for RNNs)
Memory Tip: “Dropout = Random removal = Prevents co-dependence = Better generalization”
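A minimal NumPy sketch of the inverted dropout described above; the MC-dropout line uses a hypothetical predict function standing in for a full model:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p, training=True):
        """Inverted dropout: zero each unit with probability p during
        training and rescale survivors by 1/(1-p), so the expected
        activation matches test time (where all units are used)."""
        if not training:
            return h
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h = rng.standard_normal((2, 5))
    print(dropout(h, p=0.5))  # roughly half the units zeroed, survivors scaled ×2

    # MC dropout at test time: keep dropout active and average S passes
    # (predict is a hypothetical model function that applies dropout inside):
    # probs = np.mean([predict(x, training=True) for _ in range(S)], axis=0)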
What are the main challenges in neural network optimization beyond basic SGD problems?
Optimization Reality:
Convergence is never guaranteed: there are no general theoretical guarantees
Gradient norms can even increase during training
Classification error keeps decreasing despite this gradient instability
Practical Solutions:
Rules of thumb and empirical experience guide training
Regularization techniques help stabilize training
Adaptive learning rates and momentum methods
Gradient clipping for exploding gradients
Key Insight: Neural network optimization is fundamentally challenging - success requires experience and heuristics, not just theory
Memory Tip: “Deep learning = Empirical art + Mathematical science”
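A minimal sketch of the gradient clipping mentioned above (the same idea PyTorch implements as torch.nn.utils.clip_grad_norm_):

    import numpy as np

    def clip_grad_norm(grads, max_norm=1.0):
        """Rescale the gradient if its global L2 norm exceeds max_norm,
        keeping its direction while bounding exploding updates."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads

    grads = [np.array([3.0, 4.0])]               # global norm 5.0
    print(clip_grad_norm(grads, max_norm=1.0))   # → [[0.6, 0.8]], norm 1.0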
What are the main approaches to hyperparameter optimization and why is systematic search important?
The Hyperparameter Explosion:
Model complexity → increased hyperparameters
Examples: Hidden units, layers, activation functions, convolution stride/filters, epochs, learning rate, batch size, optimizer, regularization, momentum…
Optimization Goal:
Monitor validation loss (not training loss)
Systematic procedures beat random guessing
Search Strategies:
1. Intuition + Grid Search (most common)
Structured exploration of parameter space
Regular grid over important parameters
Good for understood domains
2. Random Search (when no intuition)
Better than grid for high-dimensional spaces
Covers space more efficiently
Good starting point
3. Bayesian Optimization (theoretical best)
Models parameter-performance relationship
Intelligent exploration of promising regions
Most sample-efficient
4. Other Methods: Evolutionary algorithms, gradient-based optimization
Why Expertise Takes Time: Learning which hyperparameters matter most for different problems requires extensive experience
Memory Tip: “Grid for intuition, Random for exploration, Bayesian for efficiency”
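A minimal random-search sketch; train_and_validate is a hypothetical stand-in that trains a model for one configuration and returns its validation loss:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_search(n_trials=20):
        """Sample configurations at random and keep the one with the
        lowest *validation* loss (never the training loss)."""
        best_cfg, best_val = None, np.inf
        for _ in range(n_trials):
            cfg = {
                "lr": 10 ** rng.uniform(-4, -1),           # log-uniform learning rate
                "batch_size": int(rng.choice([32, 64, 128])),
                "weight_decay": 10 ** rng.uniform(-5, -2),
            }
            val_loss = train_and_validate(cfg)  # hypothetical helper
            if val_loss < best_val:
                best_cfg, best_val = cfg, val_loss
        return best_cfg, best_val

Sampling the learning rate log-uniformly rather than uniformly is the usual practice, since plausible values span several orders of magnitude.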