Week 5: Regularisation & Hyperparameter Optimisation Flashcards
(7 cards)
Describe L1 regularization and explain its key effects on neural network training.
- L1: ||w||₁ = Σ|wᵢ| → Diamond → Sparsity
- L2: ||w||₂ = √(Σwᵢ²) → Circle → Shrinkage only
Adds absolute penalty to the loss function
Formula: ŵ_MAP = argmin_w NLL(w) + λ||w||₁
L1 norm: ||w||₁ = Σ|w_d| (sum of absolute values)
Mathematical Properties:
Tightest convex relaxation of L0 norm
L0 norm: ||w||₀ = number of non-zero weights
Geometric shape: Diamond/rhombus constraint region
Key Effects:
1. Feature Selection:
Built-in feature selection: Can set w_k,l = 0 (discards input x_k)
Automatic relevance determination: Irrelevant features get zero weights
2. Sparsity:
Leads to many zeros → sparse weight vectors
Weights clamped to 0 or active values (sharp transitions)
3. Noise Robustness:
Ignores noisy inputs by setting their weights to zero
Focuses on important features
Analogy: Similar to Lasso regression applied to neural network parameters
Memory Tip: “L1 = Lasso = Lots of zeros = feature selection”
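A minimal NumPy sketch (an illustration, not from the lecture) of the L1-regularized objective and the soft-thresholding proximal step that produces the exact zeros described above:

    import numpy as np

    def l1_objective(nll, w, lam):
        """Regularized loss: NLL(w) + λ‖w‖₁."""
        return nll + lam * np.sum(np.abs(w))

    def soft_threshold(w, step, lam):
        """Proximal step for the L1 penalty: shrinks every weight toward 0
        and clamps weights below the threshold exactly to 0 (sparsity)."""
        return np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

    w = np.array([0.8, -0.05, 0.02, -1.3])
    print(soft_threshold(w, step=0.1, lam=1.0))  # → [ 0.7  0.  0. -1.2]

Running it shows the two small weights clamped to exactly 0 while the large ones merely shrink by the threshold: built-in feature selection.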
Describe L2 regularization and explain how it differs from L1 regularization in its effects.
Tikhonov regularization: Adds squared penalty to loss function
Formula: ŵ_MAP = argmin_w NLL(w) + λ||w||₂²
L2 norm: ||w||₂ = √(Σw_d²) = √(w^T w)
Key Effects:
1. Weight Decay:
Penalizes “peaky” weight vectors (weights far from zero)
Makes parameters smaller → drives weights closer to origin
Smooth shrinkage rather than setting to zero
2. Geometric Constraint:
Circular constraint region (vs L1’s diamond shape)
Weights driven closer to origin uniformly
3. Optimization Properties:
The regularizer alone would be minimized with all w_k,l = 0, but that gives a poor loss fit
Trade-off: Regularization (R) vs Loss (L) are antagonistic
Smooth solutions preferred over sparse ones
Practical Usage:
Global cross-validated λ (typically around 0.01)
Most common regularization in deep learning
Analogy: Similar to ridge regression applied to neural network parameters
Key Difference from L1: Shrinks weights smoothly rather than selecting features sparsely
Memory Tip: “L2 = Ridge = Smooth shrinkage toward zero”
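A matching NumPy sketch (same illustrative assumptions as the L1 example) showing why L2 acts as smooth weight decay rather than a zeroing operator:

    import numpy as np

    def l2_objective(nll, w, lam):
        """Regularized loss: NLL(w) + λ‖w‖₂²."""
        return nll + lam * np.dot(w, w)

    def sgd_step_with_weight_decay(w, grad_nll, lr=0.1, lam=0.01):
        """The gradient of λ‖w‖₂² is 2λw, so the decay term alone
        multiplies w by (1 - 2·lr·λ) each step: smooth shrinkage
        toward the origin, never exact zeros."""
        return w - lr * (grad_nll + 2.0 * lam * w)

    w = np.array([0.8, -0.05, 0.02, -1.3])
    print(sgd_step_with_weight_decay(w, grad_nll=np.zeros_like(w)))
    # → every weight scaled by 0.998; compare with L1's exact zeros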
How does regularization improve generalization and what is the relationship between complexity penalty and overfitting?
MAP Estimation Framework:
Goal: Obtain point estimate of unobserved quantity based on data
Without regularization: Pure likelihood maximization can overfit
With regularization: Balance data fit with model complexity
Regularization Formula:
L(θ; λ) = [1/N Σ_n ℓ(y_n, θ; x_n)] + λC(θ)
First term: Data likelihood (how well model fits training data)
Second term: Complexity penalty C(θ) (prior belief about θ)
λ: Controls trade-off between fit and complexity
Bayesian Interpretation:
Regularization = Prior on parameters
L(θ; λ) = -[log p(D|θ) + log p(θ)]
Incorporates prior beliefs about reasonable parameter values
Generalization Benefit:
Prevents overfitting by penalizing complex models
Reduces gradient variance → more stable minimum
Smoother loss landscape → better generalization
Trade-off: Less perfect fit on training data, better performance on test data
Key Insight: Regularization yields a smoother, more stable loss surface with lower variance across different data batches
Memory Tip: “Regularization = Smooth complexity penalty = Better generalization”
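One step the card leaves implicit: how a specific prior turns into a specific penalty. The standard MAP derivation, written in the card's own notation:

    θ_MAP = argmax_θ p(θ|D) = argmax_θ [log p(D|θ) + log p(θ)]
          = argmin_θ [NLL(θ) - log p(θ)]
    Gaussian prior p(θ) = N(0, σ²I)  ⇒  -log p(θ) = ||θ||₂²/(2σ²) + const  ⇒  L2 penalty with λ = 1/(2σ²)
    Laplace prior                    ⇒  -log p(θ) ∝ ||θ||₁                 ⇒  L1 penalty

So a stronger prior (smaller σ²) means a larger λ: the more you believe weights should stay small, the harder you penalize them.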
What is normalization in neural networks and how does it solve the internal covariate shift problem?
Problem: Internal Covariate Shift
Input distributions change after dense/conv layers
Gradients become dependent on weight initialization and scale
Training becomes unstable and slow
Solution: Normalize Activations
Zero mean and unit variance: ẑ_n = (z_n - μ_B)/√(σ²_B + ε)
Scale and shift: z̃_n = γ ⊙ ẑ_n + β (learnable parameters)
Types of Normalization:
Batch Norm: Normalize across batch dimension
Layer Norm: Normalize across feature dimension
Instance Norm: Normalize each instance separately
Group Norm: Normalize within feature groups
Key Advantages:
Reduced gradient dependence on weight initialization & scale
More stable training with higher learning rates
Keeps the input distribution to each layer consistent across training steps
Regularization effect (slight noise from batch statistics)
When Applied: Typically after linear transformations, before activation functions
Memory Tip: “Normalization = Stabilize distributions = Stable training”
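A minimal NumPy sketch of the batch-norm formula above, using training-time batch statistics only (a real layer additionally tracks running averages for test time):

    import numpy as np

    def batch_norm(z, gamma, beta, eps=1e-5):
        """z: (batch, features). Normalize each feature over the batch,
        then scale/shift with the learnable parameters gamma, beta."""
        mu = z.mean(axis=0)                    # μ_B, per feature
        var = z.var(axis=0)                    # σ²_B, per feature
        z_hat = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
        return gamma * z_hat + beta            # z̃ = γ ⊙ ẑ + β

    z = np.random.randn(32, 4) * 10 + 5        # badly scaled activations
    out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0), out.std(axis=0))   # ≈ 0 and ≈ 1 per feature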
Explain dropout regularization and how it prevents overfitting through co-adaptation.
Core Mechanism:
Stochastically remove hidden units during training
Set connections to 0 with probability p
Prevents co-adaptation between neurons
How it Works:
Training: Randomly drop units (typically p ∈ [0.3, 0.5] for CNNs)
Testing: Use all units with scaled weights, or keep dropout active for Monte Carlo (MC) dropout
Why it Prevents Overfitting:
Prevents units from co-adaptation: Forces individual neurons to be useful
Behaves like Bernoulli noise: Adds stochasticity to training
Network ensemble effect: Each training step uses different sub-network
Bayesian Interpretation:
Approximates the Bayesian posterior predictive: p(y|x, D) ≈ 1/S Σ_{s=1}^{S} p(y|x, W_s, b_s), averaging over S dropout-sampled weights
Gaussian process approximation of Bayesian predictive distribution
Uncertainty quantification through multiple forward passes
Variants:
Standard dropout: Remove hidden units
Zoneout: Stochastically preserve previous hidden states across timesteps (for RNNs)
Memory Tip: “Dropout = Random removal = Prevents co-dependence = Better generalization”
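A minimal NumPy sketch of the inverted dropout described above; the MC-dropout line uses a hypothetical predict function standing in for a full model:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(h, p, training=True):
        """Inverted dropout: zero each unit with probability p during
        training and rescale survivors by 1/(1-p), so the expected
        activation matches test time (where all units are used)."""
        if not training:
            return h
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h = rng.standard_normal((2, 5))
    print(dropout(h, p=0.5))  # roughly half the units zeroed, survivors scaled ×2

    # MC dropout at test time: keep dropout active and average S passes
    # (predict is a hypothetical model function that applies dropout inside):
    # probs = np.mean([predict(x, training=True) for _ in range(S)], axis=0)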
What are the main challenges in neural network optimization beyond basic SGD problems?
Optimization Reality:
Convergence is never guaranteed: there are no general theoretical guarantees
Gradient norms can even increase during training
Classification error keeps decreasing despite this gradient instability
Practical Solutions:
Rules of thumb and empirical experience guide training
Regularization techniques help stabilize training
Adaptive learning rates and momentum methods
Gradient clipping for exploding gradients
Key Insight: Neural network optimization is fundamentally challenging - success requires experience and heuristics, not just theory
Memory Tip: “Deep learning = Empirical art + Mathematical science”
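A minimal sketch of the gradient clipping mentioned above (the same idea PyTorch implements as torch.nn.utils.clip_grad_norm_):

    import numpy as np

    def clip_grad_norm(grads, max_norm=1.0):
        """Rescale the gradient if its global L2 norm exceeds max_norm,
        keeping its direction while bounding exploding updates."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads

    grads = [np.array([3.0, 4.0])]               # global norm 5.0
    print(clip_grad_norm(grads, max_norm=1.0))   # → [[0.6, 0.8]], norm 1.0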
What are the main approaches to hyperparameter optimization and why is systematic search important?
The Hyperparameter Explosion:
Model complexity → increased hyperparameters
Examples: Hidden units, layers, activation functions, convolution stride/filters, epochs, learning rate, batch size, optimizer, regularization, momentum…
Optimization Goal:
Monitor validation loss (not training loss)
Systematic procedures beat random guessing
Search Strategies:
1. Intuition + Grid Search (most common)
Structured exploration of parameter space
Regular grid over important parameters
Good for understood domains
2. Random Search (when no intuition)
Better than grid for high-dimensional spaces
Covers space more efficiently
Good starting point
3. Bayesian Optimization (theoretical best)
Models parameter-performance relationship
Intelligent exploration of promising regions
Most sample-efficient
4. Other Methods: Evolutionary algorithms, gradient-based optimization
Why Expertise Takes Time: Learning which hyperparameters matter most for different problems requires extensive experience
Memory Tip: “Grid for intuition, Random for exploration, Bayesian for efficiency”
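A minimal random-search sketch; train_and_validate is a hypothetical stand-in that trains a model for one configuration and returns its validation loss:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_search(n_trials=20):
        """Sample configurations at random and keep the one with the
        lowest *validation* loss (never the training loss)."""
        best_cfg, best_val = None, np.inf
        for _ in range(n_trials):
            cfg = {
                "lr": 10 ** rng.uniform(-4, -1),           # log-uniform learning rate
                "batch_size": int(rng.choice([32, 64, 128])),
                "weight_decay": 10 ** rng.uniform(-5, -2),
            }
            val_loss = train_and_validate(cfg)  # hypothetical helper
            if val_loss < best_val:
                best_cfg, best_val = cfg, val_loss
        return best_cfg, best_val

Sampling the learning rate log-uniformly rather than uniformly is the usual practice, since plausible values span several orders of magnitude.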