Topic 5: Regularisation & Hyperoptimisation Flashcards
(16 cards)
What are the three dataset partitions in deep learning, and what are their roles?
Train, validation and test
using the training data for online learning:
- use the validation set for monitoring, and for hyperparameter tuning
- shuffle the training set → successive training examples rarely belong to the same class
- present samples that produce a large error more frequently
- check for catastrophic forgetting
basic assumption: training and validation data from same distribution:
- covariate shift:
- training and test input distributions are different but target function remains unchanged → weak extrapolation
- possible solutions: assigning weights to data based on “importance” → rebalance the data distributions
high bias: add more or better units/features
high variance: get more data
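A minimal sketch of the three-way split and of shuffling the training set, assuming PyTorch and a toy tensor dataset (all sizes and names here are illustrative, not from the card):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Toy dataset: 1000 samples, 20 features, 3 classes (illustrative values)
X = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))
dataset = TensorDataset(X, y)

# Three partitions: train for learning, validation for monitoring and
# hyperparameter tuning, test for the final untouched evaluation
train_set, val_set, test_set = random_split(dataset, [700, 150, 150])

# shuffle=True reshuffles the training set every epoch, so successive
# examples rarely belong to the same class; val/test stay unshuffled
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```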
Explain the concepts of underfitting and overfitting. What causes them, and how can we detect and address each?
underfitting - we have a model that doesn’t fit our problem well
- the model performs poorly
- the training set error is significantly larger than the error expected of an ideal model
remedy:
- increase the model’s complexity (more parameters)
- add more features
- train for a longer time
overfitting - the model has learned an overly specific view of the training data and therefore generalises poorly
- 0% training error can be reached by memorisation
- memorisation ≠ generalisation
- all noise is captured
- too many parameters
remedy:
- more training data
- reducing the model’s complexity
- cross validation, early stopping, regularisation
What is early stopping, how is it implemented, and why is it effective?
we use the validation set for this: independent data for deciding when to stop. we stop when the validation loss has improved by less than a minimum amount during the last patience epochs (e.g. patience = 5 epochs).
it helps prevent overfitting. it can be better to stop before the validation loss starts getting worse (e.g. stop at 25 epochs instead of training on); the validation data tells us when that point is reached
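A minimal early-stopping sketch in Python; `train_one_epoch` and `validate` are hypothetical placeholders, and the patience/min_delta values are illustrative:

```python
def fit_with_early_stopping(model, train_one_epoch, validate,
                            max_epochs=100, patience=5, min_delta=1e-4):
    """Stop when validation loss has not improved by min_delta for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical training step
        val_loss = validate(model)             # hypothetical validation pass
        if val_loss < best_loss - min_delta:   # improved by at least min_delta
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stopping early at epoch {epoch}: "
                  f"no improvement for {patience} epochs")
            break
    return model
```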
What is L1?
L1 regularisation adds a penalty on the absolute values of the weights, encouraging sparse solutions: many weights become exactly zero:
$L_{reg}(w) = L(w) + \lambda \|w\|_1 = L(w) + \lambda \sum_{k=1}^{d} |w_k|$
https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
key features:
- it ignores noisy inputs
- has a built-in feature selection
- $w_{k,l} = 0$, discards an input $x_k$
- it leads to many zeros → it creates sparse weight vectors
- Similar to LASSO regression
w = weights, x = input, d = dimension of the weight vector
L1 regularisation results
L1 regularisation encourages many weights to become exactly 0, and this causes:
- connections get pruned from the network
- Many neurons stop receiving or sending signals
- The network becomes simpler
in modern neural network frameworks:
- tensors are evaluated in parallel, so pruning individual connections doesn’t make computation faster
- sparse matrices are not necessarily cheaper or faster than dense ones
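A minimal sketch of adding the L1 penalty to a training loss; PyTorch is my assumption here (the cards don't name a framework), and the model and λ values are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 3)                 # toy model (illustrative sizes)
criterion = nn.CrossEntropyLoss()
lam = 1e-3                               # λ: regularisation strength

x = torch.randn(32, 20)
target = torch.randint(0, 3, (32,))

data_loss = criterion(model(x), target)
# L1 penalty: λ * sum of absolute weight values; drives many weights to exactly 0
l1_penalty = lam * sum(p.abs().sum() for p in model.parameters())
loss = data_loss + l1_penalty
loss.backward()
```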
What are the differences between L1 and L2?
Comparison:
- Penalty type: L1 uses absolute values; L2 uses squared values
- Sparsity: L1 yes (many weights become exactly zero); L2 no (all weights shrink)
- Optimisation: L1 can lead to non-differentiability at 0; L2 is smooth everywhere
- Use case: L1 for feature selection; L2 for smooth generalisation
What is L2?
L2 regularisation (a.k.a. Tikhonov regularisation) adds a squared penalty to the loss:
$L_{reg}(w) = L(w) + \lambda \|w\|_2^2 = L(w) + \lambda \sum_{k=1}^{d} w_k^2$
https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
- it penalises weights that are far from 0 more strongly (the penalty grows quadratically), so even fairly small weights keep being pushed towards 0 and large, peaking values are penalised hardest
- it encourages smaller weights, and leads to weight decay
- Similar to Ridge regression
the penalty term alone has its optimum when all $w_{k,l} = 0$ (all weights are 0), but that also leads to a bad (high) loss, because the model can’t learn anything from the data → predictions are meaningless; the regularised loss trades the two terms off against each other
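A minimal L2 sketch, again assuming PyTorch (framework and sizes are my choices, not the card's); it shows the explicit squared penalty and the optimiser's built-in weight-decay shortcut:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()
lam = 1e-3                               # λ: regularisation strength

x = torch.randn(32, 20)
target = torch.randint(0, 3, (32,))

# Explicit L2 penalty: λ * sum of squared weights; all weights shrink towards 0
l2_penalty = lam * sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(x), target) + l2_penalty
loss.backward()

# Optimisers typically expose the same idea as a built-in weight_decay option:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```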
What are some other ways of regularisation?
Dropout = Stochastic removal of hidden units (stochastic = Random but with a controlled probability distribution.)
you set some connections/neurons to 0 with a given probability, e.g. a neuron has a 50% chance of being “turned off” in each forward pass
it prevents dependency, and makes the neurons more independent (avoid co-adaptation)
it behaves like bernoulli noise (Each neuron is multiplied by a random variable drawn from a Bernoulli distribution)
there are variants to this:
- dropout at test time: Monte Carlo (MC) dropout
- Perform multiple forward passes with dropout enabled
- You run the network multiple times with different dropout masks, then average the outputs to approximate the true distribution over predictions.
- Instead of doing exact Bayesian inference (which is often very hard or impossible in deep learning), we can use a Gaussian process (GP) as an approximation to model the uncertainty in predictions.
zoneout: stochastically removes timesteps
- At each timestep, some hidden units are randomly “frozen” (i.e., not updated).
- Instead of computing a new value for that hidden unit, we just reuse the value from the previous timestep.
- This happens with a certain probability p, just like dropout.
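A minimal sketch of the Bernoulli-mask view of dropout, in plain PyTorch; the 1/(1−p) rescaling shown here is the common “inverted dropout” convention, which is my assumption rather than something stated on the card:

```python
import torch

def dropout_forward(h, p=0.5, training=True):
    """Drop each unit of activation h with probability p (Bernoulli noise)."""
    if not training or p == 0.0:
        return h
    keep_prob = 1.0 - p
    mask = torch.bernoulli(torch.full_like(h, keep_prob))  # 1 = keep, 0 = drop
    # Inverted dropout: rescale kept units so the expected activation
    # matches the no-dropout network used at test time
    return h * mask / keep_prob

h = torch.randn(4, 8)              # toy layer activations
print(dropout_forward(h, p=0.5))
```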
What is batch normalization, and how does it address internal covariate shift?
by normalising activations (the outputs of a layer) so that they have:
- zero mean
- unit variance
this combats the “internal covariate shift”: The change in the distribution of activations (inputs to a layer) during training, as the parameters of previous layers change.
normalising works e.g. by:
- computing the mini-batch mean
- computing the mini-batch variance
- normalising (subtract the mean, divide by the standard deviation)
- scaling and shifting (learnable parameters)
advantages:
- reduced dependence of gradients on weight initialisation and scaling
- stabilises the input distribution of subsequent layers, e.g. when applied after dense or convolutional layers
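A minimal sketch of the four batch-norm steps on one mini-batch, in plain PyTorch; γ, β and ε here are illustrative:

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                         # 1. mini-batch mean (per feature)
    var = x.var(dim=0, unbiased=False)           # 2. mini-batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)   # 3. normalise to zero mean / unit variance
    return gamma * x_hat + beta                  # 4. scale and shift (learnable γ, β)

x = torch.randn(32, 10)                          # mini-batch of 32 samples, 10 features
gamma, beta = torch.ones(10), torch.zeros(10)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(dim=0), out.var(dim=0, unbiased=False))  # ≈ 0 and ≈ 1 per feature
```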
What is the role of parameter initialization in deep learning, and how do initialization schemes like Xavier or He help?
Improper weight initialisation leads to vanishing or exploding gradients, making learning unstable or impossible.
Common schemes:
- LeCun initialisation: $\sigma^2 = 1 / n_{in}$, best for sigmoid activations
- He initialisation: $\sigma^2 = 2 / n_{in}$, designed for ReLU activations
- Xavier/Glorot initialisation: $\sigma^2 = 2 / (n_{in} + n_{out})$, balances signal variance between input and output layers; good for tanh
The goal is to preserve variance of activations across layers so that signal doesn’t explode or vanish.
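A minimal sketch drawing weights with the three variances above, in plain PyTorch (layer sizes are illustrative):

```python
import torch

n_in, n_out = 256, 128   # fan-in / fan-out of a dense layer (illustrative)

w_lecun  = torch.randn(n_out, n_in) * (1.0 / n_in) ** 0.5            # sigma^2 = 1 / n_in
w_he     = torch.randn(n_out, n_in) * (2.0 / n_in) ** 0.5            # sigma^2 = 2 / n_in (ReLU)
w_xavier = torch.randn(n_out, n_in) * (2.0 / (n_in + n_out)) ** 0.5  # sigma^2 = 2 / (n_in + n_out)

# Each tensor's empirical std should match the scheme's target sigma
print(w_lecun.std(), w_he.std(), w_xavier.std())
```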
How can we find quantitative characteristics of the trained model?
Comparing metrics averaged over test set
- different models
- different data sets
- different hyperparameter settings
Sometimes metrics are computed class-specifically.
Metrics are task-specific & not always meaningful
How can we find qualitative characteristics of the trained model?
Visualise your input and output!
Picking cherries and lemons
- Representative good and bad results
- Results for edge cases
- Peculiar results (where we have an explanation or no clue)
What are the major challenges in neural network optimization, especially with SGD?
SGD-specific issues:
- Ill-conditioned gradients: gradients vary in scale → unstable updates
- Plateaus: flat loss regions → slow learning
- Saddle points: gradients = 0 but not a minimum → training stalls
- Noisy gradients: mini-batches provide noisy estimates
Deep architecture issues:
- Exploding gradients: large updates destabilize training
- Vanishing gradients: small updates → no learning in early layers
- Cliffs: sharp areas in loss → instability
- Long-term dependencies: in RNNs, early signals get lost
Solutions:
- Proper initialisation of the weights
- BatchNorm: a normalisation technique used to make training of artificial neural networks faster and more stable by adjusting the inputs to each layer, re-centering them around zero and re-scaling them to unit variance
- Gradient clipping
- Using adaptive optimisers (Adam, RMSprop)
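A minimal sketch of gradient clipping inside one training step, assuming PyTorch (`clip_grad_norm_` is its standard utility; the model, data and max-norm value are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x, target = torch.randn(32, 20), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
# Rescale gradients so their global norm is at most 1.0; this limits the
# huge updates caused by exploding gradients and "cliffs" in the loss
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```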
What are meta-/hyperparameters in deep learning, and how do we optimize them?
with increased model complexity comes an increase in the number of hyperparameters
- hidden units, number of layers, activation function, convolution (stride/filters), epochs, learning rate, batch size, optimiser, regularisation, momentum
we optimise these by monitoring the validation loss
systematic procedures:
- Intuition + Grid Search (most common)
- Random Search (if we have no idea)
- Bayesian Optimization (best in theory)
- Evolutionary Algorithms, Gradient-based, ..
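A minimal grid-search sketch over two hyperparameters, picking the setting with the lowest validation loss; `train_and_validate` is a hypothetical placeholder and the candidate values are illustrative:

```python
import itertools

def train_and_validate(learning_rate, batch_size):
    # Hypothetical placeholder: train a model with these settings and
    # return its validation loss (a dummy value keeps the sketch runnable)
    return 0.0

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [32, 64, 128]

best_loss, best_settings = float("inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    val_loss = train_and_validate(learning_rate=lr, batch_size=bs)
    if val_loss < best_loss:
        best_loss, best_settings = val_loss, {"learning_rate": lr, "batch_size": bs}

print("best settings:", best_settings)
```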
How can we find characteristics of training?
Observations:
- “Wobbly”, oscillations
- Spikes somewhat periodic
- Converges but each wrong decision takes a short time to correct
Possible explanation:
- Online learning with unshuffled data
- Possibly imbalanced
Things to try:
- Shuffle training samples
- Increase batch size
- Lower learning rate (decay)
Describe the L1 regularisation. Give us the formula
L1 regularisation is a technique used to prevent overfitting in machine learning by adding a penalty on the absolute values of model parameters (weights).
Given a loss function (e.g., Negative Log-Likelihood or MSE), L1 regularisation adds a penalty term: $L_{reg}(w) = L(w) + \lambda \|w\|_1$ (both formulas) https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
- $\|w\|_1$ is the L1 norm = the sum of the absolute values of the weights
- $\lambda$ is a hyperparameter that controls the strength of the penalty
L1 regularization shrinks weights.
But more importantly, it often drives some weights exactly to zero.
This leads to sparse models, meaning some features are “ignored” (i.e., their weight is 0).
Explain how dropout works as a regularisation technique and its Bayesian interpretation.
Dropout randomly sets a subset of activations in a layer to zero during training:
With probability p, a neuron is dropped (its output set to 0)
This prevents co-adaptation of neurons and encourages independence
At test time, multiply the weights by (1−p) so that the expected output matches what the network produced during training (with dropout active)
Bayesian interpretation:
Dropout can be viewed as an approximation to Bayesian inference over model weights
Monte Carlo Dropout (MC Dropout): at test time, apply dropout and average multiple predictions to estimate predictive uncertainty
https://docs.google.com/document/d/1xKWJCFlM3KlLbzrweY0kSmJIIfHdU_xZH6IHTHEVWGQ/edit?tab=t.0
Zoneout: a variation used in RNNs where some hidden states are copied from previous steps instead of recomputed
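A minimal MC-dropout sketch, assuming PyTorch: dropout is kept active at test time, several stochastic forward passes are run, and their mean and spread give a prediction and a rough uncertainty estimate (model sizes and the number of passes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 3))
x = torch.randn(8, 20)                  # toy test batch

model.train()                           # keeps Dropout active at test time (MC dropout)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])  # 50 stochastic passes

mean_prediction = samples.mean(dim=0)   # averaged prediction
uncertainty = samples.std(dim=0)        # spread ≈ predictive uncertainty
print(mean_prediction.shape, uncertainty.shape)
```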