Topic 5: Regularisation & Hyperoptimisation Flashcards
(16 cards)
What are the three dataset partitions in deep learning, and what are their roles?
Train, validation and test
using the training data for online learning:
- use the validation set for monitoring, and for hyperparameter tuning
- shuffle the training set → successive training examples rarely belong to the same class
- present samples that produce a large error more frequently
- check for catastrophic forgetting
basic assumption: training and validation data from same distribution:
- covariate shift:
- training and test input distributions are different but target function remains unchanged → weak extrapolation
- possible solutions: assigning weights to data based on “importance” → rebalance the data distributions
high bias: add more or better units/features
high variance: get more data
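A minimal sketch of the three-way split and of shuffling the training set, assuming PyTorch and a toy tensor dataset (all sizes and names here are illustrative, not from the card):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Toy dataset: 1000 samples, 20 features, 3 classes (illustrative values)
X = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))
dataset = TensorDataset(X, y)

# Three partitions: train for learning, validation for monitoring and
# hyperparameter tuning, test for the final untouched evaluation
train_set, val_set, test_set = random_split(dataset, [700, 150, 150])

# shuffle=True reshuffles the training set every epoch, so successive
# examples rarely belong to the same class; val/test stay unshuffled
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```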
Explain the concepts of underfitting and overfitting. What causes them, and how can we detect and address each?
underfitting - we have a model that doesn’t fit our problem well
- the model performs poorly
- the training set error is significantly larger than the error expected of an ideal model
remedy:
- increase the model’s complexity (more parameters)
- add more features
- train for a longer time
overfitting - the model has learned an overly specific view of the training data and therefore generalises poorly
- 0% training error can be reached by memorisation
- memorisation ≠ generalisation
- all noise is captured
- too many parameters
remedy:
- more training data
- reducing the model’s complexity
- cross validation, early stopping, regularisation
What is early stopping, how is it implemented, and why is it effective?
we use the validation set for this: independent data for deciding when to stop. we stop when the validation loss has improved by less than a minimum amount during the last patience epochs (e.g. patience = 5 epochs).
it helps prevent overfitting. it can be better to stop before the validation loss starts getting worse (e.g. stop at 25 epochs instead of training on); the validation data tells us when that point is reached
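A minimal early-stopping sketch in Python; `train_one_epoch` and `validate` are hypothetical placeholders, and the patience/min_delta values are illustrative:

```python
def fit_with_early_stopping(model, train_one_epoch, validate,
                            max_epochs=100, patience=5, min_delta=1e-4):
    """Stop when validation loss has not improved by min_delta for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical training step
        val_loss = validate(model)             # hypothetical validation pass
        if val_loss < best_loss - min_delta:   # improved by at least min_delta
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stopping early at epoch {epoch}: "
                  f"no improvement for {patience} epochs")
            break
    return model
```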
What is L1?
L1 regularisation adds a penalty on the absolute values of the weights, encouraging sparse solutions: many weights become exactly zero:
$L_{reg}(w) = L(w) + \lambda \|w\|_1 = L(w) + \lambda \sum_{k=1}^{d} |w_k|$
https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
key features:
- it ignores noisy inputs
- has a built-in feature selection
- $w_{k,l} = 0$, discards an input $x_k$
- it leads to many zeros → it creates sparse weight vectors
- Similar to LASSO regression
w = weights, x = input, d = dimension of the weight vector
L1 regularisation results
L1 regularisation encourages many weights to become exactly 0, and this causes:
- connections get pruned from the network
- Many neurons stop receiving or sending signals
- The network becomes simpler
in modern neural network frameworks:
- tensors are evaluated in parallel, so pruning individual connections doesn’t make computation faster
- sparse matrices are not necessarily cheaper or faster than dense ones
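A minimal sketch of adding the L1 penalty to a training loss; PyTorch is my assumption here (the cards don't name a framework), and the model and λ values are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 3)                 # toy model (illustrative sizes)
criterion = nn.CrossEntropyLoss()
lam = 1e-3                               # λ: regularisation strength

x = torch.randn(32, 20)
target = torch.randint(0, 3, (32,))

data_loss = criterion(model(x), target)
# L1 penalty: λ * sum of absolute weight values; drives many weights to exactly 0
l1_penalty = lam * sum(p.abs().sum() for p in model.parameters())
loss = data_loss + l1_penalty
loss.backward()
```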
What are the differences between L1 and L2?
Comparison:
- Penalty type: L1 uses absolute values; L2 uses squared values
- Sparsity: L1 yes (many weights become exactly zero); L2 no (all weights shrink)
- Optimisation: L1 can lead to non-differentiability at 0; L2 is smooth everywhere
- Use case: L1 for feature selection; L2 for smooth generalisation
What is L2?
L2 regularisation (a.k.a. Tikhonov regularisation) adds a squared penalty to the loss:
$L_{reg}(w) = L(w) + \lambda \|w\|_2^2 = L(w) + \lambda \sum_{k=1}^{d} w_k^2$
https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
- it penalises weights that are far from 0 more strongly (the penalty grows quadratically), so even fairly small weights keep being pushed towards 0 and large, peaking values are penalised hardest
- it encourages smaller weights, and leads to weight decay
- Similar to Ridge regression
the penalty term alone has its optimum when all $w_{k,l} = 0$ (all weights are 0), but that also leads to a bad (high) loss, because the model can’t learn anything from the data → predictions are meaningless; the regularised loss trades the two terms off against each other
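A minimal L2 sketch, again assuming PyTorch (framework and sizes are my choices, not the card's); it shows the explicit squared penalty and the optimiser's built-in weight-decay shortcut:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()
lam = 1e-3                               # λ: regularisation strength

x = torch.randn(32, 20)
target = torch.randint(0, 3, (32,))

# Explicit L2 penalty: λ * sum of squared weights; all weights shrink towards 0
l2_penalty = lam * sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(x), target) + l2_penalty
loss.backward()

# Optimisers typically expose the same idea as a built-in weight_decay option:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```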
What are some other ways of regularisation?
Dropout = Stochastic removal of hidden units (stochastic = Random but with a controlled probability distribution.)
you set some connections/neurons to 0 with a given probability, e.g. a neuron has a 50% chance of being “turned off” in each forward pass
it prevents dependency, and makes the neurons more independent (avoid co-adaptation)
it behaves like bernoulli noise (Each neuron is multiplied by a random variable drawn from a Bernoulli distribution)
there are variants to this:
- dropout at test time: Monte Carlo (MC) dropout
- Perform multiple forward passes with dropout enabled
- You run the network multiple times with different dropout masks, then average the outputs to approximate the true distribution over predictions.
- Instead of doing exact Bayesian inference (which is often very hard or impossible in deep learning), we can use a Gaussian process (GP) as an approximation to model the uncertainty in predictions.
zoneout: stochastically removes timesteps
- At each timestep, some hidden units are randomly “frozen” (i.e., not updated).
- Instead of computing a new value for that hidden unit, we just reuse the value from the previous timestep.
- This happens with a certain probability p, just like dropout.
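A minimal sketch of the Bernoulli-mask view of dropout, in plain PyTorch; the 1/(1−p) rescaling shown here is the common “inverted dropout” convention, which is my assumption rather than something stated on the card:

```python
import torch

def dropout_forward(h, p=0.5, training=True):
    """Drop each unit of activation h with probability p (Bernoulli noise)."""
    if not training or p == 0.0:
        return h
    keep_prob = 1.0 - p
    mask = torch.bernoulli(torch.full_like(h, keep_prob))  # 1 = keep, 0 = drop
    # Inverted dropout: rescale kept units so the expected activation
    # matches the no-dropout network used at test time
    return h * mask / keep_prob

h = torch.randn(4, 8)              # toy layer activations
print(dropout_forward(h, p=0.5))
```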
What is batch normalization, and how does it address internal covariate shift?
by normalising activations (the outputs of a layer) so that they have:
- zero mean
- unit variance
this combats the “internal covariate shift”: The change in the distribution of activations (inputs to a layer) during training, as the parameters of previous layers change.
normalising works e.g. by:
- computing the mini-batch mean
- computing the mini-batch variance
- normalising (subtract the mean, divide by the standard deviation)
- scaling and shifting (learnable parameters)
advantages:
- reduced dependence of gradients on weight initialisation and scaling
- stabilises the input distribution of subsequent layers, e.g. when applied after dense or convolutional layers
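A minimal sketch of the four batch-norm steps on one mini-batch, in plain PyTorch; γ, β and ε here are illustrative:

```python
import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                         # 1. mini-batch mean (per feature)
    var = x.var(dim=0, unbiased=False)           # 2. mini-batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)   # 3. normalise to zero mean / unit variance
    return gamma * x_hat + beta                  # 4. scale and shift (learnable γ, β)

x = torch.randn(32, 10)                          # mini-batch of 32 samples, 10 features
gamma, beta = torch.ones(10), torch.zeros(10)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(dim=0), out.var(dim=0, unbiased=False))  # ≈ 0 and ≈ 1 per feature
```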
What is the role of parameter initialization in deep learning, and how do initialization schemes like Xavier or He help?
Improper weight initialisation leads to vanishing or exploding gradients, making learning unstable or impossible.
Common schemes:
- LeCun initialisation: $\sigma^2 = 1 / n_{in}$, best for sigmoid activations
- He initialisation: $\sigma^2 = 2 / n_{in}$, designed for ReLU activations
- Xavier/Glorot initialisation: $\sigma^2 = 2 / (n_{in} + n_{out})$, balances signal variance between input and output layers; good for tanh
The goal is to preserve variance of activations across layers so that signal doesn’t explode or vanish.
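A minimal sketch drawing weights with the three variances above, in plain PyTorch (layer sizes are illustrative):

```python
import torch

n_in, n_out = 256, 128   # fan-in / fan-out of a dense layer (illustrative)

w_lecun  = torch.randn(n_out, n_in) * (1.0 / n_in) ** 0.5            # sigma^2 = 1 / n_in
w_he     = torch.randn(n_out, n_in) * (2.0 / n_in) ** 0.5            # sigma^2 = 2 / n_in (ReLU)
w_xavier = torch.randn(n_out, n_in) * (2.0 / (n_in + n_out)) ** 0.5  # sigma^2 = 2 / (n_in + n_out)

# Each tensor's empirical std should match the scheme's target sigma
print(w_lecun.std(), w_he.std(), w_xavier.std())
```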
How can we find quantitative characteristics of the trained model?
Comparing metrics averaged over test set
- different models
- different data sets
- different hyperparameter settings
Sometimes metrics are computed class-specifically.
Metrics are task-specific & not always meaningful
How can we find qualitative characteristics of the trained model?
Visualise your input and output!
Picking cherries and lemons
- Representative good and bad results
- Results for edge cases
- Peculiar results (where we have an explanation or no clue)
What are the major challenges in neural network optimization, especially with SGD?
SGD-specific issues:
- Ill-conditioned gradients: gradients vary in scale → unstable updates
- Plateaus: flat loss regions → slow learning
- Saddle points: gradients = 0 but not a minimum → training stalls
- Noisy gradients: mini-batches provide noisy estimates
Deep architecture issues:
- Exploding gradients: large updates destabilize training
- Vanishing gradients: small updates → no learning in early layers
- Cliffs: sharp areas in loss → instability
- Long-term dependencies: in RNNs, early signals get lost
Solutions:
- Proper initialisation of the weights
- BatchNorm: a normalisation technique used to make training of artificial neural networks faster and more stable by adjusting the inputs to each layer, re-centering them around zero and re-scaling them to unit variance
- Gradient clipping
- Using adaptive optimisers (Adam, RMSprop)
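A minimal sketch of gradient clipping inside one training step, assuming PyTorch (`clip_grad_norm_` is its standard utility; the model, data and max-norm value are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x, target = torch.randn(32, 20), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
# Rescale gradients so their global norm is at most 1.0; this limits the
# huge updates caused by exploding gradients and "cliffs" in the loss
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```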
What are meta-/hyperparameters in deep learning, and how do we optimize them?
with increased model complexity comes an increase in the number of hyperparameters
- hidden units, number of layers, activation function, convolution (stride/filters), epochs, learning rate, batch size, optimiser, regularisation, momentum
we optimise these by monitoring the validation loss
systematic procedures:
- Intuition + Grid Search (most common)
- Random Search (if we have no idea)
- Bayesian Optimization (best in theory)
- Evolutionary Algorithms, Gradient-based, ..
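A minimal grid-search sketch over two hyperparameters, picking the setting with the lowest validation loss; `train_and_validate` is a hypothetical placeholder and the candidate values are illustrative:

```python
import itertools

def train_and_validate(learning_rate, batch_size):
    # Hypothetical placeholder: train a model with these settings and
    # return its validation loss (a dummy value keeps the sketch runnable)
    return 0.0

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [32, 64, 128]

best_loss, best_settings = float("inf"), None
for lr, bs in itertools.product(learning_rates, batch_sizes):
    val_loss = train_and_validate(learning_rate=lr, batch_size=bs)
    if val_loss < best_loss:
        best_loss, best_settings = val_loss, {"learning_rate": lr, "batch_size": bs}

print("best settings:", best_settings)
```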
How can we find characteristics of training?
Observations:
- “Wobbly”, oscillations
- Spikes somewhat periodic
- Converges but each wrong decision takes a short time to correct
Possible explanation:
- Online learning with unshuffled data
- Possibly imbalanced
Things to try:
- Shuffle training samples
- Increase batch size
- Lower learning rate (decay)
Describe the L1 regularisation. Give us the formula
L1 regularisation is a technique used to prevent overfitting in machine learning by adding a penalty on the absolute values of model parameters (weights).
Given a loss function (e.g., Negative Log-Likelihood or MSE), L1 regularisation adds a penalty term: $L_{reg}(w) = L(w) + \lambda \|w\|_1$ (both formulas) https://docs.google.com/document/d/1-ScjmBUOUDy1w0yxU2KCRwGlGty3tsqqMM42DYhY6B4/edit?tab=t.0
- $\|w\|_1$ is the L1 norm = the sum of the absolute values of the weights
- $\lambda$ is a hyperparameter that controls the strength of the penalty
L1 regularization shrinks weights.
But more importantly, it often drives some weights exactly to zero.
This leads to sparse models, meaning some features are “ignored” (i.e., their weight is 0).
Explain how dropout works as a regularisation technique and its Bayesian interpretation.
Dropout randomly sets a subset of activations in a layer to zero during training:
With probability p, a neuron is dropped (its output set to 0)
This prevents co-adaptation of neurons and encourages independence
At test time, multiply the weights by (1−p) so that the expected output matches what the network produced during training (with dropout active)
Bayesian interpretation:
Dropout can be viewed as an approximation to Bayesian inference over model weights
Monte Carlo Dropout (MC Dropout): at test time, apply dropout and average multiple predictions to estimate predictive uncertainty
https://docs.google.com/document/d/1xKWJCFlM3KlLbzrweY0kSmJIIfHdU_xZH6IHTHEVWGQ/edit?tab=t.0
Zoneout: a variation used in RNNs where some hidden states are copied from previous steps instead of recomputed
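A minimal MC-dropout sketch, assuming PyTorch: dropout is kept active at test time, several stochastic forward passes are run, and their mean and spread give a prediction and a rough uncertainty estimate (model sizes and the number of passes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 3))
x = torch.randn(8, 20)                  # toy test batch

model.train()                           # keeps Dropout active at test time (MC dropout)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])  # 50 stochastic passes

mean_prediction = samples.mean(dim=0)   # averaged prediction
uncertainty = samples.std(dim=0)        # spread ≈ predictive uncertainty
print(mean_prediction.shape, uncertainty.shape)
```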