Topic 4: The Bias-Variance Decomposition Flashcards

1
Q

What does the variance of a model measure?

A

Sensitivity to the training data

2
Q

What is the bias of a model

A

How far the cluster of predictions is from the target
Roughly translates to a measure of the strength of the predictor

High bias -> predictions centred around a point that is not the bullseye target

3
Q

What is a joint random variable

A

A pair (x, y) drawn from the joint distribution P(x, y)
Joint = the two variables x and y considered together
A training set of n observations of (x, y) is a draw from P(x, y)^n

4
Q

What is the expected squared risk

A

ESn[ R(f) ] = ESn[ E(x,y)[ (f(x) − y)^2 ] ]

5
Q

What is ESn

A

The average (expectation) over all possible training datasets Sn

6
Q

What is E(x,y) ~ D

A

The average over all possible testing points
The random variable (x,y) follows a certain probability distribution D

7
Q

What is the bias-variance decomposition for the squared risk

A

ESn[ R(f) ] = Ex[ noise + bias + variance ]

8
Q

What is the noise term

A

Ey∣x[ (y − Ey∣x[y])^2]

An irreducible constant, independent of any model parameters
Caused by choice of data/features and not by the model

9
Q

What is the bias term

A

(ESn [f(x)] − Ey∣x[y])^2

This is the loss of the expected model against Ey|x[y]
The expected model (ESn [f(x)]) is the average response we would get if we could average over all possible training data sets

10
Q

What is the variance term

A

ESn[ ( f(x) − ESn[f(x)] )^2 ]

Compares a single prediction f(x) with the average prediction ESn[f(x)], then takes the squared average over training sets
Captures variation in f due to different training sets, varying around the expected model
If the model is too flexible -> the variance term grows large (see the sketch below)
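
A minimal sketch (assuming NumPy; the 1-D regression problem, polynomial model and noise level are all made up for illustration) of how the noise, bias and variance terms can be estimated empirically by resampling many training sets:

import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # stands in for Ey|x[y], the true conditional mean
    return np.sin(2 * np.pi * x)

def sample_training_set(n=30, noise_sd=0.3):
    # one draw of Sn from P(x, y)^n
    x = rng.uniform(0, 1, n)
    y = true_fn(x) + rng.normal(0, noise_sd, n)
    return x, y

def fit_and_predict(x_train, y_train, x_test, degree=3):
    # a simple polynomial regressor standing in for f
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coeffs, x_test)

x_test = np.linspace(0, 1, 200)   # test points over which Ex[...] is averaged
noise_sd = 0.3

# predictions of f(x) from many independently drawn training sets
preds = np.array([fit_and_predict(*sample_training_set(noise_sd=noise_sd), x_test)
                  for _ in range(500)])

expected_model = preds.mean(axis=0)                    # ESn[f(x)]
bias_term = (expected_model - true_fn(x_test)) ** 2    # (ESn[f(x)] - Ey|x[y])^2
variance_term = preds.var(axis=0)                      # ESn[(f(x) - ESn[f(x)])^2]
noise_term = noise_sd ** 2                             # Ey|x[(y - Ey|x[y])^2]

print("bias^2:", bias_term.mean())
print("variance:", variance_term.mean())
print("noise:", noise_term)
print("sum (approx. expected risk):", bias_term.mean() + variance_term.mean() + noise_term)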

11
Q

How do you reduce the bias

A

Increase the flexibility of the model
So increase the model family size

Potentially can be reduced by adding more features

12
Q

How do you reduce the noise

A

Can only be reduced by getting better-quality labelled data (not by increasing the dataset size)
It is equal to R(y*), the Bayes risk

13
Q

How do you reduce the variance

A

(Potentially)
Increasing the number of training examples
Adding some regularization to the model
Using a bagging algorithm (see the sketch below)
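
A minimal sketch (assuming scikit-learn is available; the synthetic dataset is purely illustrative) of how bagging averages many high-variance trees into a lower-variance predictor:

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

# a single deep tree: low bias but high variance
single_tree = DecisionTreeRegressor().fit(X, y)

# bagging: fit 100 trees, each on a bootstrap resample of the data,
# and average their predictions, which reduces the variance
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100).fit(X, y)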

14
Q

For which losses does the bias-variance decomposition hold

A

squared loss
cross entropy loss

15
Q

What is the relationship between bias-variance decomposition and approximation-estimation decomposition

A

They are not equal but are strongly related
The noise term is equal to the Bayes risk

16
Q

What is the most common loss function used to train neural networks

A

Cross entropy

17
Q

What does Ey∣x [y] mean

A

The average value of y, given that the input variable takes the value x

18
Q

What is f with a small circle above it

A

represents a new function, a modified version of f

19
Q

What is ℓ(y, f(x))

A

A (non-negative) loss function, evaluated at a point x, y

20
Q

What is the geometric mean

A

Represents the central tendency of a finite set of real values
Calculated by:
GM = ( x_1 · x_2 · … · x_n )^(1/n)

To normalise it: divide by some constant so the resulting distribution integrates to 1

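A small worked illustration (made-up numbers) of the formula, e.g. in Python with NumPy:

import numpy as np

x = np.array([2.0, 8.0, 4.0])      # hypothetical values
gm = np.prod(x) ** (1 / len(x))    # (x1 * x2 * ... * xn)^(1/n)
print(gm)                          # 4.0 (the arithmetic mean would be ~4.67)
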
21
Q

what is ℓtrain(f)

A

Training error

22
Q

Cross entropy vs squared risk B-V decomposition

A

For cross-entropy, the geometric mean takes the place of the arithmetic mean used for squared risk
So we no longer have an ‘expected model’ but instead a ‘centroid model’

23
Q

What happens to bias and variance as the depth of a regression tree increases

A

Bias decreases
Variance increases

24
Q

What sort of bias and variance does linear regression exhibit

A

If the true relationship is too complex, linear regression will exhibit high bias, leading to underfitting
Variance is generally low in linear regression

25
Q

How does bias relate to fitting

A

High bias -> underfitting
If a model is too simple for the data, it has high bias and underfits

26
Q

How does variance relate to fitting

A

High variance -> overfitting
The model is too complex and captures noise in the training data

27
Q

What sort of bias and variance do decision trees exhibit

A

Trees can have low bias due to complexity
They are prone to high variance
Techniques like pruning can control variance and avoid over-fitting

28
Q

What sort of bias and variance does kNN exhibit

A

Low bias, especially in complex, non-linear datasets
May suffer from high variance due to noisy data
Choosing an appropriate k value helps manage the tradeoff (see the sketch below)
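
A minimal sketch (assuming scikit-learn; synthetic data) of sweeping k: small k gives low bias but high variance, large k smooths the predictions (more bias, less variance), and cross-validation can pick a good middle ground:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

# cross-validated score for a range of k values; pick the best-scoring k
for k in (1, 5, 15, 50):
    score = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y, cv=5).mean()
    print(k, score)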

29
Q

What sort of bias and variance do neural networks exhibit

A

Can model highly complex relationships with low bias
Prone to high variance if network is too large or trained for too long

30
Q

What is the over-parameterisation ratio

A

With p parameters to learn and n training points, the over-parameterisation ratio is ρ = p/n
A model is said to be over-parameterised if ρ > 1, i.e. p > n
NOTE: ρ is the ratio, not the number of parameters p
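
For example (hypothetical numbers): a network with p = 1,000,000 weights trained on n = 50,000 examples has ρ = 1,000,000 / 50,000 = 20, so it is heavily over-parameterised.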

31
Q

For huge neural nets, what can we say about ρ

A

Often ρ ≫ 1
i.e. p ≫ n

32
Q

What is monotonic?

A

Something that changes in only one direction (always non-increasing or always non-decreasing), e.g. bias as model complexity increases

33
Q

What is non monotonic

A

Something that does not move in a single direction; it can both increase and decrease depending on the situation
E.g. variance in deep neural networks

34
Q

What is “Double descent”

A

The classic picture is a U shape: risk decreases to a sweet spot and then increases again due to overfitting
For deep neural networks, as complexity keeps increasing past this point, the risk then drops again (the second descent)

This is thought to be because the networks are implicitly regularised by stochastic gradient descent (not fully understood)

35
Q

Does the bias variance decomposition hold for all losses

A

NO
e.g. it does not hold for the 0/1 loss

36
Q

What does it generally mean when a model has low bias

A

complex
flexibility - have enough capacity to fit the training data closely, often resulting in low error on the training set
few assumptions - make fewer assumptions about the underlying data distribution, allowing them to learn complex functions

37
Q

What does it generally mean when a model has high bias

A

simple
underfitting
not enough parameters to capture data

38
Q

What does it generally mean when a model has low variance

A

consistent - often produce similar predictions across different datasets
robust - less sensitive to small fluctuations or noise in the training data
model’s ability to generalize from the training data to unseen data is strong - captures underlying patterns

39
Q

What does it generally mean when a model has high variance

A

overfitting
sensitive to noise
poor generalisation - perform badly on unseen data

40
Q

What can be said of variance in the bv decomposition

A

Variance is independent of y

41
Q

what can be said of bias in the bv decomposition

A

It is the loss of the predictor q̊ = argmin_{q ∈ Y} E_D[ ℓ(z, q) ]
It is not dependent on any particular training set

42
Q

what is Ey|x[y]

A

The average true label (for a given input x) if we had perfect knowledge of the underlying distribution of labels

43
Q

how do we control the bias variance tradeoff in linear regression

A

With L2 regularisation (ridge regression); see the sketch below
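
A minimal sketch (assuming scikit-learn; the alpha value is just illustrative) of L2-regularised linear regression:

from sklearn.linear_model import Ridge

# alpha controls the strength of the L2 penalty on the weights:
# alpha -> 0 recovers ordinary least squares (lower bias, higher variance),
# larger alpha shrinks the weights (more bias, less variance)
model = Ridge(alpha=1.0)
# model.fit(X_train, y_train)   # X_train / y_train: whatever training data is at hand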

44
Q

what techniques help reduce variance in neural networks

A

Dropout, early stopping and implicit regularization help manage variance (see the sketch below)
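
A minimal sketch (assuming PyTorch; the layer sizes and dropout rate are illustrative) of a network that uses dropout to help control variance; early stopping would simply halt training once validation loss stops improving:

import torch.nn as nn

# dropout randomly zeroes activations during training,
# acting as a regulariser that reduces the variance of the fitted network
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # drop 50% of hidden activations at train time
    nn.Linear(64, 1),
)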