Topic 2: Empirical Risk Minimisation Flashcards

1
Q

What is F

A

The chosen family of models for a problem

2
Q

How is F related to Ω

A

F is always a subset of Ω, because in practice we always have a restricted choice of models

3
Q

What is Ω

A

The space of all possible models

4
Q

What is Empirical Risk minimisation generally

A

A mathematical framework for choosing a model by minimising its average loss on the training data; it also provides a theory for understanding over-fitting

5
Q

What is the input space

A

X = R^d
Real-valued vectors of dimension d

6
Q

What is the output space

A

Y = R^k
Real-valued vectors of dimension k

7
Q

How is a datapoint denoted

A

(x, y) ∈ R^d × R^k

8
Q

What is IID

A

“Independent and Identically distributed”
Each datapoint (x,y) is assumed to be an independent sample from the distribution D = P(x,y)

9
Q

What is Sn

A

The overall dataset
Assumed to be a sample from the joint distribution P(x,y)^n (n independent draws from P(x,y))

10
Q

Loss function mathematical denotation

A

l: R^k × R^k → [0, ∞)
It takes two labels, the true label and the model prediction, and returns a loss value from 0 upwards (not including infinity)
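
As a minimal sketch of this signature, here are two common losses written for scalar labels (k = 1). The function names are illustrative choices, not taken from the course material; both take the true label first and return a non-negative value.

```python
def squared_loss(y_true, y_pred):
    """Squared loss for scalar labels: (y - f(x))^2, always >= 0."""
    return (y_true - y_pred) ** 2


def zero_one_loss(y_true, y_pred):
    """0/1 loss: 0 if the prediction matches the true label, else 1."""
    return 0 if y_true == y_pred else 1


print(squared_loss(3.0, 2.5))       # 0.25
print(zero_one_loss("cat", "dog"))  # 1
```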

11
Q

What is always the first argument of the loss function

A

The true label
i.e. l(y, f(x))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is empirical risk minimiser

A

f_erm = arginf_{f ∈ F} R_hat(f, S_n)

The model obtained if we could find the global minimum of the empirical risk on a data sample S_n, when we are restricted to the model family F
(R_hat(f, S_n) is the empirical risk: an estimate of the risk computed from an IID sample of size n)
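
A minimal sketch of ERM, assuming a toy sample and a tiny finite family of lines y = w * x (both hypothetical, chosen so the search can be done by brute force). With a finite family the arginf is just the member with the smallest average loss.

```python
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def empirical_risk(f, sample, loss):
    """Average loss of model f over the sample S_n = [(x, y), ...]."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)


# Hypothetical toy sample S_n (n = 4) and model family F of lines y = w * x.
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
family = {w: (lambda x, w=w: w * x) for w in (0.0, 1.0, 2.0, 3.0)}

# ERM: pick the member of F with the smallest empirical risk.
w_erm = min(family, key=lambda w: empirical_risk(family[w], sample, squared_loss))
print(w_erm)  # 2.0
```

Real model families are usually infinite, so the exhaustive search here stands in for whatever optimiser is used in practice.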

13
Q

What is population risk

A

The true risk: the expected loss over the whole data distribution, including all possible data we may never encounter
Think of it as the empirical risk as n goes to infinity
Also known as the ‘generalisation error’: it expresses the error that a model f would make, on average, over all possible inputs
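
In practice the distribution is unknown, but a sketch with a made-up discrete P(x, y) shows the difference between the two risks. Everything here (the joint table, the model f, the sample size) is a hypothetical choice for illustration.

```python
import random

# Hypothetical joint distribution P(x, y) over x, y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def f(x):
    """A fixed model that predicts y = x."""
    return x


def zero_one_loss(y_true, y_pred):
    return 0 if y_true == y_pred else 1


# Population risk: expected loss under the full distribution.
population_risk = sum(p * zero_one_loss(y, f(x)) for (x, y), p in joint.items())

# Empirical risk: average loss on an IID sample of size n = 10,000.
random.seed(0)
pairs, weights = zip(*joint.items())
sample = random.choices(pairs, weights=weights, k=10_000)
empirical_risk = sum(zero_one_loss(y, f(x)) for x, y in sample) / len(sample)

print(population_risk, empirical_risk)
```

The population risk here is P(y != x) = 0.1 + 0.2; the empirical risk should land close to it for a sample this large.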

14
Q

The Population risk minimiser in F

A

f* = arginf_{f ∈ F} R(f)

The * indicates optimality
“Best in family” model
(now assuming we have infinite data)
It is a hypothetical model: the optimal model in F with no restrictions on the data

15
Q

The population risk minimiser in Ω

A

y* = arginf_{f ∈ Ω} R(f)

Also known as the Bayes model or Bayes prediction (not to be confused with naive Bayes)
This is a hypothetical model, the optimal model with no restriction on the model family and no restrictions on the data
Minimises the risk over all possible models (not restricted to a family)

16
Q

What is the population risk minimiser for squared loss

A

For squared loss, y* = E[y | x]
(the expected value of y given x)

this is not computable as we do not have access to the full data distribution
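
If the conditional distribution were known, the claim could be checked numerically. The sketch below assumes a made-up p(y | x) at one fixed x and compares candidate predictions on a coarse grid; the grid and the probabilities are illustrative only.

```python
# Hypothetical conditional distribution p(y | x) at one fixed input x.
p_y_given_x = {0.0: 0.5, 1.0: 0.25, 4.0: 0.25}


def expected_squared_loss(prediction):
    """Expected squared loss of a constant prediction under p(y | x)."""
    return sum(p * (y - prediction) ** 2 for y, p in p_y_given_x.items())


conditional_mean = sum(p * y for y, p in p_y_given_x.items())

# Candidate predictions 0.0, 0.25, ..., 4.0.
candidates = [i / 4 for i in range(17)]
best = min(candidates, key=expected_squared_loss)

print(conditional_mean, best)  # 1.25 1.25
```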

17
Q

What is the population risk minimiser for 0/1 loss

A

y* = argmax_y p(y | x)

this is not computable as we do not have access to the full data distribution
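
A small sketch of why the most probable label is optimal, assuming a made-up p(y | x) at one fixed x: the expected 0/1 loss of predicting a label is the probability that the true label is something else.

```python
# Hypothetical conditional distribution p(y | x) at one fixed input x.
p_y_given_x = {"cat": 0.5, "dog": 0.3, "bird": 0.2}


def expected_zero_one_loss(prediction):
    """Probability that the true label differs from the prediction."""
    return sum(p for y, p in p_y_given_x.items() if y != prediction)


# Bayes prediction: the label with the highest conditional probability.
bayes_prediction = max(p_y_given_x, key=p_y_given_x.get)

# The same label also minimises the expected 0/1 loss.
best_by_risk = min(p_y_given_x, key=expected_zero_one_loss)

print(bayes_prediction, best_by_risk)  # cat cat
```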

18
Q

What is f_erm

A

empirical risk minimiser
the global minimum of the empirical risk

19
Q

What is y*

A

the bayes model
the global minimum of the population risk

20
Q

What is f*

A

the best model in our family

21
Q

What is R() denotation

A

The population risk of its argument
R(f_erm) = population risk of the ERM
R(f) = population risk of f
R(y*) = population risk of the Bayes model

22
Q

What is excess risk

A

R(f) - R(y*)
This quantity tells us how much further we could hypothetically reduce the population risk, since it is impossible to do any better than the Bayes model y*

Here f is generic; it could be specialised to f_erm

23
Q

What is the Approximation/Estimation decomposition

A

R(f_erm) - R(y*) = [R(f_erm) - R(f*)] + [R(f*) - R(y*)]
Excess risk of f_erm = estimation error + approximation error

24
Q

What is Approximation error

A

Error due to the restricted model family being unable to represent the Bayes model
R(f*) - R(y*)

25
Q

What is Estimation error

A

Error due to having a small sample, where the empirical risk is a poor estimate of the population risk
R(f_erm) - R(f*)

26
Q

Effect of model family size on approximation error

A

Approximation error decreases, because a larger model family contains models that are closer to the Bayes model

27
Q

Effect of model family size on estimation error

A

Estimation error increases, because with a larger model space and the same finite sample it is harder to find the best model (f*) in the space

28
Q

How is the excess risk curve related to under-fitting and over-fitting

A

Under-fitting is where the gradient is negative (function class is too small, relative to the amount of data) -> higher approximation error

Over-fitting is where the gradient is positive (function class is too large, relative to the amount of data) -> higher estimation error
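
The shape of the curve can be sketched with made-up numbers. The error values below are hypothetical, invented only to show a falling approximation error and a rising estimation error across five model families of increasing size.

```python
# Hypothetical error values for five model families, smallest to largest.
approximation_error = [0.40, 0.20, 0.10, 0.05, 0.03]  # falls as F grows
estimation_error = [0.01, 0.03, 0.08, 0.20, 0.45]     # rises as F grows

excess_risk = [a + e for a, e in zip(approximation_error, estimation_error)]
best = min(range(len(excess_risk)), key=lambda i: excess_risk[i])

for i, risk in enumerate(excess_risk):
    if i < best:
        regime = "under-fitting (curve still falling)"
    elif i > best:
        regime = "over-fitting (curve rising)"
    else:
        regime = "minimum of the curve"
    print(i, round(risk, 2), regime)
```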

29
Q

What is Optimisation Error

A

R(f) - R(f_erm), where f is the model the learning algorithm actually returns

The component of the excess risk caused by a poor learning algorithm
f_erm assumes we can find the global minimum of the empirical risk on the training data set, but in reality we end up with a sub-optimal model f
It can be negative

30
Q

Is it possible to have zero optimisation error

A

Yes, in some situations R(f) - R(f_erm) = 0
E.g. deep decision trees

31
Q

Approximation/Estimation/Optimisation decomposition

A

R(f) - R(y*) = [R(f) - R(f_erm)] + [R(f_erm) - R(f*)] + [R(f*) - R(y*)]
Excess risk of f = optimisation error + estimation error + approximation error
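
The decomposition is a telescoping sum, which a quick numeric check makes concrete. The four risk values below are hypothetical; none of them could be computed in practice, since they are all population risks.

```python
# Hypothetical population risks, chosen only for illustration.
risk_f = 0.30        # model actually returned by the learning algorithm
risk_f_erm = 0.26    # empirical risk minimiser
risk_f_star = 0.20   # best model in the family F
risk_bayes = 0.125   # Bayes model y*

optimisation_error = risk_f - risk_f_erm
estimation_error = risk_f_erm - risk_f_star
approximation_error = risk_f_star - risk_bayes
excess_risk = risk_f - risk_bayes

# The intermediate terms cancel, so the three errors sum to the excess risk.
total = optimisation_error + estimation_error + approximation_error
print(abs(total - excess_risk) < 1e-12)  # True
```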

32
Q

What is X

A

Some input space

33
Q

What is Y

A

some output space

34
Q

What does a subscript variable after argmin mean

A

E.g. argmin_y
Means finding the value of y that minimises the function (the minimiser itself, not the minimum value)
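
A short sketch of the difference between min and argmin, using an arbitrary example function g and a small candidate list (both hypothetical):

```python
def g(y):
    return (y - 3) ** 2


candidates = [0, 1, 2, 3, 4, 5]

min_value = min(g(y) for y in candidates)  # min_y g(y): the smallest value
argmin_y = min(candidates, key=g)          # argmin_y g(y): the y that attains it

print(min_value, argmin_y)  # 0 3
```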

35
Q

which R() value comes first in each of Approximation/Estimation/Optimisation terms

A

The one that does worse (has the higher risk)
E.g. approximation error = R(f*) - R(y*)

36
Q

what is the approximation error in word form

A

The choice of model family

37
Q

what is the optimisation error in word form

A

The choice of learning algorithm

38
Q

what is the estimation error in word form

A

The quality/amount of data

39
Q

why is the optimisation/estimation/approximation a difficult balance

A

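All three components interact: for example, a larger model family lowers approximation error but raises estimation error and can make optimisation harder, so no component can be tuned in isolation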

40
Q

In the formula for population risk, why is an integral over x taken

A

Integrates the loss function, weighted by the probability distribution p(x,y), over all x values
Accumulates the loss for a given true label y, considering all possible inputs x

41
Q

In the formula for population risk, why is an integral over y taken

A

Computes the average loss incurred by the model f (weighted by p(x,y)) over all possible combinations of x and y
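
The formula these two cards refer to is not reproduced in the deck. Assuming it is the usual definition for continuous x and y with joint density p(x, y), it would read as follows, where the symbol names are the ones used in the earlier cards:

```latex
R(f) = \int_{\mathcal{Y}} \int_{\mathcal{X}} \ell\bigl(y, f(x)\bigr)\, p(x, y)\, dx\, dy
     = \mathbb{E}_{(x, y) \sim D}\bigl[\ell\bigl(y, f(x)\bigr)\bigr]
```

The inner integral runs over inputs x, the outer one over labels y; together they give the expected loss under D.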