Topic 10: Rademacher Complexity: a measure of capacity Flashcards

1
Q

What is Rademacher Complexity

A

A concept that helps us measure how well our model will do on new, unseen data without being tied to any specific training method

2
Q

What can be said about the space of all possible f that an algorithm is effectively searching through

A

It can be an infinite-dimensional space

Whilst the space of its parameters is finite-dimensional

3
Q

How can we define F in terms of the algorithm

A

F ∶= the set of all possible models (predictors) mapping R^d → R^k that a given training algorithm is searching over

4
Q

What is ALG

A

Any chosen training algorithm which takes as input the 3-tuple (ℓ, Sn, F) and searches through the above F for the “best” predictor

5
Q

What is inf f∈F R(f)

A

The infimum (greatest lower bound) of the population risk R(f) over all f ∈ F
It is precisely what we want the risk to be - as low as possible

6
Q

What randomness do we account for

A

The randomness inherent in the training algorithm, and the randomness introduced by sampling the training dataset from the underlying distribution

There may be other sources of randomness that we’re not taking into account

7
Q

What is stochasticity

A

Randomness or unpredictability in a process or system

8
Q

What is arginf

A

The argument of the infimum: the input value(s) at which a given function attains its infimum (the greatest lower bound)
f_ERM is the arginf of the empirical risk R̂(f, Sn)

9
Q

How do we account for all randomness in ALG

A

Replace f_ERM with ALG(ℓ, Sn, F) and average over all possible random outcomes of the algorithm to get a more comprehensive measure of risk

10
Q

What do we know about R(ALG(ℓ,Sn, F))

A

Not much
it is a random variable that we do not know how to control

11
Q

What is the worst generalisation gap between population and empirical risk

A

E Sn∼Dn [ sup f∈F ( R̂(f, Sn) − R(f) ) ]

Trying to control this instead frees us from the details of any particular ML algorithm trying to find arginf f∈F R(f)

We are instead focused on the broader concept of how the choice of predictor class F and data distribution D interact to find the best predictor f

12
Q

What is Empirical Rademacher Complexity

A

R̂n(H) =
Eϵ [ sup h∈H 1/n ∑ ϵi h(zi) ]

The empirical Rademacher complexity tells us how well the functions in H can adapt to random fluctuations in the dataset

It checks how much these functions can correlate with random sign noise, and then takes the average of the maximum correlation across all functions in H (ϵi is the random noise)
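As a sketch, the expectation over ϵ can be estimated by Monte Carlo when H is a finite class evaluated on a fixed sample; the helper name `empirical_rademacher`, the trial count, and the toy two-function class below are illustrative assumptions, not notation from the cards:

```python
import numpy as np

def empirical_rademacher(H_values, n_trials=2000, seed=0):
    """Monte Carlo estimate of R̂n(H) for a FINITE function class.

    H_values: array of shape (|H|, n); row j holds h_j evaluated
    on the fixed sample z_1, ..., z_n.
    """
    rng = np.random.default_rng(seed)
    _, n = H_values.shape
    total = 0.0
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=n)  # random sign (noise) vector
        total += np.max(H_values @ eps) / n    # sup over h of 1/n ∑ ϵi h(zi)
    return total / n_trials

# Toy class: two constant functions on a sample of size n = 100.
# Constants cannot track random signs, so the estimate is small (~1/√n).
H_vals = np.array([[1.0] * 100, [-1.0] * 100])
print(empirical_rademacher(H_vals))
```

A richer class that can match more sign patterns would drive this number up, which is the sense in which it measures capacity.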

13
Q

What is Average Rademacher Complexity

A

Rn(H) =
E Sn∼Dn ( Eϵ [ sup h∈H 1/n ∑ ϵi h(zi) ] )

A way to measure the complexity of the function class H when considering all possible datasets drawn from the distribution D

note z = (x,y) and h(z) = ℓ(y, f(x))
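The extra outer expectation over datasets can also be estimated by Monte Carlo; the 1-D threshold class and the Uniform(0, 1) choice of D below are invented purely for illustration:

```python
import numpy as np

def avg_rademacher_thresholds(n, n_datasets=100, n_noise=100, seed=0):
    """Monte Carlo estimate of Rn(H) for the illustrative class
    H = { z -> sign(z - t) : t on a fixed grid }, with z ~ Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 21)
    outer = 0.0
    for _ in range(n_datasets):              # E Sn∼Dn: resample the dataset
        z = rng.uniform(0.0, 1.0, size=n)
        H_vals = np.sign(z[None, :] - thresholds[:, None])  # |H| x n values
        inner = 0.0
        for _ in range(n_noise):             # Eϵ: resample the sign vector
            eps = rng.choice([-1.0, 1.0], size=n)
            inner += np.max(H_vals @ eps) / n
        outer += inner / n_noise
    return outer / n_datasets

print(avg_rademacher_thresholds(n=50))
```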

14
Q

What does Sn ∼ Dn mean

A

The elements of Sn are sampled independently and identically from a common distribution D

15
Q

What is the Expectation of the Worst Generalisation Gap equation

A

E Sn∼Dn [ sup h∈H ( 1/n ∑ h(zi) − Ez∼D [h(z)] ) ] ≤ 2Rn(H)

The theorem is about understanding how well the functions in H generalize to unseen data compared to how they perform on the data they were trained on

16
Q

What does ℓ ○ F mean

A

Applying the loss function ℓ to each predictor in F
H = ℓ ○ F for some class of predictors F

17
Q

What is the upper bound of the worst generalisation gap

A

2Rn(H)
An upper bound on the worst generalisation gap that depends on the loss, the predictor class, the distribution, and the number of samples - useful

18
Q

How do we limit rademacher complexity

A

Use a function C(H,D) to set an upper limit on this complexity

Ideally Rn(H) is less than or equal to C(H,D) / √n

where n is the number of training samples

It shows our model isn’t overly complex for our data size
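A minimal sketch of this budget, assuming an illustrative capacity constant C = 2.0 (`complexity_budget` is a made-up helper name):

```python
import math

def complexity_budget(C, n):
    """The desired upper limit C(H, D) / sqrt(n) on Rn(H):
    a fixed capacity constant that shrinks as the sample grows."""
    return C / math.sqrt(n)

# Quadrupling n halves the budget: the 1/sqrt(n) rate
for n in (100, 400, 1600):
    print(n, complexity_budget(2.0, n))
```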

19
Q

What is sample complexity

A

An estimate of how many training samples we need for a certain prediction accuracy

Use
n ≥ ( 2C(H,D) / ϵ )^2

Where ϵ is the worst-case prediction error we are willing to tolerate
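The estimate comes from solving 2C(H, D)/√n ≤ ϵ for n; a small sketch, where `sample_complexity` is a hypothetical helper and C = 1.0 an assumed capacity constant:

```python
import math

def sample_complexity(C, eps):
    """Smallest n with 2*C/sqrt(n) <= eps, i.e. n >= (2C / eps)^2,
    so the bound on the worst-case gap is at most eps."""
    return math.ceil((2.0 * C / eps) ** 2)

print(sample_complexity(C=1.0, eps=0.1))  # 400 samples for a tolerated gap of 0.1
```

Note the quadratic blow-up: halving ϵ quadruples the required number of samples.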

20
Q

When are sample complexity estimates wildly overestimated

A

When the function class F gets more complicated

21
Q

What does C tell us about the number of parameters

A

C is not dependent on the number of parameters in the functions in H
So we can vary parameters without changing C

22
Q

What is Symmetrisation

A

Introduces additional symmetry into a problem
Eg creating a “ghost sample” of data points that are sampled independently from the same distribution as the original data

Symmetrical datasets = sets of data where the statistical properties are the same when certain transformations are applied

23
Q

Why do we take the supremum in rademacher complexity

A

This operation gives us the maximum value of a certain expression across all functions in H
By taking the supremum, we are essentially finding the function in H that maximally contributes to the expression, which helps us understand the worst-case scenario, or the maximum complexity

24
Q

Why do we take the expectation in rademacher complexity

A

We take the expectation over a set of random variables that represent noise
This gives us the average behavior of the supremum over different realizations of the random noise (the typical behavior of the model)

25
Q

In the supervised setting, what is z in rademacher complexity formula

A

(x,y)

26
Q

In the supervised setting, what is h(z) in rademacher complexity formula

A

ℓ(y, f(x))

27
Q

What is H

A

H = real valued function class

28
Q

What is ϵ

A

ϵ = n-dimensional random vector where each coordinate is independently +1 or -1, each with probability 1/2
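A quick sketch of sampling such a vector with NumPy (the seed and the choice of n = 10 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# A Rademacher vector: n coordinates, each independently +1 or -1
# with probability 1/2, so each has mean 0 and variance 1
eps = rng.choice([-1, 1], size=10)
print(eps)
```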

29
Q

What is 1/n ∑h(zi)

A

the average performance on the sampled data

30
Q

What is Ez∼D [h(z)]

A

the average performance on all possible data