C3 Flashcards

1
Q

what are the two parts of two-part MDL?

A

first part: use some optimal code to specify the right Turing machine, x

second part: use some optimal code to specify the right content for the tape, y

“right” means that the specified Turing machine and tape content combination outputs the given string

2
Q

3 problems with Kolmogorov complexity

A
  1. uncomputability (halting problem)
  2. large constants (depends on the universal TM we choose)
  3. the universal TM is extremely expressive, which causes serious problems in practice
3
Q

MDL

A

Minimum Description Length principle:
given a class of models, the best model M is the one that minimises L(M) + L(D|M)
with L(M) the length in bits of the description of model M and L(D|M) the length in bits of the description of the data D when encoded with M

4
Q

example: describing polynomials

A

L(M) is the length needed to encode the coefficients at a certain precision
L(D|M) is the length needed to encode the residuals: the differences between the actual values and the values predicted by the model
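The polynomial example can be sketched as a small experiment. This is my own minimal illustration, not the course's implementation: it charges a hypothetical fixed 16 bits per coefficient for L(M) and approximates L(D|M) with an idealised Gaussian code for the residuals; the function name and the precision are assumptions.

```python
import numpy as np

def two_part_mdl(x, y, degree, coef_bits=16):
    """Crude two-part MDL score (in bits) for a polynomial fit.

    L(M): each of the (degree + 1) coefficients stored at a fixed,
          assumed precision of coef_bits bits.
    L(D|M): residuals encoded with an idealised Gaussian code,
            costing -log2 N(r; 0, sigma) bits per residual.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    l_model = (degree + 1) * coef_bits
    sigma = residuals.std() + 1e-9          # avoid log(0) on a perfect fit
    l_data = np.sum(0.5 * np.log2(2 * np.pi * sigma**2)
                    + residuals**2 / (2 * sigma**2 * np.log(2)))
    return l_model + l_data

# Noisy samples from a quadratic: the score should bottom out at degree 2.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, x.size)
scores = {d: two_part_mdl(x, y, d) for d in range(8)}
best = min(scores, key=scores.get)
```

Higher degrees keep shrinking the residuals slightly, but pay more in L(M) than they save in L(D|M), so the crude two-part score is minimised at the true degree.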

5
Q

model class (statistics)

A

all polynomials

6
Q

model (statistics)

A

all polynomials of kth degree

7
Q

point hypothesis

A

one specific polynomial (with coefficients specified)

8
Q

problem with crude two-part MDL

A
  • L(M) needs to be defined
  • multiple code words for a single dataset exist, which cannot be optimal (the two-part code is incomplete)
9
Q

universal coding

A

associate a code not with a single model, but with all m in the set of models => construct a code L(D|M) that is small whenever there is an m in M that fits the data well

10
Q

refined MDL

A

uses universal codes (guaranteed to be minimax optimal)
developed for three tasks:
1. model selection
2. point hypothesis selection
3. prediction

12
Q

Kraft’s inequality

A

for an alphabet S, any prefix code C must satisfy sum_{x in S} s^(-L_C(x)) <= 1, where s is the number of code symbols (s = 2 for binary codes)
=> code lengths correspond to probabilities, and a probability distribution is a code
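A quick numerical check of the inequality for a binary code alphabet (s = 2); the example code words are made up for illustration:

```python
import math

# A binary prefix code (s = 2) for a 4-symbol alphabet.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft's inequality: sum over symbols of s^(-L_C(x)) <= 1.
kraft_sum = sum(2 ** -len(w) for w in code.values())
assert kraft_sum <= 1  # 1/2 + 1/4 + 1/8 + 1/8 = 1

# The correspondence both ways: code lengths define probabilities
# P(x) = 2^(-L(x)), and a distribution P defines lengths L(x) = -log2 P(x).
probs = {x: 2 ** -len(w) for x, w in code.items()}
lengths = {x: -math.log2(p) for x, p in probs.items()}
```

A code whose Kraft sum equals exactly 1 is called complete: it wastes no code space.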

13
Q

Normalised Maximum Likelihood (NML) code

A

example of a universal code: the code given by the normalised distribution of maximum likelihoods
P_nml(D) = P_{θ̂(D)}(D) / sum_{D'} P_{θ̂(D')}(D'), where the sum runs over all possible datasets D'

the likelihood distribution P_θ gives the probability of the data given parameter(s) θ, and the maximum likelihood estimator is θ̂(D) = argmax_{θ in Θ} P_θ(D)
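A minimal sketch of NML for the Bernoulli model over binary sequences of length n (my own illustration; it brute-forces the normaliser over all 2^n sequences, which is only feasible for small n):

```python
from itertools import product

def ml_prob(seq):
    """P_{theta_hat(seq)}(seq): likelihood at the ML estimate theta_hat = k/n."""
    n, k = len(seq), sum(seq)
    theta = k / n
    # Python's 0**0 == 1 handles the all-zeros and all-ones sequences
    return theta**k * (1 - theta)**(n - k)

def p_nml(seq):
    """Normalised maximum likelihood: divide by the sum over all sequences."""
    n = len(seq)
    comp = sum(ml_prob(s) for s in product([0, 1], repeat=n))  # normaliser
    return ml_prob(seq) / comp

# For n = 2 the ML likelihoods are 1, 1/4, 1/4, 1, so the normaliser is 2.5.
p = p_nml((1, 1))  # 1 / 2.5 = 0.4
```

The normaliser (the log of which is called the model complexity) is what makes P_nml a proper distribution even though each term uses a different, data-dependent parameter.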

14
Q

prequential plug-in code

A

we want to encode messages over some alphabet, but initially we don’t know anything
1. start with some low count and define probabilities accordingly (we don’t like probabilities of 0)
2. transmit symbols one by one and update counts, meanwhile compute probabilities

the resulting probabilities converge to the maximum likelihood estimate
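The two steps can be sketched for a binary alphabet, assuming Laplace smoothing (start every count at 1) as the "low count" rule; the function name is my own:

```python
import math

def prequential_code_length(symbols, alphabet=(0, 1)):
    """Total code length (in bits) of a prequential plug-in code.

    Each symbol is encoded with the probability predicted from the
    counts seen so far; only then are the counts updated.
    """
    counts = {a: 1 for a in alphabet}   # pseudo-count 1: no zero probabilities
    total_bits = 0.0
    for sym in symbols:
        p = counts[sym] / sum(counts.values())  # plug-in predictive probability
        total_bits += -math.log2(p)
        counts[sym] += 1                        # update after transmitting
    return total_bits

# A heavily biased sequence costs far fewer bits than 1 bit per symbol.
data = [1] * 90 + [0] * 10
bits = prequential_code_length(data)
```

Because the predictions track the empirical frequencies ever more closely, the total code length approaches n times the entropy of the source, well below the n bits of a uniform code.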

15
Q

what does MDL give us?

A

MDL gives us the best model within the class of models considered (more modest than Kolmogorov complexity, but we can compute it). It does not assume one model to be the “true” model

16
Q

benefits of MDL

A
  • no overfitting (Occam’s razor)
  • non-parametric modelling possible (often desirable in realistic scenarios)
  • can also be used beyond prediction