PAC Learning Flashcards
(44 cards)
What is proof by contraposition?
To prove a statement “P implies Q”, you instead prove the logically equivalent contrapositive “not Q implies not P”
What’s the intuition of the VC Dimension?
Even if you have infinitely many hypotheses in your hypothesis class, given a particular training sample, many of those hypotheses produce exactly the same labeling of that sample, so they are functionally identical with respect to it
What is h? What does it do?
- A hypothesis.
- Applied to some dataset S, generates a labeling of S
What is S?
the dataset
For the realizable setting, what are the relations between the bound and 1) epsilon, 2) abs(H)?
- Bound is inversely linear in epsilon (e.g. halving the error requires double the examples)
- Bound is only logarithmic in |H| (e.g. quadrupling the hypothesis space only requires double the examples)
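For reference, a minimal sketch of the standard finite-|H| realizable-case sample complexity bound (a standard PAC result; exact constants depend on the course's formulation):
% With probability at least 1 - \delta, every h \in H that is consistent with
% the training sample has true error at most \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
Halving epsilon doubles the required m (inversely linear), while |H| enters only through ln|H|.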
For the agnostic setting, what are the relations between the bound and 1) epsilon, 2) abs(H)?
- Bound is inversely quadratic in epsilon (e.g. halving the error requires 4x the examples)
- Bound is only logarithmic in |H| (i.e. same as Realizable case)
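For reference, a minimal sketch of the standard finite-|H| agnostic-case bound, obtained from Hoeffding's inequality plus a union bound over H (constants again depend on the formulation):
% With probability at least 1 - \delta, |true error - training error| <= \epsilon
% holds simultaneously for every h \in H, provided
m \;\ge\; \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{2}{\delta}\right)
The 1/\epsilon^2 dependence is why halving the error now costs roughly 4x the examples.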
What is shattering?
H shatters S if, for every possible labeling of S, there is some hypothesis in H that classifies S exactly according to that labeling
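Because shattering is a “for every labeling, there exists a hypothesis” condition, it can be checked by brute force on small sets. A minimal Python sketch (not from the course), using 1-D threshold classifiers h_t(x) = 1[x >= t] as the hypothesis class:

from itertools import product

def candidate_thresholds(points):
    # Enough thresholds to realize every distinct behavior on these points:
    # one below all points, one between each adjacent pair, one above all.
    xs = sorted(points)
    return [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

def shatters(points):
    # The thresholds shatter `points` iff every labeling in {0,1}^|S|
    # is realized by some threshold t (i.e. by some h in H).
    for labeling in product([0, 1], repeat=len(points)):
        realized = any(
            all((1 if x >= t else 0) == y for x, y in zip(points, labeling))
            for t in candidate_thresholds(points)
        )
        if not realized:
            return False
    return True

print(shatters([0.5]))       # True: any single point can be shattered
print(shatters([0.5, 1.5]))  # False: the labeling (1, 0) is unachievable, so the VC-dim of thresholds is 1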
What is the VC-dimension?
Formal definition: the VC-dimension (Vapnik–Chervonenkis dimension) of ℋ is the cardinality of the largest set S such that ℋ can shatter S.
Equivalent phrasing (from recitation): the VC dimension of a hypothesis space H is the maximum number of points such that there exists at least one arrangement of those points for which, for every labeling of the arrangement, some hypothesis h ∈ H is consistent with that labeling
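In symbols (standard notation, restating the definition above):
\mathrm{VC}(\mathcal{H}) \;=\; \max\{\, |S| \;:\; \mathcal{H} \text{ shatters } S \,\}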
When is the VC dimension infinity?
If ℋ can shatter arbitrarily large finite sets, then the VC-dimension of ℋ is infinity
To prove that VC(H) = some value M, what do you have to do?
- Exhibit one set of M points that H can shatter (shows VC(H) ≥ M)
- Show that no set of M+1 points can be shattered by H (shows VC(H) < M+1)
What’s the VC dimension of linear separators in n dimensions?
n+1
What does ∃ mean?
There exists
(high level) What’s the corollary to Theorem 1 of PAC learning?
It gives a numerical bound on the true error
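One standard form of such a bound is the VC-based generalization bound (the course's corollary may differ in constants): with probability at least 1 - \delta, simultaneously for every h \in \mathcal{H},
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\mathrm{VC}(\mathcal{H})\left(\ln\frac{2m}{\mathrm{VC}(\mathcal{H})} + 1\right) + \ln\frac{4}{\delta}}{m}}
where R(h) is the true error, \hat{R}(h) the training error, and m the sample size. The second term grows with VC(ℋ), which is exactly the training-error vs. complexity tradeoff in the next card.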
What’s the key idea that makes Corollary 4 of Theorem 1 of PAC learning useful?
- We want to trade off between low training error and keeping H simple (i.e. low VC(H))
- We can tune the lambda parameter of the regularizer to hopefully land at the sweet spot in the graph

What are the practical ways we can trade off between low training error and keeping H simple? (1)
Use a regularizer
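A minimal Python sketch of this idea (hypothetical synthetic data; assumes scikit-learn's Ridge, whose alpha parameter plays the role of the lambda regularization weight): sweep the regularization strength and keep the value with the lowest validation error.

# Tune the regularization weight by validation error to trade off
# training error against hypothesis-class complexity.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)  # synthetic regression data

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_alpha, best_val_err = None, float("inf")
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_err = mean_squared_error(y_val, model.predict(X_val))
    if val_err < best_val_err:
        best_alpha, best_val_err = alpha, val_err

print(best_alpha, best_val_err)  # the "sweet spot" between fit and simplicity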
What are discriminative models?
Models that directly learn the mapping from inputs x to labels y, e.g. by modeling p(y|x) or the decision boundary, rather than modeling the joint distribution p(x, y) the way generative models do
What’s a pmf?
p(x) : Function giving the probability that discrete r.v. X takes value x.
What’s a pdf?
f(x) : Function that returns a nonnegative real indicating the relative likelihood of a continuous r.v. X near the value x; the probability that X falls in an interval is the integral of f over that interval
What’s a cdf? What’s the symbol?
F(x) : Function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x)
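The three preceding definitions, as standard identities in one place:
p(x) = P(X = x) \quad \text{(pmf, discrete } X\text{)}
P(a \le X \le b) = \int_a^b f(x)\,dx \quad \text{(pdf, continuous } X\text{)}
F(x) = P(X \le x) = \sum_{t \le x} p(t) \;\text{ or }\; \int_{-\infty}^{x} f(t)\,dt \quad \text{(cdf)}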
What does a beta distribution look like?
A distribution on [0, 1] with density proportional to x^(α−1)(1−x)^(β−1): uniform when α = β = 1, U-shaped when both parameters are below 1, and unimodal on [0, 1] when both are above 1
What does a Dirichlet distribution look like?
The multivariate generalization of the beta distribution: a distribution over probability vectors (points on the simplex), with density proportional to ∏_k θ_k^(α_k − 1)
What’s the symbol for expected value of x?
E[X]
What are the equations for expected value?
- Discrete: E[X] = Σ_x x · p(x)
- Continuous: E[X] = ∫ x · f(x) dx
What’s the equation for variance (one that applies to both discrete and continuous)?
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²