midterm Flashcards
When would boosting overfit?
If the weak learners overfit! A bunch of things that overfit when combined can only overfit as well.
The other case is when there is pink noise (uniform noise)
How is SVM like KNN?
SVM is like KNN, except you've already done the work of figuring out which points matter: only the local points (the support vectors) matter, so you don't need all of the points.
KNN -> points that are closer together ought to have similar labels
What is a notion of similarity in SVM?
In the quadratic program there is a term x_i^T x_j, which is the dot product of the two vectors. This tells you whether they are orthogonal (0), point in the same direction (large positive), or point in opposite directions (large negative). It is a form of similarity.
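A minimal numpy sketch (the vectors are made up for illustration) of how the dot product behaves as a similarity score:

    import numpy as np

    x = np.array([1.0, 0.0])
    same = np.array([2.0, 0.0])       # points the same way
    ortho = np.array([0.0, 3.0])      # orthogonal
    opposite = np.array([-2.0, 0.0])  # points the opposite way

    print(np.dot(x, same))      # 2.0  -> large positive: similar
    print(np.dot(x, ortho))     # 0.0  -> orthogonal: unrelated
    print(np.dot(x, opposite))  # -2.0 -> negative: dissimilar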
What is the kernel trick?
K(x_i, x_j) is your similarity metric; it's where you inject domain knowledge. It acts as a dot product that implicitly projects your data into a higher-dimensional space where it can (could?) be linearly separable.
You don't actually have to transform your data into that higher-dimensional space, though, because of the kernel trick.
The kernel should really capture your domain knowledge.
The datapoints (x_i, x_j) can be continuous or discrete!
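A minimal sketch of the trick, assuming the degree-2 polynomial kernel K(x_i, x_j) = (x_i^T x_j)^2 as the example (my choice, not from the cards): the kernel value computed in the original 2-D space matches the dot product after an explicit projection into 3-D.

    import numpy as np

    def phi(x):
        # Explicit degree-2 feature map for 2-D input:
        # phi([x1, x2]) = [x1^2, x2^2, sqrt(2)*x1*x2]
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    def k(x, y):
        # Kernel trick: the same value, but computed in the original
        # 2-D space -- no explicit projection needed.
        return np.dot(x, y) ** 2

    a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(np.dot(phi(a), phi(b)))  # 16.0
    print(k(a, b))                 # 16.0 -- identical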
What is an SVM kernel?
The kernel function is a function that can be represented as the dot product of a mapping function, K(x_i, x_j) = phi(x_i)^T phi(x_j) (the kernel method).
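For example (a standard illustration, not from the cards), the degree-2 polynomial kernel in 2-D is exactly the dot product of an explicit mapping phi:

\[
K(x, y) = (x^T y)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \phi(x)^T \phi(y),
\qquad \phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\right).
\]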
What is the Mercer Condition?
“It acts like a distance or a similarity” which means it’s a well behaved distance function.
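Roughly, Mercer's condition requires that the kernel's Gram matrix be positive semidefinite for any set of points. A sketch of that check in numpy, using the RBF kernel and random points of my own choosing:

    import numpy as np

    def rbf(x, y, gamma=1.0):
        # RBF kernel, a standard Mercer kernel
        return np.exp(-gamma * np.sum((x - y) ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))                        # 5 arbitrary points
    K = np.array([[rbf(a, b) for b in X] for a in X])  # Gram matrix

    print(np.all(np.linalg.eigvalsh(K) >= -1e-9))      # True: K is PSD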
How is boosting related to SVM?
As you add more learners, your error doesn’t necessarily change but you increase your confidence. This is like the margin in SVM.
Why do you want large margins in SVM?
Large margins mean the ability to generalize better and avoid overfitting.
What is Version space in ML?
Set of all hypotheses that are consistent with the data.
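A toy sketch, assuming H is the set of 1-D thresholds h_theta(x) = (x >= theta) and a made-up dataset; the version space is whichever hypotheses get every example right:

    # Hypotheses: label x positive iff x >= theta
    data = [(1.0, False), (2.0, False), (4.0, True), (5.0, True)]
    thetas = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]

    def consistent(theta):
        return all((x >= theta) == label for x, label in data)

    version_space = [t for t in thetas if consistent(t)]
    print(version_space)  # [2.5, 3.5] -- the hypotheses consistent with all the data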
What are mistake bounds?
The maximum number of misclassifications a learner can make over an infinite run.
1) Assume each variable could appear both positive AND negated (A && NOT A && B && NOT B && …)
2) Given an input, compute the output of the current hypothesis; since it contains both A and NOT A, you start out always guessing FALSE
3) The first time you're wrong, keep that input's bit pattern in your positive/negated table
4) Every time you're wrong thereafter, set positive variables that were 0 to absent and negated variables that were 1 to absent, then goto 2
The idea is that whenever you see a true response, any bits that are flipped compared with a previous true response must correspond to absent literals, because they don't determine the output.
You will make at most k + 1 mistakes (k being the number of variables, not the number of hypotheses); see the sketch below.
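A sketch of the learner described above (names are my own); it tracks which literals are still allowed to be positive or negated:

    # Mistake-bound learner for conjunctions of literals over k variables.
    # Start by assuming every variable appears both positive and negated.
    def make_learner(k):
        positive = set(range(k))   # indices still kept as positive literals
        negated = set(range(k))    # indices still kept as negated literals

        def predict(x):
            # With both x_i and NOT x_i present, the conjunction is always
            # False, so before the first mistake we always guess False.
            return all(x[i] for i in positive) and all(not x[i] for i in negated)

        def update(x):
            # Called only when we guessed False and the true label was True:
            # drop every literal that this example contradicts.
            positive.intersection_update(i for i in range(k) if x[i])
            negated.intersection_update(i for i in range(k) if not x[i])

        return predict, update

    predict, update = make_learner(3)
    x = [True, False, True]        # suppose the true label is True
    if not predict(x):             # first mistake ...
        update(x)                  # ... keeps this bit pattern
    print(predict(x))              # True from now on for this pattern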
What is computational complexity?
How much computational effort is needed for a learner to converge.
Why is boosting resistant to overfitting?
1) It keeps re-weighting the hard, low-confidence examples until they become high confidence
2) It maximizes the margin, meaning less overfitting because it will be able to generalize well
What is the training error vs True error of a hypothesis?
Training error -> the fraction of training examples misclassified by h.
True error -> the fraction of examples that would be misclassified on a sample drawn from distribution D.
What is Haussler’s theorem?
It is a way of bounding the true error as a function of the number of training examples:
m >= 1/epsilon * (ln |H| + ln(1/delta))
(delta is the failure probability)
(epsilon is the error parameter)
One concern: the number of samples depends on the size of the hypothesis space (what happens if you have a large, or infinite, hypothesis space?).
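Plugging assumed example numbers into the bound above (|H| = 1000, epsilon = 0.1, delta = 0.05; my own choices):

    import math

    H_size, epsilon, delta = 1000, 0.1, 0.05
    m = (1 / epsilon) * (math.log(H_size) + math.log(1 / delta))
    print(math.ceil(m))  # 100 -- this many samples suffice for these values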
What are VC dimensions?
“What kind of data can this classifier classify?”. This fundamental question can be answered using a classifier’s VC dimension.
“the cardinality (size) of the largest set of points that the classification algorithm can shatter”
How do you find the VC dimension?
You put some number of points down and then no matter how you label those points, you must be able to separate them.
VC is the number of parameters! For a d-dimensional hyperplane, VC = d + 1:
1-D threshold (theta) -> VC = 1
interval (a, b) -> VC = 2
2-D halfplane (W in R^2, theta) -> VC = 3
3-D hyperplane -> VC = 4
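A brute-force shattering check for the interval class above (points and grid are my own choices): two points can be shattered, three cannot, so VC = 2.

    from itertools import product

    def intervals_shatter(points):
        # H = { h_{a,b}(x) = (a <= x <= b) }. For every labeling of the points,
        # check that some interval (brute-forced over a grid) realizes it.
        grid = [p + d for p in points for d in (-0.5, 0.5)]
        for labels in product([False, True], repeat=len(points)):
            realized = any(
                all((a <= x <= b) == lab for x, lab in zip(points, labels))
                for a in grid for b in grid
            )
            if not realized:
                return False
        return True

    print(intervals_shatter([1.0, 2.0]))       # True:  2 points shattered
    print(intervals_shatter([1.0, 2.0, 3.0]))  # False: +,-,+ can't be realized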
How do VC dimensions relate to PAC?
H is PAC-learnable if and only if its VC dimension is finite. You can't PAC-learn it if the VC dimension is infinite.
What is the VC of finite H space?
There is a log relation between the size of a finite hypothesis class and the VC dimension: d = VC(H) <= log_2 |H|. (Shattering d points requires at least 2^d distinct hypotheses, so 2^d <= |H|.)
How does the VC dimension relate to sample complexity?
m >= 1/epsilon * (8 * VC(H) * log_2(13/epsilon) + 4 * log_2(2/delta)) -> infinite hypothesis space case
Like Haussler's theorem, except instead of ln |H| you use VC(H)… and the constants change.
The point is that you can calculate a minimum number of samples needed to be epsilon-exhausted as long as VC(H) is finite.
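Plugging assumed example numbers into this bound (VC(H) = 3, e.g. a halfplane in 2-D; epsilon = 0.1, delta = 0.05; my own choices):

    import math

    vc, epsilon, delta = 3, 0.1, 0.05
    m = (1 / epsilon) * (8 * vc * math.log2(13 / epsilon)
                         + 4 * math.log2(2 / delta))
    print(math.ceil(m))  # 1899 samples for these values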
What’s Bayes Rule?
Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
Pr(D) -> prior on the data (the normalizing term)
Pr(D|h) -> the likelihood of the data given the hypothesis (usually the easy part to think about)
Pr(h) -> prior on h (our domain knowledge)
It allows you to swap what's on either side of the bar.
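A small numeric sketch of the swap, with made-up numbers (a rare hypothesis, data that is likely under it):

    # Assumed numbers for illustration only
    p_h = 0.008             # Pr(h): prior on the hypothesis
    p_d_given_h = 0.98      # Pr(D|h): likelihood of the data under h
    p_d_given_not_h = 0.03  # Pr(D|not h)

    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)  # Pr(D)
    p_h_given_d = p_d_given_h * p_h / p_d                  # Bayes rule
    print(round(p_h_given_d, 3))  # 0.209 -- the bar has been swapped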