Chapter 6 Flashcards
(23 cards)
what are the three main ingredients of machine learning?
- a task
- historical data/experience
- performance measure
what is the definition of machine learning?
a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
what do we actually learn in (supervised) machine learning?
we learn a functional relationship (unknown target function f) that maps observations to target values
> think of the unknown target function as the underlying generator of the data we observe
what is the “unknown conditional target distribution”?
unknown conditional target distribution: p(y|x)
> used to account for noise in the target
> describes the probability distribution of y assuming that x is fixed
> workaround to avoid a purely deterministic function: compute f(x) and add noise, i.e. y = f(x) + noise
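a minimal sketch of this noise model; the sine target, the input range, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)  # stand-in for the unknown target function

x = rng.uniform(0, 2 * np.pi, size=100)  # draws from the input distribution p(x)
y = f(x) + rng.normal(0, 0.1, size=100)  # y = f(x) + noise realizes p(y|x)
```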
what is the “unknown input distribution”?
unknown input distribution: p(x)
> describes how observations are distributed
what aspect becomes important when plenty of observed data is available?
> checking for unbalanced data
>>> draw stratified sample from data so that each class/label is equally likely
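a sketch of drawing such a balanced sample with numpy; the helper name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsample(X, y):
    """Subsample so that each class/label appears equally often."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()  # size of the smallest class
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]
```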
what is the relationship between the loss and the risk functional?
> what about empirical risk?
loss: e(f(x), h(x)), a pairwise error measure between the target value and the estimated value
risk functional: E(f,h)
> the loss weighted by p(x) and integrated over all of X: E(f,h) = integral of e(f(x), h(x)) * p(x) dx
>>> f(x) and p(x) are unknown, so E(f,h) is impossible to calculate
therefore estimate:
empirical risk: the prediction error e(y_n, h(x_n)) averaged over all N training examples
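a minimal sketch of the empirical risk, with squared loss as the pairwise error measure (the function name is illustrative):

```python
import numpy as np

def empirical_risk(y, y_hat, loss=lambda a, b: (a - b) ** 2):
    """Average the pairwise loss e(y_n, h(x_n)) over all N training examples."""
    return np.mean([loss(a, b) for a, b in zip(y, y_hat)])

# usage: E_in = empirical_risk(y, h(x)) for some hypothesis h
```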
confusion matrix: explain
> precision
> recall
precision: true positives / (true positives + false positives)
> if we predict positive, how accurate are we
recall: true positives / (true positives + false negatives)
> how many of the actual positive cases do we identify
how to combine precision and recall?
we want to perform well on both precision and recall
> combine: F-measure
F = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
> beta: controls how precision and recall are weighted
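a sketch computing all three measures from confusion-matrix counts (the counts in the example call are made up):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# beta > 1 weights recall higher, beta < 1 weights precision higher
print(precision_recall_f(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)
```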
what is the hypothesis set H?
H is a set of functions that contains all potential hypotheses h we consider to be candidates to fit the unknown target function well
> often: H is an infinite set of functions
example: regression - H is set of all linear functions on the input space X
why do we need to restrict the hypothesis space H?
fewer restrictions mean more potential candidates f^ to choose from, making the selection harder
> Mitchell: without restricting the hypothesis space, learning as a selection of f^ from H is not possible
how to minimize in-sample error?
using gradient descent
how to implement gradient descent?
the gradient of a multiparameter real-valued function is the vector of the partial derivatives of that function
> it points in the direction of steepest ascent
> implement an update rule that steps in the opposite direction of the gradient, scaled by a certain learning rate (see the sketch below)
>>> find minimum in function
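a minimal sketch of the update rule, assuming the gradient is available as a function; the toy objective ||w||^2 with gradient 2w is an illustrative assumption:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad(w)  # opposite direction of the gradient
    return w

# usage on f(w) = ||w||^2, whose gradient is 2w; converges toward the minimum at 0
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```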
how does the in-sample error relate to the update rule?
widrow hoff algorithm:
> the larger the in-sample error (hence the deviation of the hypothesis h(x) from the observed y), the greater the magnitude of the update
> also called the “least-mean-squares” (LMS) training rule
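a sketch of one pass of the Widrow-Hoff rule, assuming a linear hypothesis h(x) = w · x and squared loss (the function name is illustrative):

```python
import numpy as np

def lms_epoch(X, y, w, eta=0.01):
    """One pass of the least-mean-squares rule for a linear hypothesis."""
    for x_n, y_n in zip(X, y):
        error = y_n - w @ x_n      # larger deviation -> larger update magnitude
        w = w + eta * error * x_n  # step toward reducing the squared error
    return w
```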
why do we need to choose a specific and limited hypothesis set before running the learning algorithm?
if we defined the hypothesis set to contain the set of all functions mapping X to Y, then learning would not work
> we could construct an infinite number of functions that yield an in-sample error of 0, but in general these would overfit the unknown target function
what does the ROC curve describe?
> how does the curve look for random classifier?
> how for good classifier?
ROC curve:
> describes the trade-off between the fraction of true positives out of all actual positives versus the fraction of false positives out of all actual negatives
aka
true positive rate (TPR) vs false positive rate (FPR)
> random classifier: diagonal
> good classifier: any curve above the diagonal, toward the top-left corner
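a sketch that traces the ROC curve by sweeping the decision threshold over all observed scores (numpy arrays of scores and 0/1 labels assumed):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = (labels == 1).sum()
    neg = (labels == 0).sum()
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / pos  # true positives / all positives
        fpr = (pred & (labels == 0)).sum() / neg  # false positives / all negatives
        points.append((fpr, tpr))
    return points
```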
what is theta_cut?
> how to select the right value?
theta_cut: for binary classification, the cutoff score that decides from which threshold value on the model predicts 1
> influences the ratio between the two kinds of mistakes
> in order to select theta_cut, we need to quantify the cost of false positives and false negatives
> given a cost matrix we can calculate the expected cost for all values of theta_cut >>> pick the value that minimizes the expected cost
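a sketch that picks theta_cut by minimizing expected cost; the per-mistake costs are made-up placeholders standing in for a real cost matrix:

```python
import numpy as np

def best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Return the cutoff with the lowest expected cost per observation."""
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        fp = (pred & (labels == 0)).sum()
        fn = (~pred & (labels == 1)).sum()
        cost = (cost_fp * fp + cost_fn * fn) / len(labels)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```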
what is regularization?
why do we need it?
regularization: a penalty term, controlled by a parameter, that regulates how many features will be included in the model
> models with a higher number of features are penalized
> why do we need it: if the number of features p exceeds the number of observations N, empirical risk minimization does not have a unique solution
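a sketch of one concrete form of regularization, L2 (ridge) regression; whether the chapter uses this variant is an assumption, but it shows why the penalty restores a unique solution when p > N:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2."""
    p = X.shape[1]
    # lam > 0 makes X^T X + lam * I invertible, so the minimizer is unique
    # even when the number of features p exceeds the number of observations N
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```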
why does the degree of higher-order polynomials in H need to be restricted?
when increasing the degree of multivariate polynomials in H, the hypotheses become more powerful at capturing complex relations >>> the in-sample error decreases
> at degree >= N-1, the in-sample error will be 0, since a polynomial of degree N-1 can interpolate all N training points
>>> overfitting
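a sketch demonstrating the effect on N = 8 hypothetical noisy points; at degree N-1 = 7 the polynomial interpolates them and the in-sample error hits (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=8)                   # N = 8 observations
y = np.sin(3 * x) + rng.normal(0, 0.1, size=8)   # illustrative noisy target

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)
    e_in = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, e_in)  # in-sample error shrinks with degree, ~0 at degree 7
```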
can we always approximate f arbitrarily well when N -> infinite?
no!
> it is not guaranteed that f is in our hypothesis space
definition of PAC learnability
PAC learnability:
hypothesis set is said to be PAC learnable, if a learning algorithm exists that fulfills the following condition
> for all epsilon > 0 and delta in (0,1) there exists an m such that for any random training sample of size at least m it holds with probability at least 1 - delta: |out-of-sample error - in-sample error| < epsilon
what does PAC learnability imply for every finite hypothesis set?
every finite hypothesis set is PAC learnable
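a short sketch of the standard argument behind this fact, combining Hoeffding's inequality with a union bound over the finitely many hypotheses in H:

```latex
% Union bound + Hoeffding over a finite hypothesis set H:
\[
  \Pr\Big[\exists\, h \in H : \big|E_{\mathrm{out}}(h) - E_{\mathrm{in}}(h)\big| > \epsilon\Big]
  \;\le\; 2\,|H|\, e^{-2\epsilon^2 m}
\]
% Setting the right-hand side equal to delta and solving for m gives a
% sufficient sample size, so the PAC condition is met for every finite H:
\[
  m \;\ge\; \frac{1}{2\epsilon^2} \ln\frac{2\,|H|}{\delta}
\]
```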
how to choose complexity of hypothesis set?
> we want to decrease the out-of-sample error, using two objectives:
- minimize the in-sample error
- minimize the difference between in-sample and out-of-sample error
>>> the in-sample error decreases with the complexity of H, BUT
> too high complexity leads to overfitting and a high out-of-sample error
> too low complexity leads to bias and a high out-of-sample error
(called bias-variance-tradeoff)
generally: the more training data, the more complex H can be