ML4QS Chapter 6 Flashcards (23 cards)

what are the three main ingredients of machine learning?

1. a task 

2. historical data/experience

3. performance measure


what is the definition of machine learning?

a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell)


what do we actually learn in (supervised) machine learning?

we learn a functional relationship (unknown target function f) that maps observations to target values

> think of the unknown target function as the underlying generator of the data we observe


what is the "unknown conditional target distribution"?

unknown conditional target distribution: p(y|x)

> used to account for noise in the target

> describes the probability distribution of y assuming that x is fixed

> workaround to avoid deterministic function: calculate f(x) and add noise


what is the "unknown input distribution"?

unknown input distribution: p(x)

> describes how observations are distributed


what aspect becomes important when plenty of observed data is available?

> checking for unbalanced data

>>> draw a stratified sample from the data so that each class/label is equally likely
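A minimal stdlib-Python sketch of drawing such a stratified sample (the function name and the toy data are my own illustration, not from the book):

```python
import random
from collections import defaultdict

def stratified_sample(observations, labels, per_class, seed=0):
    """Draw the same number of random examples from each class,
    so every label is equally likely in the resulting sample."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(observations, labels):
        by_label[y].append(x)
    sample = []
    for label, xs in by_label.items():
        for x in rng.sample(xs, per_class):
            sample.append((x, label))
    return sample

# imbalanced toy data: 8 negatives, 2 positives
X = list(range(10))
y = [0] * 8 + [1] * 2
balanced = stratified_sample(X, y, per_class=2)
counts = {0: 0, 1: 0}
for _, label in balanced:
    counts[label] += 1
print(counts)  # {0: 2, 1: 2}
```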


what is the relationship between the loss and the risk functional?

> what about empirical risk?

loss: e(f(x), h(x)), a pairwise error measure between the target value and the estimated value

risk functional: E(f,h)

> the loss weighted by the input distribution p(x), integrated over all x in X

>>> f is unknown, so E(f,h) cannot be computed directly

therefore estimate it:

empirical risk: the prediction error averaged over all observed pairs y, h(x)
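A small Python sketch of the estimate (squared loss chosen as an example here; the book's e(·,·) can be any pairwise loss):

```python
def empirical_risk(ys, preds, loss=lambda y, h: (y - h) ** 2):
    """Average the pairwise loss over the observed sample: an
    estimate of the risk functional E(f, h), which we cannot
    compute directly because f and p(x) are unknown."""
    return sum(loss(y, h) for y, h in zip(ys, preds)) / len(ys)

# squared loss over three observation/prediction pairs
print(empirical_risk([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # ≈ 0.4167
```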


confusion matrix: explain 

> precision

> recall

precision: true positives / (true positives + false positives)

> if we predict true, how accurate are we

recall: true positives / (true positives + false negatives)

> how many of the true cases do we identify
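In code (the confusion-matrix counts below are illustrative):

```python
def precision_recall(tp, fp, fn):
    """precision: of everything predicted positive, the share that is right
    recall: of all actual positives, the share we identify"""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # 0.8 and ≈ 0.667
```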


how to combine precision and recall?

we want to perform well on both precision and recall

> combine them: the F-measure

F = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)

> beta: how precision and recall are weighted
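A direct transcription of the formula (the precision/recall values below are illustrative):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall;
    beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.5))  # F1 ≈ 0.615
```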


what is the hypothesis set H?

H is a set of functions containing all potential hypotheses h that we consider candidates to fit the unknown target function well

> often: H is an infinite set of functions

example: regression - H is set of all linear functions on the input space X


why do we need to restrict the hypothesis space H?

fewer restrictions mean more potential candidates f^ to choose from, making selection harder

> Mitchell: without restricting the hypothesis space, learning as a selection of f^ from H is not possible


how to minimize in-sample error?

using gradient descent


how to implement gradient descent?

the gradient of a multiparameter real-valued function is the vector of the partial derivatives of that function

> it points in the direction of steepest ascent

> implement an update rule that moves in the opposite direction of the gradient by a certain learning rate

>>> find minimum in function
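A minimal sketch of the update rule on a one-dimensional example (learning rate and step count are arbitrary choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step opposite to the gradient (the direction
    of steepest ascent) to move toward a minimum."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# minimise f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0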


how does the in-sample error relate to the update rule?

Widrow-Hoff algorithm:

> the larger the in-sample error (hence the deviation of the hypothesis h(x) from the observed y), the greater the magnitude of the update

> also called the "least-mean-squares" training rule
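A sketch of the least-mean-squares rule for a simple linear hypothesis h(x) = w*x + b (learning rate, epochs, and the noise-free toy data are my own choices):

```python
def lms_train(xs, ys, lr=0.05, epochs=200):
    """Widrow-Hoff / least-mean-squares rule: each update is
    proportional to the error y - h(x), so larger in-sample
    errors produce larger weight updates."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w * x + b)   # deviation of h(x) from observed y
            w += lr * err * x
            b += lr * err
    return w, b

# data generated by y = 2x + 1 (noise-free for the sketch)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]
w, b = lms_train(xs, ys)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```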


why do we need to choose a specific and limited hypothesis set before running the learning algorithm?

if we defined the hypothesis set to contain all functions mapping X to Y, learning would not work

> we could construct an infinite number of functions that yield an in-sample error of 0, but in general those would overfit the unknown target function


what does the ROC curve describe?

> how does the curve look for random classifier?

> how for good classifier?

ROC curve:

> describes the trade-off between the fraction of true positives out of all positives versus the fraction of false positives out of all negatives


true positive rate (TPR) vs false positive rate (FPR)

> random classifier: diagonal

> good classifier: any curve above the diagonal, bending toward the top-left corner
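A small sketch computing (FPR, TPR) pairs for a list of thresholds (scores, labels, and thresholds below are illustrative):

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, predict 1 when score >= threshold and
    report (FPR, TPR): false positives out of all negatives vs
    true positives out of all positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
pts = roc_points(scores, labels, [0.5, 0.2])
print(pts)  # [(0.0, ≈0.667), (1.0, 1.0)]
```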


what is theta_cut?

> how to select the right value?

theta_cut: for binary classification, the cutoff score that decides above which threshold the model predicts 1

> influences the ratio of mistakes

> in order to select theta_cut, we need to quantify the cost of false positives and false negatives

> given a cost matrix we can calculate the expected cost for all values of theta_cut >>> pick the value that minimizes expected cost
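A sketch of that selection, assuming a simplified cost matrix in which only false positives and false negatives carry cost (all numbers below are illustrative):

```python
def best_cutoff(scores, labels, thresholds, cost_fp=1.0, cost_fn=5.0):
    """Pick the theta_cut that minimises total cost on the data:
    each false positive costs cost_fp, each false negative cost_fn."""
    def cost(t):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        return cost_fp * fp + cost_fn * fn
    return min(thresholds, key=cost)

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
theta = best_cutoff(scores, labels, [0.1, 0.35, 0.6, 0.85])
print(theta)  # 0.35
```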


what is regularization? 

why do we need it?

regularization: a penalty term that regulates how many features will be included in the model

> models with a higher number of features are penalized

> why we need it: if the number of features p exceeds the number of observations N, empirical risk minimization does not have a unique solution
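A sketch of a regularized empirical risk (an L2/ridge-style penalty is chosen here as one example; lambda and the numbers are illustrative):

```python
def regularized_risk(ys, preds, weights, lam=0.1):
    """Empirical risk (mean squared error) plus an L2 penalty on
    the model weights: larger / more numerous coefficients are
    penalised, which restores a unique minimiser even when p > N."""
    mse = sum((y - h) ** 2 for y, h in zip(ys, preds)) / len(ys)
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty

# perfect fit, but a large weight still incurs a penalty
print(regularized_risk([1.0, 2.0], [1.0, 2.0], weights=[2.0, 0.0], lam=0.5))  # 2.0
```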


why do the order of higher order polynomials in H need to be restricted?

when we increase the degree of the multivariate polynomials in H, the hypotheses become more powerful at capturing complex relations >>> the in-sample error decreases

> at degree >= N-1, the in-sample error will be 0 (a degree N-1 polynomial can interpolate all N training points)

>>> overfitting


can we always approximate f arbitrarily well when N -> infinity?

no:

> it is not guaranteed that f is in our hypothesis space


definition of PAC learnability

PAC learnability:

hypothesis set is said to be PAC learnable, if a learning algorithm exists that fulfills the following condition

> for all epsilon > 0 and delta in (0,1) there exists an m so that, for a random training sample of size at least m, it holds with probability at least 1-delta that |out-sample error - in-sample error| < epsilon


what does PAC learnability imply for every finite hypothesis set?

every finite hypothesis set is PAC learnable


how to choose complexity of hypothesis set?

> we want to decrease the out-sample error, using two objectives:

1. minimize the in-sample error

2. minimize the difference between in-sample and out-sample error

>>> the in-sample error decreases with the complexity of H, BUT

> too high complexity leads to overfitting >>> high out-sample error

> too low complexity leads to bias >>> high out-sample error

(called the bias-variance tradeoff)

generally: the more training data, the more complex H can be