Chapter 6 Flashcards
(23 cards)
what are the three main ingredients of machine learning?
- a task
- historical data/experience
- performance measure
what is the definition of machine learning?
a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
what do we actually learn in (supervised) machine learning?
we learn a functional relationship (unknown target function f) that maps observations to target values
> think of the unknown target function as the underlying generator of the data we observe
what is the “unknown conditional target distribution”?
unknown conditional target distribution: p(y|x)
> used to account for noise in the target
> describes the probability distribution of y assuming that x is fixed
> workaround to avoid a purely deterministic function: compute f(x) and add noise, i.e. y = f(x) + noise
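a minimal sketch of this noise model; the sine target, the input range, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)  # stand-in for the unknown target function

x = rng.uniform(0, 2 * np.pi, size=100)  # draws from the input distribution p(x)
y = f(x) + rng.normal(0, 0.1, size=100)  # y = f(x) + noise realizes p(y|x)
```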
what is the “unknown input distribution”?
unknown input distribution: p(x)
> describes how observations are distributed
what aspect becomes important when plenty of observed data is available?
> checking for unbalanced data
>>> draw stratified sample from data so that each class/label is equally likely
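a sketch of drawing such a balanced sample with numpy; the helper name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsample(X, y):
    """Subsample so that each class/label appears equally often."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()  # size of the smallest class
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]
```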
what is the relationship between the loss and the risk functional?
> what about empirical risk?
loss: e(f(x), h(x)), a pairwise error measure between the target value and the estimated value
risk functional: E(f,h)
> the loss weighted by p(x) and integrated over all of X: E(f,h) = integral of e(f(x), h(x)) * p(x) dx
>>> f(x) and p(x) are unknown, so E(f,h) is impossible to calculate
therefore estimate:
empirical risk: the prediction error e(y_n, h(x_n)) averaged over all N training examples
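a minimal sketch of the empirical risk, with squared loss as the pairwise error measure (the function name is illustrative):

```python
import numpy as np

def empirical_risk(y, y_hat, loss=lambda a, b: (a - b) ** 2):
    """Average the pairwise loss e(y_n, h(x_n)) over all N training examples."""
    return np.mean([loss(a, b) for a, b in zip(y, y_hat)])

# usage: E_in = empirical_risk(y, h(x)) for some hypothesis h
```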
confusion matrix: explain
> precision
> recall
precision: true positives / (true positives + false positives)
> if we predict positive, how accurate are we
recall: true positives / (true positives + false negatives)
> how many of the actual positive cases do we identify
how to combine precision and recall?
we want to perform well on both precision and recall
> combine: F-measure
F = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
> beta: controls how precision and recall are weighted
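a sketch computing all three measures from confusion-matrix counts (the counts in the example call are made up):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# beta > 1 weights recall higher, beta < 1 weights precision higher
print(precision_recall_f(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)
```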
what is the hypothesis set H?
H is a set of functions that contains all potential hypotheses h we consider to be candidates to fit the unknown target function well
> often: H is an infinite set of functions
example: regression - H is set of all linear functions on the input space X
why do we need to restrict the hypothesis space H?
fewer restrictions mean more potential candidates f^ to choose from, making the selection harder
> Mitchell: without restricting the hypothesis space, learning as a selection of f^ from H is not possible
how to minimize in-sample error?
using gradient descent
how to implement gradient descent?
the gradient of a multiparameter real-valued function is the vector of the partial derivatives of that function
> it points in the direction of steepest ascent
> implement an update rule that steps in the opposite direction of the gradient, scaled by a certain learning rate (see the sketch below)
>>> find minimum in function
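a minimal sketch of the update rule, assuming the gradient is available as a function; the toy objective ||w||^2 with gradient 2w is an illustrative assumption:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad(w)  # opposite direction of the gradient
    return w

# usage on f(w) = ||w||^2, whose gradient is 2w; converges toward the minimum at 0
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```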
how does the in-sample error relate to the update rule?
widrow hoff algorithm:
> the larger the in-sample error (hence the deviation of the hypothesis h(x) from the observed y), the greater the magnitude of the update
> also called the “least-mean-squares” (LMS) training rule
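a sketch of one pass of the Widrow-Hoff rule, assuming a linear hypothesis h(x) = w · x and squared loss (the function name is illustrative):

```python
import numpy as np

def lms_epoch(X, y, w, eta=0.01):
    """One pass of the least-mean-squares rule for a linear hypothesis."""
    for x_n, y_n in zip(X, y):
        error = y_n - w @ x_n      # larger deviation -> larger update magnitude
        w = w + eta * error * x_n  # step toward reducing the squared error
    return w
```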
why do we need to choose a specific and limited hypothesis set before running the learning algorithm?
if we defined the hypothesis set to contain the set of all functions mapping X to Y, then learning would not work
> we could construct an infinite number of functions that yield an in-sample error of 0, but in general these would overfit the unknown target function
what does the ROC curve describe?
> how does the curve look for random classifier?
> how for good classifier?
ROC curve:
> describes the trade-off between the fraction of true positives out of all actual positives versus the fraction of false positives out of all actual negatives
aka
true positive rate (TPR) vs false positive rate (FPR)
> random classifier: diagonal
> good classifier: any curve above the diagonal, toward the top-left corner
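a sketch that traces the ROC curve by sweeping the decision threshold over all observed scores (numpy arrays of scores and 0/1 labels assumed):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = (labels == 1).sum()
    neg = (labels == 0).sum()
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / pos  # true positives / all positives
        fpr = (pred & (labels == 0)).sum() / neg  # false positives / all negatives
        points.append((fpr, tpr))
    return points
```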
what is theta_cut?
> how to select the right value?
theta_cut: for binary classification, the cutoff score that decides from which threshold value on the model predicts 1
> influences the ratio between the two kinds of mistakes
> in order to select theta_cut, we need to quantify the cost of false positives and false negatives
> given a cost matrix we can calculate the expected cost for all values of theta_cut >>> pick the value that minimizes the expected cost
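a sketch that picks theta_cut by minimizing expected cost; the per-mistake costs are made-up placeholders standing in for a real cost matrix:

```python
import numpy as np

def best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Return the cutoff with the lowest expected cost per observation."""
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        fp = (pred & (labels == 0)).sum()
        fn = (~pred & (labels == 1)).sum()
        cost = (cost_fp * fp + cost_fn * fn) / len(labels)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```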
what is regularization?
why do we need it?
regularization: a penalty term, controlled by a parameter, that regulates how many features will be included in the model
> models with a higher number of features are penalized
> why do we need it: if the number of features p exceeds the number of observations N, empirical risk minimization does not have a unique solution
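a sketch of one concrete form of regularization, L2 (ridge) regression; whether the chapter uses this variant is an assumption, but it shows why the penalty restores a unique solution when p > N:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2."""
    p = X.shape[1]
    # lam > 0 makes X^T X + lam * I invertible, so the minimizer is unique
    # even when the number of features p exceeds the number of observations N
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```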
why does the degree of higher-order polynomials in H need to be restricted?
when increasing the degree of multivariate polynomials in H, the hypotheses become more powerful at capturing complex relations >>> the in-sample error decreases
> at degree >= N-1, the in-sample error will be 0, since a polynomial of degree N-1 can interpolate all N training points
>>> overfitting
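a sketch demonstrating the effect on N = 8 hypothetical noisy points; at degree N-1 = 7 the polynomial interpolates them and the in-sample error hits (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=8)                   # N = 8 observations
y = np.sin(3 * x) + rng.normal(0, 0.1, size=8)   # illustrative noisy target

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)
    e_in = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, e_in)  # in-sample error shrinks with degree, ~0 at degree 7
```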
can we always approximate f arbitrarily well when N -> infinite?
no!
> it is not guaranteed that f is in our hypothesis space
definition of PAC learnability
PAC learnability:
hypothesis set is said to be PAC learnable, if a learning algorithm exists that fulfills the following condition
> for all epsilon > 0 and delta in (0,1) there exists an m such that for any random training sample of size at least m it holds with probability at least 1 - delta: |out-of-sample error - in-sample error| < epsilon
what does PAC learnability imply for every finite hypothesis set?
every finite hypothesis set is PAC learnable
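a short sketch of the standard argument behind this fact, combining Hoeffding's inequality with a union bound over the finitely many hypotheses in H:

```latex
% Union bound + Hoeffding over a finite hypothesis set H:
\[
  \Pr\Big[\exists\, h \in H : \big|E_{\mathrm{out}}(h) - E_{\mathrm{in}}(h)\big| > \epsilon\Big]
  \;\le\; 2\,|H|\, e^{-2\epsilon^2 m}
\]
% Setting the right-hand side equal to delta and solving for m gives a
% sufficient sample size, so the PAC condition is met for every finite H:
\[
  m \;\ge\; \frac{1}{2\epsilon^2} \ln\frac{2\,|H|}{\delta}
\]
```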
how to choose complexity of hypothesis set?
> we want to decrease the out-of-sample error, using two objectives:
- minimize the in-sample error
- minimize the difference between in-sample and out-of-sample error
>>> the in-sample error decreases with the complexity of H, BUT
> too high complexity leads to overfitting and a high out-of-sample error
> too low complexity leads to bias and a high out-of-sample error
(called bias-variance-tradeoff)
generally: the more training data, the more complex H can be