Linear Models Flashcards

1
Q

Linear predictors/models

A

Linear functions: Ld = { hw,b : w in R^d, b in R }

Linear predictor: hw,b(x) = <w,x> + b = ( SUM wi xi ) + b
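
A minimal Python/numpy sketch of evaluating such a predictor (the values of w, b and x below are made up for illustration):

import numpy as np

# Hypothetical parameters w, b and one instance x in R^3
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 3.0])

# hw,b(x) = <w,x> + b
prediction = np.dot(w, x) + b
print(prediction)  # 6.6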

2
Q

Linear Regression: definitions, matrix form, derivation best predictor, use of generalized inverse

A

Hypothesis class : Hreg = Ld = { x –> <w,x> + b : w in R^d, b in R}

Commonly used loss function: the squared loss: l(h,(x,y)) = (h(x) - y)^2

Empirical risk = training error = Mean squared error

Ls(h) = 1/m SUM (h(xi) - yi)^2

How do we find an ERM hypothesis? Least squares algorithm:

the algorithm that solves the ERM problem for the hypothesis class of linear regression predictors with respect to the squared loss

Best hypothesis: argmin_w Ls(hw) = argmin_w 1/m SUM (<w,xi> - yi)^2 (we want to find w!)

Equivalent formulation: find the w minimizing the RSS (residual sum of squares):

argmin_w SUM (<w,xi> - yi)^2

so we compute the gradient of the objective function with respect to w and set it equal to 0.

Setting the gradient to zero gives the normal equations XtXw = Xty. If XtX is invertible, the solution of our ERM problem is w = (XtX)^-1 Xty; if it is not, a generalized inverse (the Moore-Penrose pseudoinverse) of XtX can be used in its place.
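
A minimal Python/numpy sketch of both routes to w on a small made-up dataset (the bias b can be absorbed by appending a constant-1 feature to every xi); np.linalg.pinv computes the generalized (Moore-Penrose) inverse mentioned above:

import numpy as np

# Made-up data: m = 5 examples in R^2, stacked as the rows of X
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.5, 4.0, 7.0, 7.5])

# Normal equations: w = (X^T X)^-1 X^T y, valid when X^T X is invertible
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Generalized inverse: works even when X^T X is singular
w_pinv = np.linalg.pinv(X) @ y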

3
Q

Coefficient of determination

A

R^2, in regression, is a statistical measure of how well the regression predictions approximate the real data points. An R^2 of 1 indicates that the regression predictions fit the data perfectly.
It measures how well h performs against the best naive predictor (the constant predictor that always outputs the mean of y).

Obtained as R^2 = 1 - (residual sum of squares / total sum of squares).
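
A short Python/numpy sketch of this computation (the array names y_true and y_pred are illustrative):

import numpy as np

def r_squared(y_true, y_pred):
    # Residual sum of squares of the predictor h
    rss = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: residuals of the naive mean predictor
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss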

4
Q

Linear classification: perceptron

A

Used in binary classification problems, where h : R^d –> {-1,+1}

Hypothesis class of halfspaces: Hd = sign o Ld = {x –> sign(hw,b(x)) : hw,b in Ld}

The instances that are above the hyperplane are labeled positively, below negatively.

The commonly used loss function is the 0-1 loss.

How do we find a good hypothesis?
Good = one that minimizes the training error (ERM)

–> Perceptron algorithm: an algorithm that finds such a hypothesis, implementing the ERM rule when the data are linearly separable (see the sketch below)
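
A minimal Python/numpy sketch of the perceptron, assuming labels in {-1,+1} and the bias b absorbed into the features via a constant-1 coordinate (the max_iters cap is only a safeguard for non-separable data, not part of the algorithm itself):

import numpy as np

def perceptron(X, y, max_iters=1000):
    # Batch perceptron for homogeneous halfspaces; labels y in {-1, +1}.
    # Terminates with zero training error if the data are linearly separable.
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        # Indices of examples the current w misclassifies: yi * <w, xi> <= 0
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w  # ERM solution found
        i = mistakes[0]
        w = w + y[i] * X[i]  # perceptron update rule
    return w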

5
Q

VC-dimension of linear models

A

The VC-dimension of the class of homogeneous halfspaces in R^d (no bias term) is d; the VC-dimension of non-homogeneous halfspaces (with the bias term b) is d + 1.
6
Q

Logistic regression

A

Used to learn a function h from R^d to [0,1]; in binary classification tasks, h(x) is interpreted as the probability that the label of x is 1.

Hypothesis class: Hsig = sigmoid o Ld, where sigmoid : R –> [0,1] is the sigmoid function

The sigmoid is the S-shaped function where:

h(x) = 1 –> high confidence that the label is 1
h(x) = 0 –> high confidence that the label is -1
h(x) = 1/2 –> not confident about prediction

sigmoid(z) = 1 / (1+e^-z) = e^z / (1+e^z)

Hsig = sigmoid o Ld = { x –> sigmoid(<w,x>) : w in R^d }

Main difference from classification with halfspaces: when <w,x> ≈ 0
- the halfspace prediction is a hard 1 or -1
- sigmoid(<w,x>) ≈ 1/2 –> the uncertainty shows up in the predicted label
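
A tiny Python/numpy sketch of this difference, with made-up values of w and x chosen so that <w,x> is close to 0:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical w and an instance x lying almost on the decision boundary
w = np.array([1.0, -1.0])
x = np.array([0.51, 0.50])

z = np.dot(w, x)       # 0.01, nearly on the boundary
print(np.sign(z))      # halfspace: hard prediction +1
print(sigmoid(z))      # logistic: ~0.5025, i.e. very uncertain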

Loss function: we need to define how bad it is to predict hw(x) in [0,1] given that the true label is y = ±1

l(hw,(x,y)) = log(1 + exp(-y<w,x>))

We want hw(x) close to 1 when y = +1 and close to 0 when y = -1; equivalently, the loss is small when y<w,x> is large and positive and grows as y<w,x> becomes negative.

Therefore, given a training set S, the ERM problem for logistic regression is
argmin_w 1/m SUM log(1 + e^(-yi<w,xi>))

This ERM formulation is the same as the one arising from maximum likelihood estimation (MLE).
MLE is a statistical approach for finding the parameters that maximize the joint probability of a given dataset, assuming a specific parametric probability function.
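
This ERM problem has no closed-form solution, but the objective is convex, so it can be minimized with gradient descent. A minimal Python/numpy sketch, assuming labels in {-1,+1} (the step size lr and the number of steps are illustrative choices):

import numpy as np

def logistic_erm(X, y, lr=0.1, n_steps=1000):
    # Gradient descent on (1/m) SUM log(1 + exp(-yi <w, xi>)), with y in {-1, +1}
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        margins = y * (X @ w)
        # Gradient of the empirical risk with respect to w
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / m
        w -= lr * grad
    return w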
