L2 Linear Regression Flashcards
(6 cards)
Linear Regression, definition
Predict ŷ ∈ R (label, response) from x ∈ R^d (features, covariates)
Least squares model: ŷ = w_1^⊤ x + w_2 (bias), where w_1 ∈ R^d and w_2 ∈ R (that is, w ∈ R^{d+1})
Learning: choose (w_1, w_2) based on data (x^{(i)}, y^{(i)})_{i=1}^N.
Prediction: given x, predict ŷ = w_1^⊤ x + w_2.
- Closed form solution
- Gaussian probability model
- Ideal for regression, often not well suited for classification
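A minimal numpy sketch of this prediction rule and the bias-augmentation trick behind w ∈ R^{d+1} (the dimensions and variable names are illustrative, not from the card):

```python
import numpy as np

# Illustrative sizes (not from the card): N = 5 samples, d = 3 features.
X_raw = np.random.randn(5, 3)        # rows are the x^{(i)} in R^d
w1, w2 = np.random.randn(3), 0.5     # weights w_1 in R^d and bias w_2

# Prediction rule: y_hat = w_1^T x + w_2, applied to every sample at once.
y_hat = X_raw @ w1 + w2

# Equivalent augmented form: append a constant 1 feature so that w in R^{d+1}.
X = np.hstack([X_raw, np.ones((5, 1))])
w = np.append(w1, w2)
assert np.allclose(X @ w, y_hat)
```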
Linear Regression, learning
arg min_{w_1 ∈ R^d, w_2 ∈ R} 1/N Σ^N_{i=1} 1/2 (w_1^⊤ x^{(i)} + w_2 − y^{(i)})^2
Simplification: arg min_{w ∈ R^{d+1}} 1/2 ||Xw − y||^2_2, where the rows of X are (x^{(i)⊤}, 1) and y stacks the labels (dropping the 1/N factor does not change the minimizer)
Solving (setting the gradient to zero) gives the OLS (ordinary least squares) estimator ŵ = (X^⊤ X)^{−1} X^⊤ y (when the inverse exists)
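A short numpy sketch of the closed-form OLS fit on synthetic data (the data-generation step and all variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X_raw = rng.normal(size=(N, d))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X_raw @ w_true + b_true + 0.1 * rng.normal(size=N)   # noisy linear labels

# Augmented design matrix: the bias becomes the last component of w.
X = np.hstack([X_raw, np.ones((N, 1))])

# OLS estimator w_hat = (X^T X)^{-1} X^T y.
# Solving the normal equations is preferred to forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # should be close to [1.0, -2.0, 0.5, 0.3]
```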
Linear Regression, problems/solutions
Problem: the OLS formula cannot be used when X^⊤ X is not invertible; this happens in particular when N < d + 1 (fewer samples than parameters)
- Pseudoinverse: ŵ = (X^⊤ X)^† X^⊤ y = X^† y, which still satisfies the “derivative condition”, i.e. (X^⊤ X)ŵ = X^⊤ y
- Ridge regression (regularisation, ensuring X^⊤ X + λI has no zero eigenvalues): arg min_{w ∈ R^{d+1}} 1/2 ∥Xw − y∥^2_2 + λ/2 ∥w∥^2_2, giving w̃ = (X^⊤ X + λI)^{−1} X^⊤ y
As λ → ∞, the components of w̃ shrink to 0 (see the sketch of both fixes below).
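A numpy sketch of both fixes, assuming an augmented design matrix X and label vector y as on the previous card (the λ value and the under-determined example are arbitrary choices for illustration):

```python
import numpy as np

def ols_pinv(X, y):
    # Pseudoinverse solution X^+ y; defined even when X^T X is singular (e.g. N < d+1).
    return np.linalg.pinv(X) @ y

def ridge(X, y, lam):
    # Ridge estimator (X^T X + lambda I)^{-1} X^T y; lambda > 0 makes the matrix invertible.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Under-determined example: fewer samples than parameters (N < d + 1).
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(4, 6)), np.ones((4, 1))])
y = rng.normal(size=4)
print(ols_pinv(X, y))      # minimum-norm least squares solution
print(ridge(X, y, 1e-2))   # regularised solution; shrinks toward 0 as lam grows
```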
Linear Regression, justifications/interpretations
- Geometric interpretation: the residual Xŵ − y is orthogonal to span(z_1, ..., z_{d+1}), the columns of X, because X^⊤(Xŵ − y) = 0.
- Probabilistic model: y | x ~ N(w^⊤ x, σ^2); solving least squares = maximizing the likelihood (see the derivation sketch after this list)
- Loss minimization (ERM)
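A sketch of the standard Gaussian-likelihood argument behind the probabilistic interpretation (the algebra is not spelled out on the card):

```latex
% Negative log-likelihood under y^{(i)} | x^{(i)} ~ N(w^T x^{(i)}, sigma^2), i.i.d.:
\[
  -\log \prod_{i=1}^{N} p\!\left(y^{(i)} \mid x^{(i)}, w\right)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y^{(i)} - w^\top x^{(i)}\right)^2
    + \frac{N}{2}\log\!\left(2\pi\sigma^2\right)
\]
% The second term is constant in w, so maximizing the likelihood is exactly
% minimizing the least squares objective from the "learning" card.
```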
Empirical Risk Minimization
l_ls(y, ŷ) = 1/2 (y − ŷ)^2 is the least squares loss
ERM: arg min_f 1/N Σ^N_{i=1} l(y^{(i)}, f(x^{(i)}))
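A small sketch of the ERM objective with a pluggable loss (the helper names squared_loss and empirical_risk are made up for illustration):

```python
import numpy as np

def squared_loss(y, y_hat):
    # l_ls(y, y_hat) = 1/2 (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

def empirical_risk(loss, f, X_raw, y):
    # (1/N) sum_i l(y^{(i)}, f(x^{(i)}))
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X_raw, y)])

# Example: empirical risk of a fixed (untrained) linear predictor.
rng = np.random.default_rng(2)
X_raw, y = rng.normal(size=(10, 3)), rng.normal(size=10)
w1, w2 = np.zeros(3), 0.0
print(empirical_risk(squared_loss, lambda x: w1 @ x + w2, X_raw, y))
```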
Least squares classification
Suppose y ∈ {−1, +1} (i.e. binary classification) with classification error loss 1[y ≠ ŷ] ≈ 1[yŷ ≤ 0]
Strategy: choose w to minimize the least squares loss l_ls(y, ŷ) = (y − ŷ)^2/2.
If y ∈ {−1, +1}, then (y − ŷ)^2/2 = y^2 (1 − yŷ)^2/2 = (1 − yŷ)^2/2, since y^2 = 1.
Predict sgn(ŵ^⊤ x) ∈ {−1, +1}
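A numpy sketch of least squares classification on synthetic ±1 labels (the data-generating rule is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 2
X_raw = rng.normal(size=(N, d))
y = np.where(X_raw[:, 0] + 0.5 * X_raw[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

# Fit ordinary least squares directly on the +/-1 labels (with a bias column).
X = np.hstack([X_raw, np.ones((N, 1))])
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Classify with the sign of the linear score.
y_pred = np.sign(X @ w_hat)
print("training accuracy:", np.mean(y_pred == y))
```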