Regularization and Feature Selection Flashcards

1
Q

Regularized Loss Minimization

A

Another learning paradigm: pick the hypothesis h defined by a vector w, using a regularization function R: R^d -> R.

A(S) = argmin_w ( Ls(w) + R(w) )

R(w) is a measure of the complexity of the hypothesis h defined by w; the rule balances low empirical risk against a less complex hypothesis.
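As a minimal illustration (not part of the original card): with the squared loss as Ls(w) and R(w) = lambda ||w||^2, the regularized objective can be written directly in Python. All names below are illustrative assumptions, not a prescribed implementation.

# Minimal sketch of a regularized loss minimization objective, assuming the
# squared loss for Ls and R(w) = lam * ||w||^2 (both choices are illustrative).
import numpy as np

def regularized_objective(w, X, y, lam):
    empirical_risk = np.mean((X @ w - y) ** 2)   # Ls(w): average squared loss
    complexity = lam * np.sum(w ** 2)            # R(w): penalizes complex w
    return empirical_risk + complexity           # Ls(w) + R(w)

# A(S) = argmin_w (Ls(w) + R(w)) can then be approximated with any convex
# optimizer, e.g. scipy.optimize.minimize(regularized_objective, w0, args=(X, y, lam)).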

2
Q

L1 regularization, LASSO

A

Regularization function: R(w) = lambda ||w||_1

l1 norm: ||w||_1 = SUM_i |w_i|

The learning rule is:

A(S) = argmin_w ( Ls(w) + lambda ||w||_1 )

  • lambda regulates the tradeoff between the empirical risk Ls(w) (fitting the data, hence the risk of overfitting) and the complexity ||w||_1 of the model we pick
  • ||w||_1 measures the complexity of the hypothesis defined by w

LASSO: linear regression with squared loss + l1 regularization

w = argmin_w ( lambda ||w||_1 + SUM_i (<w, x_i> - y_i)^2 )

The l1 norm is convex and the squared loss is convex -> the problem can be solved efficiently
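A minimal LASSO sketch using scikit-learn (an assumption: the card does not name a specific solver). Note that sklearn's Lasso minimizes (1/(2m)) ||Xw - y||^2 + alpha ||w||_1, so its alpha corresponds to lambda only up to that 1/(2m) scaling of the squared loss.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # m=100 examples, d=20 features
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.5]                  # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1)                       # alpha plays the role of lambda
model.fit(X, y)
print(model.coef_)                             # l1 drives many coefficients exactly to 0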

3
Q

Tikhonov Regularization, RIDGE Regression, derivation of optimal solution for Ridge Regression

A

Regularization function: R(w) = lambda ||w||^2

squared l2 norm: ||w||^2 = SUM_i w_i^2

The learning rule is:

A(S) = argmin_w ( Ls(w) + lambda ||w||^2 )

  • lambda regulates the tradeoff between the empirical risk Ls(w) (fitting the data, hence the risk of overfitting) and the complexity ||w||^2 of the model we pick
  • ||w||^2 measures the complexity of the hypothesis defined by w

RIDGE: linear regression with squared loss + Tikhonov regularization

w = argmin_w ( lambda ||w||^2 + SUM_i (<w, x_i> - y_i)^2 )

Derivation of the optimal solution: the objective is convex and differentiable, so set its gradient to zero: 2 lambda w + 2 SUM_i (<w, x_i> - y_i) x_i = 0, i.e. (lambda I + A) w = b, with A = SUM_i x_i x_i^T and b = SUM_i y_i x_i. Since A is positive semidefinite and lambda > 0, lambda I + A is invertible, so w = (lambda I + A)^(-1) b.
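A minimal numpy sketch of this closed-form solution, assuming the unnormalized objective on this card (no 1/m in front of the sum); all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # m=100 examples, rows are x_i^T
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

lam = 0.5
A = X.T @ X                                      # A = SUM_i x_i x_i^T
b = X.T @ y                                      # b = SUM_i y_i x_i
w = np.linalg.solve(lam * np.eye(X.shape[1]) + A, b)   # solves (lambda I + A) w = b
print(w)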

4
Q

Feature selection: subset selection, forward selection, backward selection, without and with validation data

A

Feature selection: the task of selecting a small number of features from a large pool, to be used by the predictor. This helps prevent overfitting and makes the predictor faster to evaluate.

Its goal is to learn a predictor using only k << d features:
select the k features that minimize the empirical risk.

5
Q

Subset selection

A

Subset selection: selects the set of k features that minimizes the training error.

Pseudocode:

Let
- I = {1, ..., d}
- given P = {i_1, ..., i_k} subset of I : H_P = hypotheses/models that use only the features i_1, ..., i_k

P(k) <- { J subset of I : |J| = k }

foreach P in P(k) do
- h_P <- argmin_{h in H_P} Ls(h)
return h(k) <- argmin_{P in P(k)} Ls(h_P)
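A minimal sketch of exhaustive subset selection for linear regression with squared loss (an assumption: the card does not fix a specific model class). It evaluates all (d choose k) subsets, so it is only feasible for small d and k; training_error is a helper defined here purely for illustration.

from itertools import combinations
import numpy as np

def training_error(X, y, cols):
    Xs = X[:, list(cols)]
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # h_P = argmin_{h in H_P} Ls(h)
    return np.mean((Xs @ w - y) ** 2)            # Ls(h_P)

def subset_selection(X, y, k):
    # h(k) = the size-k subset P with the lowest training error Ls(h_P)
    return min(combinations(range(X.shape[1]), k),
               key=lambda cols: training_error(X, y, cols))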

6
Q

Forward selection

A

Greedy algorithm: start from the empty solution and add one feature at a time, until the solution has cardinality k.

sol <- empty set
while |sol| < k do
- foreach i in I \ sol do
- - P <- sol union {i}
- - h_P <- argmin_{h in H_P} Ls(h)
- sol <- sol union { argmin_{i in I \ sol} Ls(h_{sol union {i}}) }
return sol
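A minimal greedy forward-selection sketch for linear regression with squared loss, reusing the hypothetical training_error helper from the subset selection sketch above.

def forward_selection(X, y, k):
    sol = []                                      # start from the empty solution
    remaining = set(range(X.shape[1]))
    while len(sol) < k:
        # add the feature whose inclusion gives the lowest training error
        best = min(remaining, key=lambda i: training_error(X, y, sol + [i]))
        sol.append(best)
        remaining.remove(best)
    return sol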

7
Q

Backward selection

A

Greedy algorithm: start from the solution that includes all features and remove one feature at a time, until the solution has cardinality k.

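A minimal greedy backward-selection sketch for linear regression with squared loss, again reusing the hypothetical training_error helper defined above.

def backward_selection(X, y, k):
    sol = list(range(X.shape[1]))                 # start from all features
    while len(sol) > k:
        # remove the feature whose removal gives the lowest training error
        worst = min(sol, key=lambda i: training_error(X, y, [j for j in sol if j != i]))
        sol.remove(worst)
    return sol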
Subset selection with validation data (train each h_P on the training set, use the validation set to pick both the subset and its size):

for l <- 0 to k do
- P(l) <- { J subset of I : |J| = l }
- foreach P in P(l) do
- - h_P <- argmin_{h in H_P} Ls(h)
- h(l) <- argmin_{P in P(l)} Lv(h_P)
return argmin_{l in {0,...,k}} Lv(h(l))

Forward selection with validation data (train with Ls, choose which feature to add with Lv):

sol <- empty set
while |sol| < k do
- foreach i in I \ sol do
- - P <- sol union {i}
- - h_P <- argmin_{h in H_P} Ls(h)
- sol <- sol union { argmin_{i in I \ sol} Lv(h_{sol union {i}}) }
return sol
