Stepwise Selection Flashcards

1
Q

Definition

A

A statistical method that simplifies a model by retaining only those predictors that have predictive power

Best subset selection is performed by fitting all p models that contain exactly one predictor (where p is the total number of predictors being considered) and picking the one with the smallest deviance, then fitting all p-choose-2 models that contain exactly two predictors and picking the one with the lowest deviance, and so forth. A single best model is then selected from the size-by-size winners using a metric such as AIC. In general, 2^p models are fit, which can be quite a large search space as p increases.
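As a rough illustration, here is how the exhaustive search could be run in R with the leaps package; the data frame dat and response y are hypothetical placeholders.

# Best subset selection via leaps::regsubsets (exhaustive search).
# Assumes a hypothetical data frame `dat` whose other columns are all
# candidate predictors for the response `y`.
library(leaps)

best <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1,
                   method = "exhaustive")
smry <- summary(best)

# summary() reports one winning model per size; pick the size whose
# BIC is smallest, then inspect which predictors that model uses.
best_size <- which.min(smry$bic)
smry$outmat[best_size, ]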

2
Q

Notes - Information Criteria

A

Information criteria are U-shaped as a function of flexibility; lower values are preferred. They weigh the value added by predictors against the number of predictors.
p = # predictors in the model; n = # training observations

Akaike Information Criterion or AIC: SSE* + 2p
-The penalty term grows by 2 for every additional predictor, so AIC experiences a net decrease only when its first term decreases by more than 2.
-Since SSE* plays the role of -2 × loglikelihood, adding a variable requires an increase in the loglikelihood of more than one per parameter added.

Bayesian Information Criterion or BIC: SSE* + ln(n)p
-For reasonably sized training sets (n ≥ 8), ln(n) exceeds 2, so BIC penalizes each added predictor more heavily than AIC.
-For every additional predictor added to the model, BIC experiences a net decrease only when its first term decreases by more than ln(n).
-Equivalently, the required per-parameter increase in the loglikelihood is ln(n)/2.
-BIC is expected to reach its minimum at a smaller p than AIC; models with fewer predictors are favored when using BIC.

SSE* (which can be referred to as the deviance) plays the same role as the SSE: it measures training error, and it never increases as p increases.

Want lowest AIC/BIC!!
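A minimal sketch of computing these criteria in R, assuming a hypothetical data frame dat with response y and predictors x1, x2:

# AIC and BIC for a fitted linear model; lower is better.
fit <- lm(y ~ x1 + x2, data = dat)

AIC(fit)                      # deviance-type first term + 2 per parameter
BIC(fit)                      # same first term, but ln(n) per parameter
AIC(fit, k = log(nobs(fit)))  # setting k = ln(n) makes AIC() match BIC()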

3
Q

Model Performance

A

Want lowest AIC/BIC

Compare to other models using test RMSE or another performance metric.
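A sketch of the test-RMSE comparison in R, assuming hypothetical train and test data frames with response y:

# Fit on the training set, predict on the holdout set, compute test RMSE.
fit <- lm(y ~ x1 + x2, data = train)
pred <- predict(fit, newdata = test)
sqrt(mean((test$y - pred)^2))   # test RMSE; lower is better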

4
Q

Notes

A

Obeys the hierarchical principle: an interaction term can be in the model only if its component main effects are also in the model.

Factors:
–By default, stepwise selection does not treat a factor's dummy variables as separate predictors; the factor is added or dropped as a whole.
–Use binarization so that stepwise selection views the dummy variables as unrelated, standalone predictors that can be added/dropped individually (see the sketch below).

drop1() reproduces the first round of backward selection with AIC.
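A sketch in R, assuming a hypothetical data frame dat with response y, numeric predictors x1 and x2, and a factor f:

# drop1() scores each term the way the first round of backward
# selection with AIC would (k = 2 is the AIC penalty).
fit <- lm(y ~ x1 + x2 + f, data = dat)
drop1(fit, k = 2)

# By default the factor f is added/dropped as a whole. Binarizing it
# into standalone dummy columns lets stepwise selection treat each
# level separately; drop one column before fitting to avoid collinearity.
dummies <- model.matrix(~ f - 1, data = dat)  # one indicator per level
dat_bin <- cbind(dat[c("y", "x1", "x2")], dummies)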

5
Q

Forward Selection

A

Starting with the null model (no predictors), forward selection adds the best predictor, one at a time, until no addition yields further improvement; 'best' is based on the chosen information criterion.

Each iteration compares the current model with each +1 predictor option available and picks the best one (the one that produces the lowest AIC/BIC).
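In R, forward selection can be run with step(); a sketch assuming a hypothetical data frame dat with response y:

# Start from the intercept-only model and search up toward the full model.
null_fit <- lm(y ~ 1, data = dat)
full_fit <- lm(y ~ ., data = dat)

fwd <- step(null_fit,
            scope = list(lower = formula(null_fit),
                         upper = formula(full_fit)),
            direction = "forward",
            k = 2)   # k = 2 gives AIC; use k = log(nrow(dat)) for BIC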

6
Q

Backward Selection

A

Starting with the biggest model (all predictors), backward selection drops the worst predictor, one at a time, until no removal yields further improvement; 'worst' is based on the chosen information criterion.

Each iteration compares the current model with each -1 predictor option available and picks the best one (the one that produces the lowest AIC/BIC). An interaction term must be dropped before its component predictors can be evaluated for dropping (see the sketch below).
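A matching sketch for backward selection in R, again with a hypothetical dat; note that step() respects the hierarchical principle, so x1:x2 must leave the model before x1 or x2 can be considered for dropping:

# Start from the full model and search downward.
full_fit <- lm(y ~ x1 * x2 + x3, data = dat)   # x1 * x2 includes x1:x2

bwd <- step(full_fit,
            direction = "backward",
            k = log(nrow(dat)))   # ln(n) penalty, i.e. BIC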

7
Q

Procedure Comparisons

A

Both perform variable selection, which helps avoid overfitting to the data, especially when the number of observations is small compared to the number of predictors.

Both are greedy, unlike best subset selection, which checks all possible predictor combinations to find the best one.

An advantage of forward selection is that it works in high-dimensional settings: because it never has to fit the full model, it can be used even when p exceeds n.

An advantage of backward selection is that it maximizes the potential for finding complementary predictors, since each predictor's contribution is always evaluated in the presence of all the others.

Forward selection with BIC tends to result in fewer predictors; backward selection with AIC tends to result in more predictors.

vs Best Subset
-Stepwise selection is a computationally more efficient alternative to best subset selection, since it considers a much smaller set of models. However, stepwise selection is not guaranteed to find the best possible model out of all 2^p (see the sketch below).
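The efficiency gap is easy to quantify; a quick check in R for p = 20 candidate predictors:

# Number of candidate models each procedure fits for p predictors.
p <- 20
2^p                  # best subset: 1,048,576 models
1 + p * (p + 1) / 2  # forward selection: at most 211 fits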
