Week 8 Flashcards
(19 cards)
Two main reasons we want to limit factors
1) Overfitting: when the # of factors is close to or larger than the # of data points, the model might fit too closely to random effects. Too few data points per factor can lead to overfitting, causing bad estimates.
2) Simplicity: simple models are better than complex ones. Less data is required. Less chance of including insignificant factors. Easier to interpret.
Illegal factors for credit decisions
1) Race, sex, religion, marital status for credit decisions
2) Can't use factors highly correlated with forbidden ones
3) Hard to demonstrate that a complex model is ok
Forward selection
Start with a model that has no factors. At each step, find the best new factor to add and put it in as long as it is a good enough improvement. Stop when there are no factors good enough to add or when we have added as many factors as we want. We can optionally remove factors that are no longer good enough. The definitions of "good" and "good enough" are parameters we can set. It is common to allow new factors to enter the model if their p-value is below .1 or .15, just for exploration. When it is time to remove factors, we may remove those with a p-value greater than .05.
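One way to sketch this in Python (not part of the original card), using statsmodels OLS with a p-value criterion; the DataFrame X of candidate factors, the Series y, and the 0.15 entry threshold are illustrative assumptions:

import statsmodels.api as sm

def forward_select(X, y, enter_p=0.15):
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value each candidate would have if added to the current model
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > enter_p:      # no candidate is good enough to add
            break
        selected.append(best)
        remaining.remove(best)
    return selected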
Backward selection
Opposite of forward selection: start with a model of all factors and at each step find the worst factor and remove it from the model. Repeat until there aren't factors bad enough to remove and the model doesn't have more factors than we want.
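A matching sketch of backward elimination under the same assumptions as above (statsmodels, p-value criterion; the 0.05 removal threshold is illustrative):

import statsmodels.api as sm

def backward_eliminate(X, y, remove_p=0.05):
    selected = list(X.columns)
    while selected:
        pvals = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop('const')
        worst = pvals.idxmax()
        if pvals[worst] <= remove_p:   # nothing is bad enough to remove
            break
        selected.remove(worst)
    return selected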
Stepwise regression
Start with all factors or no factors. At each step, we add or remove a factor. After adding each new factor, and at the end, we eliminate all factors that no longer appear to be good. This allows the model to adjust if a factor we earlier thought we needed no longer seems necessary thanks to new factors added to the model.
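A compact stepwise sketch combining the two ideas (add the best candidate, then purge factors whose p-values have grown too large); the thresholds and the cycle guard are illustrative choices, not part of the card:

import statsmodels.api as sm

def stepwise_select(X, y, enter_p=0.15, remove_p=0.05):
    selected, seen = [], set()
    while True:
        key = tuple(sorted(selected))
        if key in seen:                # guard against add/remove cycles
            break
        seen.add(key)
        remaining = [c for c in X.columns if c not in selected]
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        if not pvals or min(pvals.values()) > enter_p:
            break
        selected.append(min(pvals, key=pvals.get))
        # eliminate factors that no longer look good after the addition
        while selected:
            p = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop('const')
            if p.max() <= remove_p:
                break
            selected.remove(p.idxmax())
    return selected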
How do we choose good variables?
p-value, R-squared, BIC, AIC
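All four can be read off a fitted statsmodels OLS model; a minimal sketch, assuming the same X and y as in the earlier sketches:

import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.pvalues)        # p-value per coefficient
print(fit.rsquared)       # R-squared
print(fit.aic, fit.bic)   # information criteria (lower is better)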
Stepwise selection
Decisions are made step by step
Known as a greedy algorithm
At each step take one thing that looks best
Future options are not considered
Do we need to scale data before using Lasso?
Yes
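A minimal sketch of scaling before LASSO with scikit-learn; the pipeline and the alpha value are illustrative assumptions:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# scale each factor to mean 0 / variance 1 so the coefficient budget
# treats all factors equally, then fit LASSO
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
# model.fit(X, y)   # X, y as in the earlier sketches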
How do we choose T?
Depends on the number of variables, and on the quality of the model as you allow more variables in.
The best route is to run LASSO with different values of T and see which value gives the best tradeoff.
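scikit-learn's LassoCV does this kind of search, trying a range of penalty values (its alpha parameter plays the role of how tight T is, with larger alpha meaning a tighter budget) and picking the one with the best cross-validated error; a minimal sketch:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# try 100 penalty values with 5-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(n_alphas=100, cv=5))
# model.fit(X, y); model[-1].alpha_ is the chosen penalty,
# and coefficients exactly at 0 mark the variables LASSO dropped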
Elastic net
Almost the same as LASSO, but the constraint is a combination of the absolute values of the coefficients and their squares. Need to choose appropriate values of T and gamma.
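A minimal scikit-learn sketch; its ElasticNet is parameterized by a penalty strength alpha and a mix l1_ratio rather than T and gamma directly, but it fits the same combined-penalty idea (values are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet

# l1_ratio mixes the LASSO (absolute value) and ridge (squared) penalties
model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
# model.fit(X, y)   # X, y as in the earlier sketches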
Ridge Regression
Take out the absolute value term from Elastic Net
Doesn't do variable selection but can lead to better models
Lasso regression vs ridge regression
The objective of the two, to choose coefficients a1 and a2 that minimize the total error, is exactly the same. The difference is in the restriction, or constraint, on the coefficients.
Because the set of possible coefficients for lasso is a diamond, the quadratic error function can touch it at a corner, where some of the coefficients are 0, so those variables are not selected.
Because the set of possible coefficients for ridge is a circle defined by a quadratic function, the error function is unlikely to touch it at a corner, so all coefficients stay nonzero and all variables are part of the model.
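Written out, the two problems the card contrasts differ only in the constraint region (a_j are the coefficients and T the budget, following the notation of the cards above):

\min_{a}\ \sum_i \big(y_i - \hat{y}_i\big)^2 \quad \text{subject to} \quad \sum_j |a_j| \le T \qquad \text{(LASSO: diamond)}

\min_{a}\ \sum_i \big(y_i - \hat{y}_i\big)^2 \quad \text{subject to} \quad \sum_j a_j^2 \le T \qquad \text{(ridge: circle)}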
Bias-variance trade-off
Underfit model
High bias: miss or minimize real effects
Low variance: less differentiation between predictions
We underfit the real effects by eliminating variance from random effects
Overfit model
Low bias: real effects are modeled well
High variance: more differentiation between predictions
By fitting too much, we get unwanted variance from random patterns. Better fit to real patterns, but more fit to random patterns.
What is ridge regression used for?
If there is no ridge constraint, the regression solution would be the point in the middle of the ellipses. You can think of it as a zero-sized ellipse with the lowest error that can be achieved in linear regression. Adding the ridge constraint moves the solution point inward toward the origin, with all coefficients getting smaller than in the linear regression solution.
The same thing happens no matter where the original point is and no matter what the size of the circle is. If the original linear regression point is outside the circle in any direction, ridge regression decreases the magnitude of all of the coefficients. The tighter the constraint gets, the smaller the circle gets, and the smaller the magnitude of each coefficient gets.
If the original linear regression point is already inside the circle, then adding the ridge regression constraint doesn't change the solution. To decrease the coefficients you'd have to reduce the size of the circle by changing the value of tau (the ridge constraint) until the linear regression point is outside the circle, and then the coefficients would decrease.
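A minimal sketch of the shrinkage described here, using scikit-learn's Ridge; its alpha penalty grows as the circle shrinks, and the synthetic data and alpha values are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

print("linear:", LinearRegression().fit(X, y).coef_)
for alpha in (1.0, 10.0, 100.0):
    # larger alpha = tighter constraint = smaller coefficient magnitudes
    print("ridge alpha", alpha, ":", Ridge(alpha=alpha).fit(X, y).coef_)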
What does reducing the magnitude of each coefficient do?
We reduce our fit to both real and random patterns so it gets less overfit. We reduce variance in our model.
What type of approach is ridge regression?
Regularization. It doesn't select variables, but it can help reduce overfitting by simplifying not the number of variables but the magnitude of each one's effect on the model. If tau (the ridge constraint) is too small and reduces the coefficients too much, the model may be underfit. Have to be careful not to go too far.
If the linear regression model is overfit, ridge regression can be used to fix the problem
Forward selection, backward elimination, stepwise:
Good for initial data analysis. They point out variables that are worth exploring further. Stepwise regression is most common, as a generalization of the other two. They can give a set of variables that fits more to random effects than you'd like and appears to have a better fit (R-squared) than it really has. When tested on different data they don't perform as well.
Lasso and Elastic Net are slower to compute but result in better predictions.
Recommendation: use the more advanced, slower models unless you are just doing introductory data exploration, where you can use the greedy methods first and then build a more advanced model with LASSO or Elastic Net.
Advantages of elastic net
Variable selection benefits of LASSO
Predictive benefits of ridge regression
Disadvantages of elastic net
Arbitrarily rules out some correlated variables, like LASSO
Underestimates coefficients of very predictive variables, like Ridge Regression