Session 7: Regularised Regression Flashcards

1
Q

What does the standard statistical approach assume?

A

A parametric model such as the linear model

f(Xᵢ) = β₀ + β₁xᵢ₁ + … + βₚxᵢₚ

2
Q

If we hypothesize a linear model for the response y

y = β₀ + β₁x₁ + … + βₚxₚ + ε

where ε ~ N(0, σ²),

how do we estimate the unknown parameters β₀, β₁, …, βₚ?

A

Ordinary least squares (OLS):

We choose β₀, β₁, …, βₚ to minimize the residual sum of squares between the observed and predicted responses in the (same) data set.

Statistical inference assuming a normal distribution of the error ε allows constructing confidence intervals around the parameters and performing statistical tests.
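A minimal R sketch of OLS with lm() (the simulated data, sample size, and coefficient values are assumptions for illustration only):

# Simulate a small data set with two predictors
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 1)  # true model plus N(0, sigma^2) error

# OLS: lm() chooses the betas that minimise the residual sum of squares
fit <- lm(y ~ x1 + x2)
summary(fit)   # coefficient estimates, standard errors, t-tests
confint(fit)   # confidence intervals based on the normality assumption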

3
Q

What are problems with the OLS estimator?

A

There are situations when the OLS estimator does not work well:

When the independent variables are highly correlated – we get unstable, highly variable estimates

When the number of parameters is large relative to the sample size – danger of overfitting

OLS optimises an unbiased estimator of the mean response E(y), not the prediction of new, unseen individual cases yᵢ – the model that is optimal for explanatory research is very often not the optimal model for prediction

4
Q

Describe a model that under-fits

A

Performs poorly on the training data.

It has not captured the relationship between the predictors X and the outcome Y. Performance on test data will be even worse. The model is biased!

For example, a straight-line model fitted to a curvilinear relationship captures no curvature and so explains the least amount of variance in our training data.

5
Q

Describe a model that over-fits

A

A model has overfitted our training data when it performs well on the training data but poorly on new test data. This is because the model has fitted noise in the training data.

The model memorizes the exact pattern of the data it has seen and is unable to generalize to unseen examples, because that pattern will not reappear.

The model variance is high.

We want a model that works well on future data, not just on the training data!

An over-fitted model explains the training variance best – very small differences between observed and predicted values

6
Q

Describe a balanced model

A

A model is balanced ("just right") when it captures the true pattern and therefore predicts new, unseen cases well.

A balanced model that captures the curvilinear relationship should explain more variance than the under-fitted model but less than the over-fitted model.

7
Q

Under-fitting is a bigger problem than over-fitting

True or false

A

True

8
Q

If we assume that there is a relationship between outcome Y (depression score) and at least one of the p independent variables X (clinical and demographic characteristics)

How can we model this?

A

π‘Œ=𝑓(𝐗)+πœ–
where 𝑓( ) is an unknown function
and πœ– random error with mean 0 and variance 0 and o2

Then the expected mean squared prediction error is:

E(MSE)= (π‘€π‘œπ‘‘π‘’π‘™ π΅π‘–π‘Žπ‘ )2 +π‘€π‘œπ‘‘π‘’π‘™ π‘‰π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ + Οƒ2 (noise that we cannot explain)

Want model with smallest prediction error

9
Q

What do the components of the expected mean squared prediction error mean?

E(MSE) = (Model Bias)² + Model Variance + σ²

A

Bias is the result of misspecifying the model f

Reflects how close the functional form of the model is to the true model

High bias results in underfitting

Model (estimation) variance is the result of using a sample to estimate f(X)

Quantifies how strongly the model depends on the particular data points used to build it.

High variance = small changes in the data change the model parameter estimates substantially

High variance results in overfitting

σ² is the irreducible error that remains even if the model f is correctly specified and estimated
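A hedged R sketch that makes the decomposition concrete (the quadratic true f, the straight-line working model, and all numbers are assumptions chosen only for illustration):

# True f(x) is quadratic; we repeatedly fit an (under-fitting) straight line
set.seed(1)
f     <- function(x) 1 + 2 * x - 1.5 * x^2
x0    <- 0.8        # point at which we evaluate the prediction
sigma <- 0.5
preds <- replicate(2000, {
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ x), newdata = data.frame(x = x0))
})

bias2    <- (mean(preds) - f(x0))^2   # (Model Bias)^2
variance <- var(preds)                # Model Variance
bias2 + variance + sigma^2            # approx. expected MSE at x0 for a new case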

10
Q

How do explanatory and prediction modelling optimize MSE differently?

A

In explanatory modelling we try to minimise the MSE in our training dataset; the MSE for predicting new cases is usually larger than the MSE in the training sample. In prediction modelling we instead try to minimise the expected MSE for new, unseen cases.

11
Q

How can we improve prediction accuracy?

A

By reducing the variability of the regression estimates (model variance) at the cost of increased bias (model bias) – because OLS is unbiased, the model bias is 0 and the MSE is driven by the model variance plus the irreducible error

Thus, we can find a model that is biased but has smaller variance, and this would improve prediction accuracy

The best explanatory model is often different from the best prediction model! – because we optimise in a different way

12
Q

If we fit 90 parameters to a data set of 100 persons, the model explains almost 100% of the variance in the training data, the absolute values of the coefficients are too large, and we get type I errors (too many significant parameters). What does this mean?

A

Our model overfits the training data and is unlikely to predict a new random sample well

13
Q

Our OLS model is unbiased (with increasing sample size the parameter estimates would move towards the true values)

The absolute values of the parameters are too large (many well above 0).

How could we improve the model?

A

We need to shrink the regression coefficients somehow

  • Shrinkage (or regularization) helps prevent linear models from overfitting the training data by shrinking the coefficients towards 0.
14
Q

What do shrinkage (or regularization) methods perform?

A

Linear regression while regularizing or shrinking the estimated coefficients towards 0.

15
Q

Why does shrinkage help with overfitting?

A

It introduces bias but may decrease the variance of the estimates. If the latter effect is larger, we would decrease the test error!

16
Q

By sacrificing unbiasedness, we can reduce…

A

the variance to make the overall MSE lower

17
Q

With regularised methods what do we introduce?

A

Some bias: on average the estimates are a bit away from the true parameter values, but there is very little model variance. The estimates all lie close to the true values, so the model will predict well. Thus, on average, the estimates are closer to the true values than with the ordinary least squares method.

18
Q

What did van Houwelingen and le Cessie (1990) develop?

A

Heuristic shrinkage estimate:

𝛾̂=(π‘šπ‘œπ‘‘π‘’π‘™ πœ’2βˆ’π‘)/(π‘šπ‘œπ‘‘π‘’π‘™ πœ’2 )
where
p is the total degree of freedoms of the predictors (number of parameter -1) and
πœ’2 the likelihood ratio statistics for testing the overall effect of all predictors.

For linear model with an intercept of b0 and coefficients 𝛽̂𝑗(j=1,2…p) , the shrunken estimates are easily estimated:

π‘ β„Žπ‘Ÿπ‘’π‘›π‘˜π‘’π‘› 𝛽^𝑗=y^(𝛽̂𝑗) – gamma hat x estimated regression coefficient

π‘ β„Žπ‘Ÿπ‘’π‘›π‘˜π‘’π‘› 𝛽0=(1βˆ’π›ΎΜ‚ ) π‘ŒΜ…+𝛾̂ (𝛽0)) - Intercept is 1 – shrinkage estimate x mean of all Y’s x mean of all observed outcomes plus shrinkage estimate x intercept

The model with shrunken regression coefficients predict on average better new unseen cases
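A hedged R sketch of applying the heuristic to a linear model (the simulated data are assumptions, and obtaining the model χ² via logLik() is an implementation choice, not part of the original formula):

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(0.5, -0.3, rep(0, p - 2)) + rnorm(n))

full  <- lm(y ~ X)
null  <- lm(y ~ 1)
chi2  <- as.numeric(2 * (logLik(full) - logLik(null)))  # model chi-square (LR statistic)
gamma <- (chi2 - p) / chi2                              # heuristic shrinkage factor

b           <- coef(full)
shrunken    <- gamma * b[-1]                         # shrunken slopes: gamma-hat * beta-hat
shrunken_b0 <- (1 - gamma) * mean(y) + gamma * b[1]  # shrunken intercept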

19
Q

Evaluate Heuristic shrinkage estimate

A

Works reasonably well in generalized linear models (e.g. Steyerberg et al., 2001)

p is the number of candidate predictors if variable selection is used! – If we remove some variables because they are not correlated with the outcome, we must count the variables we started with, not the final number of variables in the model

It would be better to integrate the shrinkage in the model-building process:

Find a shrinkage on the parameters that optimizes the prediction of unseen cases
- A general shrinkage procedure is given by penalized (or regularized) regression methods

20
Q

What is a modern approach to prediction modelling?

A

Regularized or penalized methods that can be applied both to large data sets (bioinformatics, neuroimaging, wearable data) and to small data sets with a large number of variables (RCTs, experimental studies, cohort studies).

Not really new:
Ridge regression: Arthur Hoerl and Robert Kennard (1970)
Limited computer power restricted its use at the time.

21
Q

What is the basic principle of penalised methods?

A

To improve prediction accuracy by reducing the variability of the regression estimates at the cost of increased bias (shrinkage)

22
Q

What are advantages to using penalised methods?

A
  1. Also allows automatic variable selection by shrinkage:

As the coefficients of the weaker predictors are shrunk towards zero, they are effectively removed from the regression model

This is very useful for high-dimensional data (p ≫ n)

  2. Can also effectively deal with ill-conditioned regression problems:

Multi-collinearity (and redundancy – too many variables in the dataset)

The number of variables (p) is close to the sample size (n)

23
Q
  1. What happens when a model overfits the data?
  2. How can this be remedied?
A
  1. Standard estimates of regression coefficients become inflated or unstable
  2. Estimates can be stabilised (regularised) by adding a penalty to the estimating equations

For linear regressions, the penalty is added to the residual sum of squared errors (RSS)

𝑅𝑆𝑆(πœ†)= βˆ‘(π‘¦π‘–βˆ’π‘¦Μ‚π‘– )2 +πœ†π‘“(𝛽)

In OLS we try to minimize RSS to find best optimal unbiased estimate of our regression coefficient

In penalized regression we add a penalty term called lambda x function of regression coefficient

Larger lambda the larger the penalty in our residual sums of squared error
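A tiny R sketch of the penalised criterion (penalised_rss is a hypothetical helper, shown with the ridge penalty f(β) = Σβ²; in practice the intercept is usually left unpenalised):

# RSS(lambda) = sum((y - yhat)^2) + lambda * f(beta)   (ridge version of f)
penalised_rss <- function(y, yhat, beta, lambda) {
  sum((y - yhat)^2) + lambda * sum(beta^2)  # use sum(abs(beta)) for the lasso penalty
}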

24
Q

In principle with the right choice of lambda what can we get?

A

An estimator with a better MSE

The estimate is not unbiased, but what we pay for in bias we make up for in variance

By sacrificing unbiasedness, we can reduce the variance to make the overall MSE lower

25
Q

We try to find a lambda(penalty) that does what?

A

Minimises error of unseen cases

If lambda is 0 we have the OLS method with a bias of 0; if we increase lambda, the bias becomes larger and so does its contribution to the MSE. On the other hand, the variance, which may be large under OLS, becomes smaller as we increase lambda.

26
Q

What are 3 commonly used penalty functions?

A
  1. Ridge penalty

f(β) = Σβ²

The sum of the squared coefficients (Σβ²) forms the penalty
Also called the L2 norm, as the coefficients are squared

  2. LASSO (Least Absolute Shrinkage and Selection Operator):
    f(β) = Σ|β|

The sum of the absolute coefficients (Σ|β|) forms the penalty

Also called the L1 norm, as it is beta to the power of one

  3. Elastic net
    – a combination of the L1 and L2 norm regularization
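A hedged sketch using the glmnet R package mentioned later in these cards; the simulated X and y are assumptions, and alpha switches between the three penalties:

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(100)

ridge <- glmnet(X, y, alpha = 0)    # L2 penalty: lambda * sum(beta^2)
lasso <- glmnet(X, y, alpha = 1)    # L1 penalty: lambda * sum(|beta|)
enet  <- glmnet(X, y, alpha = 0.5)  # elastic net: a mix of L1 and L2
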
27
Q

What is commonly used to deal with ill-conditioned regression problems, such as multi-collinearity (high correlation between predictor variables) and when the
number of variables (p) is close to the sample size (n)?

A

Ridge regression

28
Q

How is the ridge regression estimate of β obtained?

A

By minimising the residual sum of squares plus our penalty, which is lambda times the sum of the squared regression coefficients. This penalty term is added to the RSS, and we now call this the penalised RSS(λ).

The parameter λ scales the norm – it controls the amount of penalty

29
Q

What is one of the important problems in applying ridge regression?

A

To choose the right value of λ

30
Q

What is LASSO (Least Absolute Shrinkage and Selection Operator) a promising technique for?

A

Variable selection

Finding a small subset of most predictive variables in a high dimensional dataset is an interesting and important problem

31
Q

How does LASSO tend to deal with overfitting?

A

Tends to assign zero coefficients to most irrelevant or redundant variables - This is also called a sparse solution

32
Q

How are LASSO estimates obtained?

A

By minimising the RSS plus the lasso penalty term, which is lambda times the sum of the absolute values of the regression parameters.

This is called the L1 penalty (or L1 norm).

The lasso penalty involves the absolute values of the regression parameters, not the sum of the squared values as in ridge regression.

We need to find the best lambda, i.e. the one that minimises the (cross-validated) mean squared prediction error.

Similar to ridge regression, the penalty parameter (λ) controls the amount of penalty (user customisable).

33
Q

If we compute lasso or ridge, the data must be…

A

Standardised, so that variables with a large range do not dominate model selection:

Different units (m versus km) would result in different solutions

This is automatically done in most software packages

R packages such as "glmnet" back-transform the final regression coefficients to the original scale!

34
Q

What is the z-transformation formula?

A

Linear transformation of values to a common mean of zero and standard deviation of 1:
zᵢ = (xᵢ − x̄) / s

with
zᵢ = z-transformed observation of case i of the sample
x̄ = sample mean
xᵢ = original value of case i
s = standard deviation of the sample

35
Q

The z-transformation changes the form of the distribution, not only the mean and the standard deviation.

True or false

A

FALSE

z-transformation does not change the form of the distribution, it only adjusts the mean and the standard deviation!

36
Q

How do we select lambda?

A

The goal is to evaluate the model in terms of its ability to predict future observations:

The model needs to be evaluated on a dataset that was not used to build the model (test set)

We assess different lambdas and choose the one which best predicts unseen cases, using cross-validation

This best lambda is then used to fit the model using the complete data set

Calculate the cross-validated (average) MSE for e.g. 100 lambdas of different strength and pick the lambda with the smallest MSE

We pick the lambda which best predicts unseen cases (= smallest Mean squared error, MSE)
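A hedged glmnet sketch of selecting lambda by cross-validation (simulated data and fold number are assumptions; shown for the lasso, alpha = 1):

library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 30), 200, 30)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(200)

cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # CV over a grid of lambdas
plot(cvfit)                                       # CV mean squared error vs log(lambda)
cvfit$lambda.min                                  # lambda with the smallest CV MSE
coef(cvfit, s = "lambda.min")                     # coefficients refitted on the full data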

37
Q

What is the performance of our model function measured by?

A

A loss function for penalizing error in prediction.

38
Q

What does a loss function measure?

A

How well a prediction model does in terms of being able to predict the expected outcome.

39
Q

What is a popular loss function?

A

MSE loss function

We decide to choose the function f(x) which minimizes the expected loss or here the expected mean squared prediction error (MSE).

The expected MSE can be estimated by cross-validation or bootstrapping methods - We use the same methodology as for internal validation!

40
Q

To build optimal predictive models, any sensible subset selection algorithm can be combined with what?

A

Cross-validation to build a good prediction model

The idea is to build a large number of alternative models (of varying complexities) and evaluate the predictive performance using cross-validation to select the best model

In regularized regression we compare models with different lambdas!

41
Q

Using hold-out data for prediction accuracy estimation involves what?

A

Using CV to select the optimal λ selects the best set of predictors for unseen cases.

However: the prediction accuracy measures are over-optimistic estimates of the accuracy in a future sample, because the CV test data were used to select our model!
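A hedged sketch of one remedy, nested cross-validation, where the outer folds estimate accuracy and an inner cv.glmnet selects lambda (data, fold counts, and alpha are assumptions):

library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 30), 200, 30)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(200)

outer <- sample(rep(1:5, length.out = nrow(X)))   # 5 outer folds
outer_mse <- sapply(1:5, function(k) {
  train <- outer != k
  cvfit <- cv.glmnet(X[train, ], y[train], alpha = 1)         # inner CV picks lambda
  pred  <- predict(cvfit, newx = X[!train, ], s = "lambda.min")
  mean((y[!train] - pred)^2)                                  # error on the untouched outer fold
})
mean(outer_mse)   # less optimistic estimate of future prediction accuracy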

42
Q

What is ridge not useful for?

A

Parsimonious model selection

43
Q

The ridge penalty function is very flat near zero values of β. What does this mean?

A

It does not encourage the β coefficients to be exactly zero
Not good for variable selection
Not good for sparse problems

Alternative penalised methods (e.g., LASSO, see next) are a better option for variable selection

44
Q

What is lambda.1se?

A

This is a slightly stronger penalty than the minimum lambda and lies within one standard error of the optimal value of lambda.

The purpose of regularization is often to balance accuracy and simplicity: We want a model with the smallest number of predictors that also gives a good accuracy.

Setting lambda = lambda.1se results in a simpler model compared to lambda.min (fewer variables are selected), but the model might be a little less accurate than the one obtained with lambda.min.

Research suggests that this lambda sometimes predicts better in external data sets and selects fewer false-positive predictors.
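A short glmnet sketch comparing the two lambdas (simulated data assumed):

library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 30), 200, 30)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(200)

cvfit <- cv.glmnet(X, y, alpha = 1)
cvfit$lambda.min               # lambda with the smallest CV error
cvfit$lambda.1se               # largest lambda within one SE of that minimum
coef(cvfit, s = "lambda.1se")  # simpler model: fewer non-zero coefficients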

45
Q

When should you compute OLS?

A

If you have large sample sizes with a relatively small number of likely predictor variables (theory-driven)

46
Q

When should you compute Ridge?

A

If you expect many small effect sizes and predictors are likely true ones (you want to keep all variables in the model).

47
Q

When should you compute Lasso?

A

If you have a few stronger predictors among a large number of likely weak predictors or noise variables.

48
Q

What is not very meaningful in penalised regressions and why?

A

Statistical Inference of regression coefficients

This is because the penalised estimates are biased towards zero

The Standard Error (SE) of penalised coefficients gives only partial information about the precision

The SE ignores the inaccuracy caused by the bias

Software packages do not supply standard errors (SE), confidence intervals (CI), or p-values for penalised regression. Internal validation is our "test".

Major aim in penalised regression is to build a prediction model/variable selection rather than performing statistical inference

49
Q

Regularized or penalized regressions are extensions of the linear model.

True or false

A

True

50
Q

Regularized or penalized regressions seek to do what?

A

Minimise the sum of squared errors (or MSE) of the model on the training data, while also trying to avoid over-fitting by reducing the complexity of the model at the cost of some bias

This is done by shrinking the regression coefficients

51
Q

What are two popular examples of Regularized or penalized regressions?

A

Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).

Lasso Regression: where OLS is modified to also minimize the absolute sum of the coefficients (L1 regularization).
Unlike Ridge, Lasso regression performs variable selection by shrinking some coefficients to 0