Model Selection Flashcards
What is the traditional strategy?
No model selection: you run one model, assume every continuous variable has a linear relationship, and include all interactions.
Problems with traditional strategy?
Multicollinearity is not assessed, the data may be overfit, and not all relationships are actually linear.
Historical strategy
Based on past models that ran just fine. One advantage is comparability: you can compare results with past studies because you include the same predictors and the same variables.
Problems with historical designs
You may ignore variables that were not initially considered, and the design may eventually become insufficient.
The exploratory approach
A wide range of models is run, and usually the one that "worked" is the one reported.
Problems with the exploratory strategy?
Many experimenter degrees of freedom; results can be too flexible and can lead to replication problems. The fix is to use a more conservative model selection approach (AIC and BIC are always safe bets).
Don't do this when labeling analyses
Call an exploratory analysis confirmatory.
Theoretically driven approach
Only a limited number of models are run here (three tops), and each has to be theoretically driven. It is very systematic and highly focused, and less prone to overfitting.
Problems with theoretically driven?
It can miss analyses of variables you already collected. It is okay to also include exploratory models, as long as you report everything; but to confirm an exploratory finding, you should run an additional experiment and analyze it as confirmatory.
What is the mixed strategy?
Report which models or decisions were exploratory, theoretically driven, etc., each in its appropriate section. You can segment the analysis into different parts, which lessens the overfitting issue.
Disadvantages of mixed strategy?
Researcher degrees of freedom
P-focused strategy
Choosing whichever model makes your key variables come out significant.
Interocular Test
You look at the results, and it hits you between the eyes. Plot your data, which says a lot; the statistics become more of an afterthought. Worry if the graph shows no differences but the analysis is telling you otherwise.
What happens with R-squared?
The proportion of variance accounted for. Its formula involves the mean (in the total sum of squares), and the mean is a measure of central tendency for a NORMAL DISTRIBUTION. That is why, in models where normality is not met, R-squared is not a good model fit indicator.
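The standard formula; the sample mean $\bar{y}$ is what appears in the denominator:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$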
RMSE
The root mean squared error: the squared errors are averaged and then the square root is taken. In other words, it is a measure of the average distance of the observed points from the model's predictions. But it does not by any means deal with complexity.
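The standard formula, with $\hat{y}_i$ the fitted value for observation $i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$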
Mallow's Cp
Appears in JMP when you try to do a stepwise regression. It is another model selection index, but it is not common.
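For reference, one standard form of the statistic, where $SSE_p$ is the error sum of squares of a candidate model with $p$ parameters and $\hat{\sigma}^2$ is the error variance estimate from the full model:

$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p$$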
The AIC and BIC
The Akaike and Bayesian Information Criteria. Both penalize for complexity, but BIC penalizes more heavily, so it favors simpler models over complex models.
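The standard definitions, with $k$ parameters, sample size $n$, and maximized likelihood $L$; BIC's penalty grows with $n$, which is why it leans toward simpler models:

$$\mathrm{AIC} = 2k - 2\ln L \qquad \mathrm{BIC} = k\ln n - 2\ln L$$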
Likelihood Ratios
When you compare one model to another, we talk about how much more likely one model is to fit the data than the other. It gives a ratio: one model is 10 times more likely, 35 times more likely, and so on.
AIC / BIC / Likelihood ratios
Since AIC and BIC are based on log-likelihoods, you can include them in the model fit inspection, or you can transform AIC and BIC differences back to the likelihood scale to make them more interpretable as ratios.
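A minimal sketch of that back-transformation in Python; the AIC values are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical AIC values for two fitted models (illustrative only).
aic_a = 210.4
aic_b = 215.0

# exp(delta_AIC / 2) turns an AIC difference into a likelihood-style
# evidence ratio: how many times more likely the better model is.
delta = aic_b - aic_a
evidence_ratio = np.exp(delta / 2)
print(f"Model A is about {evidence_ratio:.0f}x more likely than model B")
```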
Cross validation
Uses two sets of data: sample (training) data to develop a model, and a test dataset to see if that model also applies to new data. This is an empirical way to determine whether the model will perform well outside that sample. If it does, it cross-validates.
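A minimal sketch with scikit-learn, assuming synthetic stand-in data (nothing here comes from a real study):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 rows, 3 predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

# Develop the model on the sample data, then test it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))     # out-of-sample check
```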
LOO
Leave-one-out cross-validation. It holds one row of your data out, fits the model on the remaining rows, checks how well the model does on that held-out row, and repeats this for every single row in your dataset.
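A minimal LOO sketch, again with scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=30)

# Fit on all rows but one, score on the held-out row, repeat for every row.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOO RMSE:", np.sqrt(-scores.mean()))         # one score per row
```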
Bootstrapping
Sampling with replacement, treating the sample distribution as the population. It gives you an empirically estimated distribution. Some people use it to make up for a small sample size.
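A minimal sketch in NumPy; `data` is a hypothetical skewed sample:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(size=30)          # small, skewed sample

# Resample with replacement, treating the sample as the population,
# to build an empirical sampling distribution of the median.
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap 95% interval for the median: [{lo:.2f}, {hi:.2f}]")
```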
When do you use bootstrapping?
When the distributional properties are so bad that no standard statistical test can determine the differences. This happens, for example, with interquartile ranges.
Where could you use this?
In multilevel repeated-measures designs when you have slight imbalances, for example a few values missing within each condition; it does not apply if a whole level is missing.