Encoding and Multicollinearity Flashcards

(19 cards)

1
Q

What is multicollinearity?

A

When two or more variables are linearly dependent or nearly so.
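
A minimal NumPy sketch of the idea (the data and variable names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2 * x1 + rng.normal(scale=0.01, size=100)  # nearly a scaled copy of x1

# A large condition number of the design matrix signals near-linear dependence
X = np.column_stack([x1, x2, x3])
print(np.linalg.cond(X))  # huge compared to a well-conditioned matrix
```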

2
Q

Give 3 examples of multicollinearity.

A

Variables summing to a constant (e.g., percentages)
One variable is a scaled version of another
Dummy variables for all categories (without dropping one)
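
A short pandas sketch of the third example, the dummy-variable trap (the column values are invented):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One dummy per category: the dummies sum to 1 in every row,
# a perfect linear dependence (and collinear with an intercept)
full = pd.get_dummies(df["color"])
print(full.sum(axis=1).unique())  # [1]

# Dropping one category removes the dependence
safe = pd.get_dummies(df["color"], drop_first=True)
```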

3
Q

What are the consequences of multicollinearity?

A

Inflated, unstable coefficient weights
The model fits noise → overfitting
Non-unique solutions
Matrix inversion fails, breaking some algorithms (see the sketch below)
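
The last consequence can be shown directly: with an exactly collinear column, the normal-equations matrix XᵀX is singular and cannot be inverted (toy design matrix invented for illustration):

```python
import numpy as np

x = np.arange(5, dtype=float)
X = np.column_stack([np.ones(5), x, 2 * x])  # third column = 2 × second

# OLS needs (X^T X)^(-1); exact collinearity makes X^T X singular
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)
```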

4
Q

How can scatterplots help detect multicollinearity?

A

They reveal when one variable is nearly a linear transformation of another, or when a pair of variables appears highly correlated.
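
For instance, a pairwise scatter matrix (synthetic data for illustration) makes the near-linear pair stand out:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": 3 * a + rng.normal(scale=0.2, size=200),  # ≈ scaled copy of a
                   "c": rng.normal(size=200)})

# The a-vs-b panel shows a tight line; the other panels show diffuse clouds
scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()
```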

5
Q

What does a correlation matrix reveal?

A

Pairs of variables that are highly correlated, hinting at redundancy.
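
A minimal sketch (same kind of synthetic data as above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": 3 * a + rng.normal(scale=0.2, size=200),
                   "c": rng.normal(size=200)})

# Off-diagonal values near ±1 flag redundant pairs (here: a and b)
print(df.corr().round(2))
```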

6
Q

What is the Variance Inflation Factor (VIF)?

A

A score that quantifies how much a variable’s variance is inflated due to multicollinearity.
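
For predictor i, VIF_i = 1 / (1 − R_i²), where R_i² comes from regressing predictor i on all the other predictors. A sketch using statsmodels (synthetic data; the usual recipe adds a constant first):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
a = rng.normal(size=200)
X = pd.DataFrame({"a": a,
                  "b": 3 * a + rng.normal(scale=0.2, size=200),
                  "c": rng.normal(size=200)})

Xc = sm.add_constant(X)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(Xc.values, i), 1))
# a and b show very large VIFs; c stays near 1
```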

7
Q

How is VIF interpreted?

A

VIF = 1: no multicollinearity
1 < VIF < 5: moderate
VIF > 5: high (concerning)

8
Q

How can we fix multicollinearity by dropping a variable?

A

Remove a redundant variable if it’s clearly correlated with others.

Pro: interpretability is preserved. Con: the redundancy may be hard to spot.

9
Q

How does PCA help fix multicollinearity?

A

PCA transforms data into uncorrelated components.

Pro: No manual variable selection. Con: PCs may be hard to interpret.
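
A minimal scikit-learn sketch (synthetic collinear data; scaling first, since PCA is scale-sensitive):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
a = rng.normal(size=200)
X = np.column_stack([a,
                     3 * a + rng.normal(scale=0.2, size=200),
                     rng.normal(size=200)])

# Project onto principal components; the components are uncorrelated by construction
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(np.corrcoef(Z, rowvar=False).round(3))  # ≈ identity matrix
```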

10
Q

What’s another method to fix multicollinearity?

A

Group related variables (e.g., take their mean or sum).

Pro: retains information from all variables. Con: doesn't help when the variables sum to a constant, since their sum is just that constant.
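
A tiny pandas sketch of grouping (hypothetical exam-score columns):

```python
import pandas as pd

df = pd.DataFrame({"math": [70, 80], "physics": [75, 85], "chem": [72, 90]})

# Replace the block of related scores with a single aggregate feature
df["science_avg"] = df[["math", "physics", "chem"]].mean(axis=1)
df = df.drop(columns=["math", "physics", "chem"])
print(df)
```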

11
Q

What are the assumptions of the T² Test?

A

Multivariate normality, independence, homogeneity of variance, continuous variables, no outliers, no multicollinearity.

12
Q

What are the assumptions of MANOVA?

A

Multivariate normality, independent observations, homogeneity of variance, continuous variables, no outliers.

13
Q

What are the assumptions of PCA?

A

Linearity, continuous data, scaled data, no outliers; works best when the variables are correlated.

14
Q

What are the assumptions of Factor Analysis (FA)?

A

Linearity, continuous variables, scaled data, no outliers; works best when there is some correlation. Also assumes latent factors exist.

15
Q

What are the assumptions of LDA?

A

Linearity, continuous variables, independence, homogeneity of variance, no outliers, no multicollinearity.

16
Q

What are the assumptions of Clustering?

A

Scaled data, clusters actually exist, no outliers; the choices of cluster shape, linkage, distance metric, and k all matter.

17
Q

What are the assumptions of Canonical Correlation Analysis (CCA)?

A

Linearity, continuous variables, homogeneity of variance, no outliers, no multicollinearity.

18
Q

Can we use encoded categorical data in clustering?

A

Yes, but we should use distance metrics appropriate for categorical data.
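
For example, with label-encoded categories, Hamming distance (the fraction of attributes that differ) is a common choice; a SciPy sketch with made-up encodings:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are label-encoded categorical records
X = np.array([[0, 1, 2],
              [0, 1, 0],
              [1, 0, 2],
              [1, 0, 1]])

D = pdist(X, metric="hamming")  # pairwise fraction of mismatched attributes
print(squareform(D).round(2))

# Hierarchical clustering directly on the categorical-aware distances
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
print(labels)
```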

19
Q

What’s the main caution with encoding?

A

Be careful mixing categorical and numerical data — they behave very differently in multivariate models.