Encoding and Multicollinearity Flashcards
(19 cards)
What is multicollinearity?
When two or more variables are linearly dependent or nearly so.
Give 3 examples of multicollinearity.
Variables summing to a constant (e.g., percentages)
One variable is a scaled version of another
Dummy variables for all categories (without dropping one)
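The third example above is the classic "dummy variable trap". A minimal sketch (the data and variable names are illustrative) showing that a full set of one-hot columns plus an intercept is rank-deficient, while dropping one dummy restores full rank:

```python
import numpy as np

# Hypothetical categorical column with three levels.
colors = ["red", "green", "blue", "green", "red", "blue"]
levels = sorted(set(colors))

# Full one-hot encoding: one column per category.
onehot = np.array([[1.0 if c == lvl else 0.0 for lvl in levels] for c in colors])

# Every row of the one-hot block sums to 1, so with an intercept column of
# ones the design matrix is linearly dependent: 4 columns but rank 3.
X_full = np.column_stack([np.ones(len(colors)), onehot])
print(np.linalg.matrix_rank(X_full))   # 3, not 4: rank-deficient

# Dropping one dummy restores full column rank.
X_drop = np.column_stack([np.ones(len(colors)), onehot[:, 1:]])
print(np.linalg.matrix_rank(X_drop))   # 3 = number of columns
```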
What are the consequences of multicollinearity?
Inflated variable weights
Models train on noise → overfitting
Non-unique solutions
Matrix inversion fails (breaks some algorithms)
How can scatterplots help detect multicollinearity?
They reveal when two variables look like near-linear transformations of each other, or when a pair appears highly correlated.
What does a correlation matrix reveal?
Pairs of variables that are highly correlated, hinting at redundancy.
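A minimal sketch of this check (synthetic data; variable names are illustrative): one column is built as a near-scaled copy of another, and the correlation matrix exposes the redundancy.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)   # nearly a scaled copy of x1
x3 = rng.normal(size=200)                         # independent variable

# Correlation matrix across the three columns.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))
# corr[0, 1] is close to 1 -> x1 and x2 are redundant;
# corr[0, 2] stays near 0 -> x3 carries separate information.
```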
What is the Variance Inflation Factor (VIF)?
A score that quantifies how much a variable’s variance is inflated due to multicollinearity.
How is VIF interpreted?
VIF = 1: No multicollinearity
1 < VIF ≤ 5: Moderate multicollinearity
VIF > 5: High multicollinearity (concerning)
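The VIF of a variable is 1 / (1 − R²), where R² comes from regressing that variable on all the others. A minimal sketch (synthetic data; the helper name is my own) computing it directly:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # collinear with x1
x3 = rng.normal(size=300)                   # independent
X = np.column_stack([x1, x2, x3])

print(round(vif(X, 0), 1))   # well above 5 -> x1 is collinear
print(round(vif(X, 2), 1))   # near 1 -> x3 is fine
```

The same quantity is available as `variance_inflation_factor` in `statsmodels` if that library is already a dependency.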
How can we fix multicollinearity by dropping a variable?
Remove a redundant variable if it’s clearly correlated with others.
Pro: Interpretability preserved. Con: The redundant variable may be hard to spot.
How does PCA help fix multicollinearity?
PCA transforms data into uncorrelated components.
Pro: No manual variable selection. Con: PCs may be hard to interpret.
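A minimal sketch of this decorrelation (PCA via eigendecomposition of the covariance matrix; the data are synthetic): two correlated inputs become component scores whose correlation is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=500)   # correlated pair
X = np.column_stack([x1, x2])

# PCA: center the data, eigendecompose the covariance matrix,
# then project onto the eigenvectors to get component scores.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
scores = Xc @ eigvecs

# The component scores are uncorrelated (off-diagonal ~ 0 up to float error).
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```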
What’s another method to fix multicollinearity?
Group related variables (e.g., take their mean or sum).
Pro: Retains information. Con: Doesn't remove the collinearity if the grouped variables sum to a constant.
What are the assumptions of the T² Test?
Multivariate normality, independence, homogeneity of variance, continuous vars, no outliers, no multicollinearity.
What are the assumptions of MANOVA?
Multivariate normality, independent observations, homogeneity of variance, continuous variables, no outliers.
What are the assumptions of PCA?
Linearity, continuous data, scaled data, no outliers; correlation helps.
What are the assumptions of Factor Analysis (FA)?
Linearity, continuous variables, scaled data, no outliers; works best when there is some correlation. Additionally assumes latent factors exist.
What are the assumptions of LDA?
Linearity, continuous vars, independence, homogeneity, no outliers or multicollinearity.
What are the assumptions of Clustering?
Scaled data, clusters exist, no outliers; shape, linkage, distance, and k matter.
What are the assumptions of Canonical Correlation Analysis (CCA)?
Linearity, continuous vars, homogeneity of variance, no outliers or multicollinearity.
Can we use encoded categorical data in clustering?
Yes, but we should use distance metrics appropriate for categorical data.
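A minimal sketch of one such metric (the records and helper name are illustrative): the Hamming distance, i.e. the fraction of attributes on which two categorical records disagree, instead of Euclidean distance on arbitrary integer codes.

```python
def hamming(a, b):
    """Fraction of positions where the two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

rec1 = ("red",  "small", "metal")
rec2 = ("red",  "large", "metal")
rec3 = ("blue", "large", "wood")

print(hamming(rec1, rec2))   # one attribute of three differs
print(hamming(rec1, rec3))   # all three attributes differ
```

Passing such a dissimilarity to a clustering routine that accepts precomputed or custom distances avoids treating category codes as if they had numeric magnitude.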
What’s the main caution with encoding?
Be careful mixing categorical and numerical data — they behave very differently in multivariate models.