Encoding and Multicollinearity Flashcards
(19 cards)
What is multicollinearity?
When two or more variables are linearly dependent or nearly so.
Give 3 examples of multicollinearity.
Variables summing to a constant (e.g., percentages)
One variable is a scaled version of another
Dummy variables for all categories (without dropping one)
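The third example above is the classic "dummy variable trap". A minimal sketch (the data and variable names are illustrative) showing that a full set of one-hot columns plus an intercept is rank-deficient, while dropping one dummy restores full rank:

```python
import numpy as np

# Hypothetical categorical column with three levels.
colors = ["red", "green", "blue", "green", "red", "blue"]
levels = sorted(set(colors))

# Full one-hot encoding: one column per category.
onehot = np.array([[1.0 if c == lvl else 0.0 for lvl in levels] for c in colors])

# Every row of the one-hot block sums to 1, so with an intercept column of
# ones the design matrix is linearly dependent: 4 columns but rank 3.
X_full = np.column_stack([np.ones(len(colors)), onehot])
print(np.linalg.matrix_rank(X_full))   # 3, not 4: rank-deficient

# Dropping one dummy restores full column rank.
X_drop = np.column_stack([np.ones(len(colors)), onehot[:, 1:]])
print(np.linalg.matrix_rank(X_drop))   # 3 = number of columns
```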
What are the consequences of multicollinearity?
Inflated variable weights
Models train on noise → overfitting
Non-unique solutions
Matrix inversion fails (breaks some algorithms)
How can scatterplots help detect multicollinearity?
They reveal when two variables look like near-linear transformations of each other, or when a pair appears highly correlated.
What does a correlation matrix reveal?
Pairs of variables that are highly correlated, hinting at redundancy.
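A minimal sketch of this check (synthetic data; variable names are illustrative): one column is built as a near-scaled copy of another, and the correlation matrix exposes the redundancy.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)   # nearly a scaled copy of x1
x3 = rng.normal(size=200)                         # independent variable

# Correlation matrix across the three columns.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))
# corr[0, 1] is close to 1 -> x1 and x2 are redundant;
# corr[0, 2] stays near 0 -> x3 carries separate information.
```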
What is the Variance Inflation Factor (VIF)?
A score that quantifies how much a variable’s variance is inflated due to multicollinearity.
How is VIF interpreted?
VIF = 1: No multicollinearity
1 < VIF ≤ 5: Moderate multicollinearity
VIF > 5: High multicollinearity (concerning)
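The VIF of a variable is 1 / (1 − R²), where R² comes from regressing that variable on all the others. A minimal sketch (synthetic data; the helper name is my own) computing it directly:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.2, size=300)   # collinear with x1
x3 = rng.normal(size=300)                   # independent
X = np.column_stack([x1, x2, x3])

print(round(vif(X, 0), 1))   # well above 5 -> x1 is collinear
print(round(vif(X, 2), 1))   # near 1 -> x3 is fine
```

The same quantity is available as `variance_inflation_factor` in `statsmodels` if that library is already a dependency.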
How can we fix multicollinearity by dropping a variable?
Remove a redundant variable if it’s clearly correlated with others.
Pro: Interpretability preserved. Con: The redundant variable may be hard to spot.
How does PCA help fix multicollinearity?
PCA transforms data into uncorrelated components.
Pro: No manual variable selection. Con: PCs may be hard to interpret.
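A minimal sketch of this decorrelation (PCA via eigendecomposition of the covariance matrix; the data are synthetic): two correlated inputs become component scores whose correlation is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=500)   # correlated pair
X = np.column_stack([x1, x2])

# PCA: center the data, eigendecompose the covariance matrix,
# then project onto the eigenvectors to get component scores.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
scores = Xc @ eigvecs

# The component scores are uncorrelated (off-diagonal ~ 0 up to float error).
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```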
What’s another method to fix multicollinearity?
Group related variables (e.g., take their mean or sum).
Pro: Retains information. Con: Doesn't remove the collinearity if the grouped variables sum to a constant.
What are the assumptions of the T² Test?
Multivariate normality, independence, homogeneity of variance, continuous vars, no outliers, no multicollinearity.
What are the assumptions of MANOVA?
Multivariate normality, independent observations, homogeneity of variance, continuous variables, no outliers.
What are the assumptions of PCA?
Linearity, continuous data, scaled data, no outliers; correlation helps.
What are the assumptions of Factor Analysis (FA)?
Linearity, continuous variables, scaled data, no outliers; works best when there is some correlation. Additionally assumes latent factors exist.
What are the assumptions of LDA?
Linearity, continuous vars, independence, homogeneity, no outliers or multicollinearity.
What are the assumptions of Clustering?
Scaled data, clusters exist, no outliers; shape, linkage, distance, and k matter.
What are the assumptions of Canonical Correlation Analysis (CCA)?
Linearity, continuous vars, homogeneity of variance, no outliers or multicollinearity.
Can we use encoded categorical data in clustering?
Yes, but we should use distance metrics appropriate for categorical data.
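A minimal sketch of one such metric (the records and helper name are illustrative): the Hamming distance, i.e. the fraction of attributes on which two categorical records disagree, instead of Euclidean distance on arbitrary integer codes.

```python
def hamming(a, b):
    """Fraction of positions where the two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

rec1 = ("red",  "small", "metal")
rec2 = ("red",  "large", "metal")
rec3 = ("blue", "large", "wood")

print(hamming(rec1, rec2))   # one attribute of three differs
print(hamming(rec1, rec3))   # all three attributes differ
```

Passing such a dissimilarity to a clustering routine that accepts precomputed or custom distances avoids treating category codes as if they had numeric magnitude.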
What’s the main caution with encoding?
Be careful mixing categorical and numerical data — they behave very differently in multivariate models.