Principal Component Analysis Flashcards

1
Q

Definition

A

Unsupervised learning method.

PCA aims to find a lower-dimensional representation of a dataset while retaining as much information as possible.
Equivalently, the PCs:
1. Give the best lower-dimensional (e.g., line) approximation of the dataset, minimizing the projection distances of the observations onto it.
2. Have maximized variance (the directions of largest variance).

Produces new numerical features called principal components (PCs) that maximize variance. Each PC explains a distinct portion of the dataset’s variability: the first explains the most, followed by the second, and so on. Often, the first few PCs explain most of the variability in the original variables, so they can be used in place of the original variables to reduce dimensionality and create a simpler model.

PCs are uncorrelated with one another, so they can help mitigate collinearity. PCA does not perform variable selection, however, because every original variable contributes to each PC.
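A minimal numeric sketch of the uncorrelatedness claim. The deck's own snippet is R, but this illustration uses Python/NumPy for concreteness; the collinear toy data are made up:

```python
# Sketch: PC scores are mutually uncorrelated, even when the
# original variables are strongly collinear (toy data below).
import numpy as np
from numpy.linalg import eigh

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Second column is deliberately collinear with the first.
X = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

# Standardize, then use eigenvectors of the correlation matrix as loadings.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
vals, vecs = eigh(np.corrcoef(Z, rowvar=False))
scores = Z @ vecs  # PC scores

# Off-diagonal correlations between PC scores are numerically zero.
c = np.corrcoef(scores, rowvar=False)
print(np.max(np.abs(c - np.diag(np.diag(c)))))  # essentially zero
```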

The original variables should be standardized (i.e., centered and scaled) prior to running PCA. This prevents variables with large variances from unjustly swaying the loadings.

1st PC loadings that are similar in magnitude indicate more correlated variables.

The mth PC is defined as:
z_m = phi_{1,m} * x_1 + phi_{2,m} * x_2 + … + phi_{g,m} * x_g

phi_{j,m} is the jth loading for the mth PC. For each PC, the loadings are found by maximizing its variance (a PC’s variance is also referred to as the variance it explains). Loadings that are further from 0 indicate the variables that are more informative to the PC.

Substituting the (standardized) variable values of the ith observation into the equation above produces the mth PC score for the ith observation, denoted as z_i,m.
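The score formula above can be sketched as follows. This is a Python/NumPy illustration with made-up data; the loadings are taken as eigenvectors of the standardized data's covariance matrix, one common way they are computed:

```python
# Sketch of z_{i,m} = sum_j phi_{j,m} * x_{i,j} on standardized values.
import numpy as np
from numpy.linalg import eigh

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # toy dataset: 100 observations, 3 variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first
vals, vecs = eigh(np.cov(Z, rowvar=False))
order = np.argsort(vals)[::-1]             # eigh returns ascending order
loadings = vecs[:, order]                  # column m-1 holds phi_{., m}

# 1st PC score of observation i=0: dot its standardized values with the loadings.
z_01 = Z[0] @ loadings[:, 0]
print(np.allclose(z_01, (Z @ loadings)[0, 0]))  # True: matches the score matrix
```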

2
Q

Notes

A

Add the chosen PCs as new features AND exclude the original variables used to construct them from any further analysis, to avoid duplicated information.

To use factors, binarize them so that every level gets its own dummy variable (fullRank = F). However, there is no agreement on whether factors can legitimately be analyzed using PCA.
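The every-level dummy encoding can be sketched like so. The deck's fullRank = F refers to R's caret; here pandas `get_dummies` is used as an equivalent illustration with a made-up factor:

```python
# Sketch: one dummy column per factor level (full-rank = FALSE style).
import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "S", "E"]})  # toy factor
dummies = pd.get_dummies(df["region"], prefix="region")  # one column per level
print(list(dummies.columns))  # ['region_E', 'region_N', 'region_S']
```

Note that keeping all levels (rather than dropping one) introduces linear dependence among the dummies, which is part of why applying PCA to factors is debated.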

For every PC, the loadings are unique up to a sign flip. The scores are also unique up to a sign flip. Since changing the signs does not affect the magnitude of the scores, doing so will not change the predictive value of a PC, but it will affect its interpretation.
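The sign-flip invariance is easy to verify numerically. A small Python/NumPy sketch with made-up data:

```python
# Demo: negating a PC's loadings negates its scores but leaves the
# variance it explains (its predictive content) unchanged.
import numpy as np
from numpy.linalg import eigh

rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 4))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardized toy data

vals, vecs = eigh(np.cov(Z, rowvar=False))
phi = vecs[:, -1]                  # 1st PC loadings (largest eigenvalue)

var_plus = np.var(Z @ phi)         # variance explained, one sign choice
var_minus = np.var(Z @ -phi)       # ...and with the flipped sign
print(np.isclose(var_plus, var_minus))  # True
```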

While the original variables should be scaled, scaling introduces a consistency issue because it relies on standard deviations computed from the data at hand. One alternative is to express the variables in a common unit of value. This problem is not unique to PCA; it arises whenever differing variable scales can have an unwanted impact.

3
Q

Variance Explained

A

Taken together, all of the PCs explain 100% of the dataset’s variance.

A scree plot visualizes the variance explained by each PC. It helps determine the smallest number of PCs required to explain a sufficient amount of variability. That could be indicated by the plot’s elbow - the point where the variance seems to level off.
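The quantities behind a scree plot can be sketched as follows (Python/NumPy illustration with made-up data; the PC variances are the eigenvalues of the standardized data's covariance matrix):

```python
# Sketch: proportion of variance explained (PVE) per PC sums to 100%,
# and the cumulative PVE is what a scree-plot elbow summarizes.
import numpy as np
from numpy.linalg import eigh

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PC variances, sorted in descending order.
vals = np.sort(eigh(np.cov(Z, rowvar=False))[0])[::-1]
pve = vals / vals.sum()           # proportion of variance explained per PC
print(round(pve.sum(), 10))       # 1.0 -- all PCs together explain everything
print(np.cumsum(pve))             # cumulative PVE; look for where it levels off
```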

4
Q

Biplot

A

Visualizes the loadings and the scores simultaneously.

The horizontal axes are for the 1st PC; the vertical axes are for the 2nd PC.

The left and bottom axes are for the scores; the top and right axes are for the loadings.

biplot(pca1) # visualize the loadings and scores

By default the plot is scaled to improve readability; add scale = 0 to turn the scaling off.

5
Q

Example

A

The dataset has small dimensionality. While there are many requests within the data, as our assistant notes, there are only 9 variables. PCA is effective when there is high dimensionality (many variables) which can make univariate and bivariate data exploration and visualization techniques less effective. PCA is used to summarize high-dimensional data into fewer composite variables while retaining as much information as possible. As we do not have high dimensionality, any information loss from feature transformation will not be outweighed by improvements in model performance or capture of latent variables.

The dataset includes a significant number of factor variables, which will require conversion to numeric values prior to applying PCA. PCA attempts to maximize the variance, or spread, in the data distribution by linearly combining the original variables.

6
Q

Advantages and Disadvantages

A

Advantages:
–Can be used to address collinearity

Disadvantages:
–Can be unintuitive depending on chosen variables and loadings
