Dimensionality Reduction Flashcards
(15 cards)
What is the aim of Dimensionality Reduction?
To remove noise from the data, focus on the features that actually matter, and make models more efficient.
What are the two types of Dimensionality Reduction?
- Feature Selection
- Feature Extraction
What is an example of a filter method?
Variance Thresholding
How does variance thresholding work?
Calculate the variance of each feature, then drop features whose variance falls below some threshold. The idea is that low-variance features contain less information.
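The filter described in this card can be sketched in a few lines of NumPy. The function name `variance_threshold` and the toy matrix are illustrative assumptions, not part of the original card.

```python
import numpy as np

def variance_threshold(X, threshold=0.0):
    """Return the filtered matrix and the indices of the columns that were kept."""
    variances = X.var(axis=0)                      # population variance of each feature
    keep = np.flatnonzero(variances > threshold)   # features at or below the threshold are dropped
    return X[:, keep], keep

# Toy data (assumed): column 1 is constant, column 2 is nearly constant.
X = np.array([
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 10.1],
    [3.0, 0.0,  9.9],
    [4.0, 0.0, 10.0],
])
X_reduced, kept = variance_threshold(X, threshold=0.05)
print(kept)  # [0]: only the first column survives
```

Variance depends on the scale of each feature, so a single threshold is usually only meaningful when features share comparable units or have been rescaled first. scikit-learn offers a comparable transformer, `sklearn.feature_selection.VarianceThreshold`.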
What are two examples of wrapper methods?
- Forward Search
- Recursive Feature Elimination
What is a wrapper method?
A method that searches for an optimal feature subset tailored to a particular algorithm.
What are the main steps for a forward search?
- Create n models with one feature each
- Select the best one
- Create n-1 models by adding one of the remaining features to the selected feature
- Select the best one
- Proceed until you have chosen m features
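The forward search steps above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: "best" is taken to mean the highest R^2 on a held-out validation split, the model is ordinary least squares, and the names `holdout_r2` and `forward_search` as well as the synthetic data are hypothetical.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_val, y_val):
    """Fit ordinary least squares on the training split, return R^2 on the validation split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])  # add intercept column
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    return 1.0 - (residual @ residual) / np.sum((y_val - y_val.mean()) ** 2)

def forward_search(X_train, y_train, X_val, y_val, m):
    """Greedily add the feature that gives the best validation score until m are chosen."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    while len(selected) < m and remaining:
        # One candidate model per remaining feature, each extending the current subset.
        scores = {
            j: holdout_r2(X_train[:, selected + [j]], y_train,
                          X_val[:, selected + [j]], y_val)
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumed): only features 2 and 5 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 2] + 2.0 * X[:, 5] + rng.normal(scale=0.1, size=300)
chosen = forward_search(X[:200], y[:200], X[200:], y[200:], m=2)
print(sorted(chosen))  # the two informative features
```

Scoring on a held-out split rather than the training data matters here, because a training score can only go up as features are added.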
What are the main steps for Recursive Feature Elimination?
- Start with all n features and create n models, each with one feature removed (n-1 features each)
- Select the best one
- Create n-1 models by removing one more feature from the selected model (n-2 features each)
- Select the best one
- Proceed until you have removed m features
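The elimination steps above can be sketched as the mirror image of a forward search. As assumptions for illustration, candidates are scored by validation R^2 of an ordinary least squares fit, and the names `holdout_r2` and `backward_elimination` and the synthetic data are hypothetical. Note that scikit-learn's `RFE` class works somewhat differently: it ranks features by the coefficients or importances of a fitted estimator instead of refitting one model per candidate removal.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_val, y_val):
    """Fit ordinary least squares on the training split, return R^2 on the validation split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])  # add intercept column
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    return 1.0 - (residual @ residual) / np.sum((y_val - y_val.mean()) ** 2)

def backward_elimination(X_train, y_train, X_val, y_val, m):
    """Remove m features, one per round, always keeping the best-scoring reduced model."""
    remaining = list(range(X_train.shape[1]))
    for _ in range(m):
        if len(remaining) <= 1:
            break
        scores = {}
        for j in remaining:
            # Candidate model: the current subset with feature j left out.
            subset = [k for k in remaining if k != j]
            scores[j] = holdout_r2(X_train[:, subset], y_train, X_val[:, subset], y_val)
        drop = max(scores, key=scores.get)  # leaving this feature out hurt the least
        remaining.remove(drop)
    return remaining

# Synthetic data (assumed): only features 0 and 3 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)
kept = backward_elimination(X[:200], y[:200], X[200:], y[200:], m=3)
print(kept)  # indices of the features that survived
```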
What are three techniques for splitting data for a Decision Tree?
- Gini impurity
- Information gain
- Variance reduction
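The three criteria above all follow the same pattern: measure the impurity of the parent node, subtract the size-weighted impurity of the children, and prefer the split with the largest drop. A minimal sketch, with the helper names `gini`, `entropy`, and `split_gain` and the toy arrays chosen for illustration:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in entropy after a split."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Classification: a split that separates the two classes perfectly.
labels = np.array([0, 0, 0, 1, 1, 1])
gini_gain = split_gain(labels, labels[:3], labels[3:], gini)
info_gain = split_gain(labels, labels[:3], labels[3:], entropy)

# Regression: variance reduction uses the variance of the target as the impurity.
targets = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
var_reduction = split_gain(targets, targets[:3], targets[3:], np.var)

print(gini_gain, info_gain, var_reduction)
```

Gini impurity and information gain apply to classification targets, while variance reduction is the usual choice for regression targets.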
What are the two types of Feature Extraction?
Linear Methods and Non-Linear Methods
How does Principal Component Analysis (PCA) work?
It finds an orthogonal coordinate transformation such that every new coordinate is “maximally informative”: each successive axis captures as much of the remaining variance in the data as possible.
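One common way to compute this transformation is a singular value decomposition of the centred data. The sketch below is a minimal illustration under that assumption; the function name `pca` and the synthetic data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    scores = X_centered @ components.T   # coordinates in the new axes
    return scores, components, explained_variance

# Synthetic data (assumed): 3-D points lying close to the line through (1, 2, 0).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 2.0, 0.0]]) + rng.normal(scale=0.1, size=(200, 3))

scores, components, explained_variance = pca(X, n_components=2)
print(components[0])          # roughly parallel to (1, 2, 0) / sqrt(5), up to sign
print(explained_variance)     # first axis carries almost all of the variance
```

The sign of each component is arbitrary, so only the direction of an axis is meaningful.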
What are t-SNE and UMAP?
They are common methods for visualising high-dimensional datasets, but they are suited to data visualisation only.
What is the process for t-SNE?
- Take the distribution of distances between the N points in the dataset. Call that D.
- Scatter N points in 2 or 3 dimensions, randomly.
- Move those N points around until the distribution of distances between them resembles D.
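The process above can be sketched in NumPy. More precisely, t-SNE turns pairwise distances into similarity distributions (Gaussian in the original space, Student-t in the embedding) and moves the points to reduce the KL divergence between the two. This sketch makes loud simplifications that are assumptions of the example, not the full algorithm: a single fixed bandwidth `sigma` replaces the per-point perplexity search, and plain gradient descent replaces momentum and early exaggeration. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows."""
    s = np.sum(Z ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * Z @ Z.T, 0.0)

def high_dim_similarities(X, sigma):
    """Gaussian similarities in the original space (the 'D' of the card)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)   # conditional p(j | i)
    P = (P + P.T) / (2.0 * len(X))      # symmetrise into a joint distribution
    return np.maximum(P, 1e-12)

def low_dim_similarities(Y):
    """Student-t similarities in the embedding, plus the unnormalised weights."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return np.maximum(W / W.sum(), 1e-12), W

def tsne_sketch(X, n_components=2, n_iter=500, lr=1.0, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    P = high_dim_similarities(X, sigma)
    Y = rng.normal(scale=1e-2, size=(len(X), n_components))  # random scatter
    kl_history = []
    for _ in range(n_iter):
        Q, W = low_dim_similarities(Y)
        kl_history.append(np.sum(P * np.log(P / Q)))
        # Gradient of the KL divergence: 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        G = (P - Q) * W
        grad = 4.0 * (np.diag(G.sum(axis=1)) - G) @ Y
        Y -= lr * grad                                        # move the points
    return Y, kl_history

# Synthetic data (assumed): two well-separated clusters in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
               rng.normal(8.0, 1.0, size=(20, 5))])
Y, kl_history = tsne_sketch(X)
print(kl_history[0], kl_history[-1])  # the mismatch shrinks as the points move
```

For real work a maintained implementation such as `sklearn.manifold.TSNE` is the safer choice; the sketch only makes the three steps concrete.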
What is the advantage of UMAP over t-SNE?
The approach is only slightly different, but UMAP runs faster, uses less memory, and has no problem embedding into more than 3 dimensions. It can also preserve both local and global structure.
What are the problems with t-SNE and UMAP?
- Both depend a lot on their hyperparameters.
- Cluster sizes and distances between clusters mean nothing
- x and y axes are basically impossible to interpret