Dimensionality Reduction Flashcards

Question 1

Q

What is the aim of Dimensionality Reduction?

Answer

A

To reduce the number of features while keeping the information that matters. Removing noisy or redundant features lets the model focus on the important ones and improves model efficiency.

Question 2

Q

What are the two types of Dimensionality Reduction?

Answer

A

Feature Selection
Feature Extraction

Question 3

Q

What is an example of a filter method?

Answer

A

Variance Thresholding

Question 4

Q

How does variance thresholding work?

Answer

A

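Calculate the variance of each feature, then drop the features whose variance falls below a chosen threshold. The idea is that low-variance features contain less information. Variance depends on scale, so the features should be on comparable scales before they are compared.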
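
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `variance_threshold`, the toy data, and the threshold of 0.1 are all assumptions made for the example.

```python
from statistics import pvariance

def variance_threshold(rows, threshold):
    """Return the indices of columns whose variance is at least `threshold`."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return [i for i, col in enumerate(columns) if pvariance(col) >= threshold]

# Toy data (hypothetical): column 0 is constant, columns 1 and 2 vary.
rows = [
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
]

kept = variance_threshold(rows, threshold=0.1)
reduced = [[row[i] for i in kept] for row in rows]
```

Here the constant column is dropped and the other two are kept. Column 2 has a far larger variance than column 1 only because of its units, which is why scaling matters before choosing a threshold.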

Question 5

Q

What are two examples of wrapper methods?

Answer

A

Forward Search
Recursive Feature Elimination

Question 6

Q

What is a wrapper method?

Answer

A

A method that searches for the best feature subset for a particular algorithm by training and evaluating that algorithm on candidate subsets.

Question 7

Q

What are the main steps for a forward search?

Answer

A

Create n models, each using a single feature.
Select the best one.
Create n-1 models, each adding one of the remaining features to the selected feature.
Select the best one.
Proceed until you have chosen m features.
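
The steps above might look like the following sketch. It assumes NumPy, a linear least-squares model, and training error as the score; all of these are choices made for illustration, as are the names `sse` and `forward_search`. In practice the score would normally come from a validation set or cross-validation.

```python
import numpy as np

def sse(X, y, features):
    """Sum of squared errors of a least-squares fit on the chosen columns."""
    A = np.column_stack([np.ones(len(y)), X[:, features]])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_search(X, y, m):
    """Greedily add the feature that lowers the error most, m times."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < m:
        # One candidate model per remaining feature.
        best = min(remaining, key=lambda j: sse(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (hypothetical): only columns 2 and 4 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

chosen = forward_search(X, y, m=2)
```

On this data the search should pick column 2 first (the stronger effect) and column 4 second.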

Question 8

Q

What are the main steps for a Recursive Feature Elimination?

Answer

A

Create n models, each using n-1 features (one feature removed).
Select the best one.
Create n-1 models, each removing one more feature.
Select the best one.
Proceed until you have removed m features.
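
A minimal sketch of the elimination loop described above, again assuming NumPy, a least-squares model, and training error as the score (the names `sse` and `backward_elimination` are hypothetical). Note that scikit-learn's `RFE` class works somewhat differently: it fits one model per round and drops the features with the smallest coefficients or importances instead of refitting once per candidate.

```python
import numpy as np

def sse(X, y, features):
    """Sum of squared errors of a least-squares fit on the chosen columns."""
    A = np.column_stack([np.ones(len(y)), X[:, features]])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def backward_elimination(X, y, m):
    """Remove m features, one per round, keeping the best reduced model."""
    kept = list(range(X.shape[1]))
    removed = []
    for _ in range(m):
        # One candidate model per kept feature: drop j and fit on the rest.
        drop = min(kept, key=lambda j: sse(X, y, [k for k in kept if k != j]))
        kept.remove(drop)
        removed.append(drop)
    return kept, removed

# Synthetic data (hypothetical): only columns 2 and 4 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

kept, removed = backward_elimination(X, y, m=3)
```

With three of the five features removed, the two informative columns should be the ones left in `kept`.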

Question 9

Q

What are three criteria for splitting data in a Decision Tree?

Answer

A

Gini impurity
Information gain (reduction in entropy)
Variance reduction (for regression trees)
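
All three criteria follow the same pattern: measure the impurity of the parent node, subtract the size-weighted impurity of the children, and prefer the split with the largest reduction. The sketch below illustrates this in plain Python; the function names and toy splits are assumptions made for the example.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in entropy."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def variance(values):
    """Population variance, used for regression targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# A perfect split of a balanced two-class node (hypothetical data).
parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]
gini_gain = split_gain(parent, left, right, gini)
info_gain = split_gain(parent, left, right, entropy)

# A perfect split of a regression node (hypothetical data).
var_reduction = split_gain([1.0, 1.0, 3.0, 3.0], [1.0, 1.0], [3.0, 3.0], variance)
```

For these perfect splits the children are pure, so each gain equals the parent's impurity: 0.5 for Gini, 1 bit for entropy, and 1.0 for variance.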

Question 10

Q

What are the two types of Feature Extraction?

Answer

A

Linear Methods and Non-Linear Methods

Question 11

Q

How does Principal Component Analysis (PCA) work?

Answer

A

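It finds an orthogonal coordinate transformation such that every new coordinate is “maximally informative”: the first principal component points along the direction of greatest variance, and each later component captures the most remaining variance while staying orthogonal to the earlier ones.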
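
One common way to compute this transformation is a singular value decomposition of the centred data. The sketch below assumes NumPy; the function name `pca` and the synthetic data are hypothetical and chosen only to make the example self-contained.

```python
import numpy as np

def pca(X, k):
    """Project X onto its first k principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # orthonormal directions, one per row
    explained_variance = S[:k] ** 2 / (len(X) - 1)
    return Xc @ components.T, components, explained_variance

# Synthetic data (hypothetical): almost all variation lies along (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=300)
X = np.column_stack([
    t,
    2.0 * t + 0.05 * rng.normal(size=300),
    0.05 * rng.normal(size=300),
])

Z, components, explained_variance = pca(X, k=2)
```

The rows of `components` are orthonormal, the variances come out in decreasing order, and the first component should line up closely with the direction (1, 2, 0) along which the data was generated.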

Question 12

Q

What are t-SNE and UMAP?

Answer

A

They are non-linear methods commonly used to visualise high-dimensional datasets, but they are suited to data visualisation only.

Question 13

Q

What is the process for t-SNE?

Answer

A

Take the distribution of pairwise distances between the N points in the dataset. Call that D.
Scatter N points in 2 or 3 dimensions, randomly.
Move those N points around until the distribution of distances between them resembles D.

Question 14

Q

What is the advantage of UMAP over t-SNE?

Answer

A

It is only slightly different in approach, but it runs faster, uses less memory, and has no problem embedding into more than 3 dimensions. It also tends to preserve more of the global structure alongside the local structure.

Question 15

Q

What are the problems with t-SNE and UMAP?

Answer

A

Both depend heavily on their hyperparameters.
Cluster sizes and distances between clusters mean nothing.
The x and y axes are basically impossible to interpret.

(15 cards)