Data Preprocessing Flashcards

1
Q

What is Data Preprocessing?

A

A step in data science that transforms raw data into a format that computers and machine learning algorithms can understand and analyze.

2
Q

Why is Data Preprocessing important?

A

Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.

3
Q

What are the major tasks involved in Data Preprocessing?

A

Data cleaning, data integration, data reduction, and data transformation.

4
Q

What is Data Cleaning?

A

The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.

5
Q

What are the common issues in real-world data?

A

Missing values, noise, errors, duplicate records, and inconsistent data formats.

6
Q

What are some methods to handle missing data?

A

Removing affected records, interpolating values, or replacing missing entries with the mean or the most common value.
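A minimal pandas sketch of these three options (Python/pandas and the toy columns are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 40, None],
    "city": ["Oslo", "Oslo", None, "Bergen", "Oslo"],
})

dropped = df.dropna()                                  # removal: discard rows with gaps
interpolated = df.assign(age=df["age"].interpolate())  # interpolation from neighboring values
filled = df.fillna({
    "age": df["age"].mean(),            # numeric column: replace with the mean
    "city": df["city"].mode().iloc[0],  # categorical column: most common value
})
```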

7
Q

What is noisy data?

A

Random errors or variance in a dataset that can affect model performance.

8
Q

What are common techniques for handling noisy data?

A

Binning, clustering, and combined computer-human inspection.
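For binning, a small pandas sketch of smoothing by bin means (the toy values and the choice of equal-width bins are assumptions):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-width bins, then replace each value with its bin's mean
bins = pd.cut(values, bins=3)
smoothed = values.groupby(bins, observed=True).transform("mean")
```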

9
Q

What is Data Integration?

A

The process of combining data from multiple sources to create a unified dataset.

10
Q

What challenges arise in Data Integration?

A

Schema mismatches, redundancy, and inconsistencies across different sources.

11
Q

What is Data Reduction?

A

Reducing the amount of data while maintaining its integrity to improve efficiency.

12
Q

What are the types of Data Reduction?

A

Numerosity reduction and dimensionality reduction.

13
Q

What is Numerosity Reduction?

A

Reducing the number of data objects (rows) in a dataset.

14
Q

What is Dimensionality Reduction?

A

Reducing the number of features (columns) while preserving data structure.

15
Q

What are common Numerosity Reduction methods?

A

Random sampling, stratified sampling, and random over/undersampling.

16
Q

What is Random Sampling?

A

Randomly selecting a subset of data points to reduce computational cost.

17
Q

What is Stratified Sampling?

A

Selecting a sample that maintains the original proportions of different groups in the dataset.
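A quick scikit-learn sketch using train_test_split's stratify option (the toy imbalanced labels are assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)   # toy feature matrix
y = np.array([0] * 15 + [1] * 5)   # imbalanced labels: 75% / 25%

# The 50% sample preserves the original 75/25 class proportions
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0
)
```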

18
Q

What is Random Over/Undersampling?

A

Altering the sample proportions to balance class distributions for classification tasks.
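A minimal sketch of random oversampling with sklearn.utils.resample (toy data; the imbalanced-learn library offers dedicated samplers as well):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})  # imbalanced: 8 vs 2

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversampling: duplicate minority rows (with replacement) until classes match;
# undersampling would instead resample the majority down to len(minority)
balanced = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=0),
])
```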

19
Q

What are common Dimensionality Reduction methods?

A

Linear Regression, Decision Trees, Random Forest, PCA, Functional Data Analysis (FDA).

20
Q

How does Linear Regression help in Dimensionality Reduction?

A

By identifying independent variables with weak predictive power and eliminating them.

21
Q

What is PCA (Principal Component Analysis)?

A

A technique that transforms data into a new set of uncorrelated components ordered by the variance they capture, so keeping only the top components reduces dimensionality with little information loss.
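A minimal scikit-learn sketch (random data stands in for real features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # 100 samples, 5 features

pca = PCA(n_components=2)             # keep the two highest-variance components
X_reduced = pca.fit_transform(X)      # shape (100, 2)

print(pca.explained_variance_ratio_)  # share of variance each component captures
```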

22
Q

What is the difference between Supervised and Unsupervised Dimensionality Reduction?

A

Supervised methods use the target variable to decide which features to keep, aiming to improve predictions; unsupervised methods reduce dimensionality from the data's structure alone, without reference to a target.

23
Q

What is Data Transformation?

A

Modifying the dataset to make it suitable for analysis and to improve model performance.

24
Q

What are the key types of Data Transformation?

A

Normalization and Standardization.

25
Q

What is Normalization?

A

Scaling data so that all values fall within a specific range, usually [0,1].

26
Q

What is Standardization?

A

Transforming data to have a mean of 0 and standard deviation of 1.

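A minimal scikit-learn sketch of both transformations (the library choice and toy matrix are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization: each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: each column to mean 0, std 1
```

In practice the scaler is fit on the training data only and then applied to test data, to avoid information leakage.
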
27
Q

When should Normalization be used?

A

When distance-based algorithms like K-Means and KNN are used.

28
Q

When should Standardization be used?

A

When maintaining data variance is important for analysis.

29
Q

What is Clustering in Data Preprocessing?

A

A method to identify and group similar data points to detect outliers or patterns.

30
Q

What are the types of Clustering?

A

Partitioning methods (K-Means), Hierarchical methods, and Density-based methods (DBSCAN).

31
Q

What is K-Means Clustering?

A

An iterative clustering method that partitions data into K groups based on distance to centroids.

32
Q

What is the Elbow Method?

A

A technique for choosing the number of clusters in K-Means: plot inertia against K and pick the point where adding clusters stops reducing inertia substantially (the "elbow").

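A short sketch of the idea with scikit-learn (synthetic blobs; printing inertia instead of plotting it is a simplification):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in (0, 5, 10)])

# Inertia always decreases as K grows; look for the K where the drop levels off
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```
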
33
Q

What is DBSCAN Clustering?

A

A density-based clustering method that identifies clusters based on regions of high point density.

34
Q

What is the Silhouette Score?

A

A measure of how well a data point fits within its assigned cluster, ranging from -1 to 1.

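A minimal scikit-learn usage sketch (make_blobs and the cluster count are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean score over all points; values near 1 indicate well-separated clusters
print(silhouette_score(X, labels))
```
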
35
Q

What is Inertia in Clustering?

A

The sum of squared distances between data points and their respective cluster centroids.

36
Q

What is Hierarchical Clustering?

A

A method that builds a hierarchy of clusters, merging or splitting them step by step.

37
Q

What is the difference between Agglomerative and Divisive Clustering?

A

Agglomerative starts with individual points and merges clusters, while Divisive starts with one cluster and splits it.

38
Q

What is an Outlier?

A

A data point significantly different from the rest of the dataset.

39
Q

What are common Outlier Detection methods?

A

Statistical methods, clustering-based methods, and density-based methods.

40
Q

What is an example of Outlier Handling using Clustering?

A

Using DBSCAN to separate noise points from valid clusters.

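A sketch of that approach in scikit-learn (the eps/min_samples values and synthetic data are assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

outliers = X[labels == -1]    # DBSCAN labels noise points -1
clustered = X[labels != -1]   # points assigned to a dense cluster
```
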
41
Q

What is Feature Engineering?

A

Creating new features from raw data to improve predictive performance.

42
Q

What is Feature Selection?

A

Choosing the most relevant features to improve model accuracy and reduce complexity.

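One common way to do this is univariate selection; a scikit-learn sketch (the dataset and k=2 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest univariate relationship to the label
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the retained features
```
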
43
Q

Why is Data Preprocessing crucial for machine learning?

A

It ensures high-quality input data, improving model accuracy and efficiency.

44
Q

What happens if data preprocessing is not done correctly?

A

Models may learn misleading patterns, perform poorly, or be biased.