Data Preprocessing Flashcards
What is Data Preprocessing?
A step in data science that transforms raw data into a format that computers and machine learning models can understand and analyze.
Why is Data Preprocessing important?
Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.
What are the major tasks involved in Data Preprocessing?
Data cleaning, data integration, data reduction, and data transformation.
What is Data Cleaning?
The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.
What are the common issues in real-world data?
Missing values, noise, errors, duplicate records, and inconsistent data formats.
What are some methods to handle missing data?
Removing the affected rows, interpolating from neighboring values, or replacing missing entries with the mean (numeric data) or the most common value (categorical data).
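A minimal pandas sketch of the three approaches (the DataFrame and its columns are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()  # removal: discard any row containing a missing value

interpolated = df.assign(age=df["age"].interpolate())  # fill numeric gaps from neighbors

filled = df.assign(
    age=df["age"].fillna(df["age"].mean()),        # replace with the mean
    city=df["city"].fillna(df["city"].mode()[0]),  # replace with the most common value
)
```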
What is noisy data?
Random errors or variance in a dataset that can affect model performance.
What are common techniques for handling noisy data?
Binning, clustering, and combined computer-human inspection.
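A small sketch of binning for noise smoothing, using made-up price values: equal-frequency bins are formed and each value is replaced by its bin's mean.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3, labels=False)          # equal-frequency binning into 3 bins
smoothed = prices.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```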
What is Data Integration?
The process of combining data from multiple sources to create a unified dataset.
What challenges arise in Data Integration?
Schema mismatches, redundancy, and inconsistencies across different sources.
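A minimal pandas sketch of integrating two hypothetical sources, resolving a schema mismatch (differing key names) and removing redundant records:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [30, 12, 55]})

# Resolve the schema mismatch (cust_id vs. customer_id) before merging
orders = orders.rename(columns={"customer_id": "cust_id"})

unified = customers.merge(orders, on="cust_id", how="left")  # combine the two sources
unified = unified.drop_duplicates()  # drop redundant records from overlapping sources
```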
What is Data Reduction?
Reducing the volume of data while preserving its analytical integrity, to improve storage and processing efficiency.
What are the types of Data Reduction?
Numerosity reduction and dimensionality reduction.
What is Numerosity Reduction?
Reducing the number of data objects (rows) in a dataset.
What is Dimensionality Reduction?
Reducing the number of features (columns) while preserving data structure.
What are common Numerosity Reduction methods?
Random sampling, stratified sampling, and random over/undersampling.
What is Random Sampling?
Randomly selecting a subset of data points to reduce computational cost.
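For example, a one-line pandas sketch (the frame and the 10% fraction are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})
subset = df.sample(frac=0.1, random_state=42)  # keep a random 10% of the rows
```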
What is Stratified Sampling?
Selecting a sample that maintains the original proportions of different groups in the dataset.
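A sketch using scikit-learn's train_test_split with its stratify option; the 80/20 label split is made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# Keep 30% of the rows while preserving the original 80/20 label proportions
sample, _ = train_test_split(df, train_size=0.3, stratify=df["label"], random_state=0)
print(sample["label"].value_counts(normalize=True))  # ~0.8 for class 0, ~0.2 for class 1
```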
What is Random Over/Undersampling?
Altering the sample proportions to balance class distributions for classification tasks.
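A pandas-only sketch of both directions on a hypothetical 90/10 class split (libraries such as imbalanced-learn offer the same idea ready-made):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})
majority, minority = df[df["label"] == 0], df[df["label"] == 1]

# Undersampling: shrink the majority class to the minority's size (50/50 result)
undersampled = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Oversampling: repeat minority rows (with replacement) up to the majority's size
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
```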
What are common Dimensionality Reduction methods?
Model-based feature selection (e.g., via Linear Regression coefficients or Decision Tree / Random Forest feature importances), PCA, and Functional Data Analysis (FDA).
How does Linear Regression help in Dimensionality Reduction?
By identifying independent variables with weak predictive power and eliminating them.
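One way to sketch this on synthetic data, using a hypothetical cutoff on standardized coefficients (real workflows often rely on p-values or cross-validation instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                           # 4 candidate features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)  # features 2 and 3 are pure noise

X_std = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale
coefs = LinearRegression().fit(X_std, y).coef_

keep = np.abs(coefs) > 0.1  # illustrative threshold for "weak" predictors
X_reduced = X[:, keep]
print(keep)                 # expect roughly [True, True, False, False]
```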
What is PCA (Principal Component Analysis)?
A technique that projects data onto new, uncorrelated components ordered by the variance they capture, so that the first few components retain most of the information at lower dimensionality.
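A minimal scikit-learn sketch on synthetic data, keeping the two highest-variance components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X)      # shape (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component captures
```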
What is the difference between Supervised and Unsupervised Dimensionality Reduction?
Supervised methods use the target variable to guide the reduction toward better predictions, while unsupervised methods reduce the data based on its structure alone, without reference to any prediction target.
What is Data Transformation?
Modifying the dataset to make it suitable for analysis and to improve model performance.
What are the key types of Data Transformation?
Normalization and Standardization.
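Both are illustrated below with scikit-learn on a tiny made-up array; normalization here means min-max scaling into [0, 1], and standardization means rescaling to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

normalized = MinMaxScaler().fit_transform(X)      # min-max scaling into [0, 1]
standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
```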