Data Preprocessing Flashcards
What is Data Preprocessing?
A step in data science that transforms raw data into a format that computers and machine learning models can understand and analyze.
Why is Data Preprocessing important?
Real-world data is often dirty, incomplete, noisy, or inconsistent, and preprocessing helps clean and prepare it for analysis.
What are the major tasks involved in Data Preprocessing?
Data cleaning, data integration, data reduction, and data transformation.
What is Data Cleaning?
The process of handling missing values, smoothing noisy data, identifying/removing outliers, and resolving inconsistencies.
What are the common issues in real-world data?
Missing values, noise, errors, duplicate records, and inconsistent data formats.
What are some methods to handle missing data?
Removing the affected rows, interpolating from neighboring values, or replacing missing entries with the mean (numeric data) or the most common value (categorical data).
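A minimal pandas sketch of the three approaches (the DataFrame and its columns are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["NY", "LA", None, "NY"]})

dropped = df.dropna()  # removal: discard any row containing a missing value

interpolated = df.assign(age=df["age"].interpolate())  # fill numeric gaps from neighbors

filled = df.assign(
    age=df["age"].fillna(df["age"].mean()),        # replace with the mean
    city=df["city"].fillna(df["city"].mode()[0]),  # replace with the most common value
)
```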
What is noisy data?
Random errors or variance in a dataset that can affect model performance.
What are common techniques for handling noisy data?
Binning, clustering, and combined computer-human inspection.
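A small sketch of binning for noise smoothing, using made-up price values: equal-frequency bins are formed and each value is replaced by its bin's mean.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(prices, q=3, labels=False)          # equal-frequency binning into 3 bins
smoothed = prices.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```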
What is Data Integration?
The process of combining data from multiple sources to create a unified dataset.
What challenges arise in Data Integration?
Schema mismatches, redundancy, and inconsistencies across different sources.
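A minimal pandas sketch of integrating two hypothetical sources, resolving a schema mismatch (differing key names) and removing redundant records:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [30, 12, 55]})

# Resolve the schema mismatch (cust_id vs. customer_id) before merging
orders = orders.rename(columns={"customer_id": "cust_id"})

unified = customers.merge(orders, on="cust_id", how="left")  # combine the two sources
unified = unified.drop_duplicates()  # drop redundant records from overlapping sources
```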
What is Data Reduction?
Reducing the volume of data while preserving its analytical integrity, to improve storage and processing efficiency.
What are the types of Data Reduction?
Numerosity reduction and dimensionality reduction.
What is Numerosity Reduction?
Reducing the number of data objects (rows) in a dataset.
What is Dimensionality Reduction?
Reducing the number of features (columns) while preserving data structure.
What are common Numerosity Reduction methods?
Random sampling, stratified sampling, and random over/undersampling.
What is Random Sampling?
Randomly selecting a subset of data points to reduce computational cost.
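For example, a one-line pandas sketch (the frame and the 10% fraction are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})
subset = df.sample(frac=0.1, random_state=42)  # keep a random 10% of the rows
```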
What is Stratified Sampling?
Selecting a sample that maintains the original proportions of different groups in the dataset.
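A sketch using scikit-learn's train_test_split with its stratify option; the 80/20 label split is made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# Keep 30% of the rows while preserving the original 80/20 label proportions
sample, _ = train_test_split(df, train_size=0.3, stratify=df["label"], random_state=0)
print(sample["label"].value_counts(normalize=True))  # ~0.8 for class 0, ~0.2 for class 1
```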
What is Random Over/Undersampling?
Altering the sample proportions to balance class distributions for classification tasks.
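A pandas-only sketch of both directions on a hypothetical 90/10 class split (libraries such as imbalanced-learn offer the same idea ready-made):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})
majority, minority = df[df["label"] == 0], df[df["label"] == 1]

# Undersampling: shrink the majority class to the minority's size (50/50 result)
undersampled = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Oversampling: repeat minority rows (with replacement) up to the majority's size
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
```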
What are common Dimensionality Reduction methods?
Model-based feature selection (e.g., via Linear Regression coefficients or Decision Tree / Random Forest feature importances), PCA, and Functional Data Analysis (FDA).
How does Linear Regression help in Dimensionality Reduction?
By identifying independent variables with weak predictive power and eliminating them.
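One way to sketch this on synthetic data, using a hypothetical cutoff on standardized coefficients (real workflows often rely on p-values or cross-validation instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                           # 4 candidate features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)  # features 2 and 3 are pure noise

X_std = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale
coefs = LinearRegression().fit(X_std, y).coef_

keep = np.abs(coefs) > 0.1  # illustrative threshold for "weak" predictors
X_reduced = X[:, keep]
print(keep)                 # expect roughly [True, True, False, False]
```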
What is PCA (Principal Component Analysis)?
A technique that projects data onto new, uncorrelated components ordered by the variance they capture, so that the first few components retain most of the information at lower dimensionality.
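A minimal scikit-learn sketch on synthetic data, keeping the two highest-variance components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X)      # shape (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component captures
```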
What is the difference between Supervised and Unsupervised Dimensionality Reduction?
Supervised methods use the target variable to guide the reduction toward better predictions, while unsupervised methods reduce the data based on its structure alone, without reference to any prediction target.
What is Data Transformation?
Modifying the dataset to make it suitable for analysis and to improve model performance.
What are the key types of Data Transformation?
Normalization and Standardization.
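Both are illustrated below with scikit-learn on a tiny made-up array; normalization here means min-max scaling into [0, 1], and standardization means rescaling to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

normalized = MinMaxScaler().fit_transform(X)      # min-max scaling into [0, 1]
standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance
```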