Data Preprocessing Flashcards
7 DATA PREPROCESSING TASKS / METHODS
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature Subset Selection
- Feature Creation
- Discretization and Binarization
- Attribute Transformation
combining two or more attributes into a single attribute.
Aggregation
3 PURPOSES OF AGGREGATION
- Data Reduction
- Change of Scale
- More Stable Data
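A minimal sketch of aggregation and its three purposes, assuming made-up daily sales records (the record layout, the values, and the helper name aggregate_by_month are all hypothetical). Daily values are combined into one value per month, which gives fewer records, a coarser scale, and less variability.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical daily sales records: (month, day, amount). Values are made up.
daily_sales = [
    ("2024-01", 1, 120.0), ("2024-01", 2, 80.0), ("2024-01", 3, 100.0),
    ("2024-02", 1, 90.0), ("2024-02", 2, 130.0), ("2024-02", 3, 110.0),
]

def aggregate_by_month(records):
    """Combine daily records into one total and one mean per month."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, _day, amount in records:
        totals[month] += amount
        counts[month] += 1
    means = {month: totals[month] / counts[month] for month in totals}
    return dict(totals), means

monthly_totals, monthly_means = aggregate_by_month(daily_sales)

# Data reduction: 6 daily records become 2 monthly records.
# Change of scale: the unit of analysis moves from days to months.
# More stable data: monthly means vary less than the raw daily amounts.
daily_spread = pstdev([amount for _m, _d, amount in daily_sales])
monthly_spread = pstdev(list(monthly_means.values()))
print(monthly_totals, daily_spread, monthly_spread)
```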
is the main technique employed for data selection.
Sampling
4 TYPES OF SAMPLING
- Sampling without replacement
- Sampling with replacement
- Simple Random Sampling
- Stratified Sampling
a type of sampling where each item is removed from the population as it is selected.
Sampling without replacement
a type of sampling where objects are not removed from the population as they are selected, so the same object can be picked more than once.
Sampling with replacement
a type of sampling where there is an equal probability of selecting any particular item.
Simple Random Sampling
a type of sampling where the data is split into several partitions, then random samples are drawn from each partition.
Stratified Sampling
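A small sketch of the four sampling types using Python's standard random module. The population, the labels, and the helper stratified_sample are hypothetical. random.sample draws without replacement and random.choices draws with replacement; both give every item an equal chance on each draw, which is what makes them simple random sampling.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(10))

# Without replacement: a selected item leaves the pool, so no repeats occur.
without_replacement = rng.sample(population, k=5)

# With replacement: items stay in the pool, so the same item may be drawn again.
with_replacement = rng.choices(population, k=5)

def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key, then draw a simple random sample from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))  # never ask for more than a stratum holds
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical imbalanced data: 8 items labeled "a" and 2 items labeled "b".
labeled = [("a", i) for i in range(8)] + [("b", i) for i in range(2)]
strat = stratified_sample(labeled, key=lambda row: row[0], per_stratum=2, rng=rng)
print(without_replacement, with_replacement, strat)
```

Drawing a fixed number per partition is only one option; drawing in proportion to each partition's size is another common choice.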
is the number of samples in a data set.
SAMPLE SIZE
2 APPROACHES TO SAMPLE SIZE DETERMINATION
- Statistics
- Machine Learning
an approach to sample size determination based on the desired confidence interval for a parameter estimate or the desired statistical power of a test.
Statistics
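As one illustration of the statistical approach, the sketch below uses the standard formula for estimating a mean with a known standard deviation, n = (z * sigma / E)^2, rounded up. The function name and the input values (z = 1.96 for roughly 95% confidence, sigma = 15, margin of error 3) are assumptions chosen for the example, not values from the cards.

```python
import math

def sample_size_for_mean(z, sigma, margin):
    """Smallest n for which the half-width z * sigma / sqrt(n) is at most margin."""
    return math.ceil((z * sigma / margin) ** 2)

# Assumed inputs: z = 1.96 (about 95% confidence), sigma = 15, margin = 3.
n = sample_size_for_mean(z=1.96, sigma=15.0, margin=3.0)
half_width = 1.96 * 15.0 / math.sqrt(n)
print(n, half_width)
```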
an approach to sample size determination where more data is often better, as judged by cross-validated accuracy.
Machine Learning
when dimensionality increases, the size of the data space grows exponentially.
Curse of Dimensionality
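A quick way to see the exponential growth, assuming each dimension is cut into the same number of bins (the function names and the choice of 10 bins and 1000 points are hypothetical): the number of grid cells is bins ** dimensions, so a fixed number of points is spread ever more thinly.

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells when every dimension is split into the same number of bins."""
    return bins_per_dimension ** dimensions

def average_points_per_cell(n_points, bins_per_dimension, dimensions):
    """Average occupancy of a cell for a fixed number of points."""
    return n_points / grid_cells(bins_per_dimension, dimensions)

# With 1000 points and 10 bins per dimension, occupancy collapses as dimensions grow.
for d in (1, 2, 3, 6):
    print(d, grid_cells(10, d), average_points_per_cell(1000, 10, d))
```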
its purpose is to avoid the curse of dimensionality.
Dimensionality Reduction
Reduces the amount of time and memory required by data mining algorithms.
Dimensionality Reduction
3 TECHNIQUES FOR DIMENSION REDUCTION
- Principal Component Analysis
- ISOMAP
- Low Dimensional Embedding
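To make the first technique concrete, here is a minimal sketch of Principal Component Analysis restricted to two-dimensional data, where the top eigenvector of the covariance matrix has a closed form. The function pca_2d is hypothetical, and it assumes at least two points that are not all identical. Real projects would normally use a linear algebra library instead.

```python
import math

def pca_2d(points):
    """Return (unit direction, explained variance ratio, 1-D projections)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    top = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector; the axis-aligned cases cover b == 0.
    if b != 0:
        vx, vy = b, top - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Reduce each 2-D point to a single coordinate along the main direction.
    projections = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, top / (a + c), projections

# Points lying exactly on y = 2x: one dimension captures all of the variance.
line_points = [(float(x), 2.0 * x) for x in range(5)]
direction, ratio, projections = pca_2d(line_points)
print(direction, ratio, projections)
```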
is another way to reduce dimensionality of data.
Feature Subset Selection
2 TYPES OF FEATURES
- Redundant Features
- Irrelevant Features
is a type of feature that duplicates much or all of the information contained in one or more other attributes.
Redundant Features
is a type of feature that contains no information that is useful for the data mining task at hand.
Irrelevant Features
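One rough way to flag redundant features is to look for pairs of numeric attributes with a very high absolute Pearson correlation. The sketch below assumes made-up columns, a hypothetical threshold of 0.95, and hypothetical helper names. It only covers redundancy; whether a feature is irrelevant depends on the task and cannot be read off the data this way.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (neither constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.95):
    """List feature pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                pairs.append((first, second))
    return pairs

# Hypothetical data: sales_tax is always 8% of price, so it adds no new information.
features = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "sales_tax": [0.8, 1.6, 2.4, 3.2],
    "weight_kg": [5.0, 1.0, 4.0, 2.0],
}
flagged = redundant_pairs(features)
print(flagged)
```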
4 APPROACHES IN FEATURE SUBSET SELECTION
- Embedded Approach
- Filter Approach
- Brute-force Approach
- Wrapper Approach
feature selection occurs naturally as part of the data mining algorithm.
Embedded Approach
features are selected before the data mining algorithm is run.
Filter Approach
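A minimal sketch of a filter step, assuming a variance threshold as the selection criterion (one possible criterion among many; the column names, values, and threshold are hypothetical). The features are scored and selected on their own, with no data mining algorithm involved, and only the survivors would be passed on to the algorithm.

```python
from statistics import pvariance

def variance_filter(columns, min_variance):
    """Keep the names of features whose population variance reaches min_variance."""
    return [name for name, values in columns.items()
            if pvariance(values) >= min_variance]

# Hypothetical data set: country_code is constant, so it cannot help any model.
columns = {
    "age": [23.0, 35.0, 41.0, 58.0],
    "country_code": [1.0, 1.0, 1.0, 1.0],
    "income": [30.0, 52.0, 47.0, 80.0],
}
selected = variance_filter(columns, min_variance=0.01)
print(selected)
```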