Data Preprocessing Flashcards

1
Q

7 DATA PROCESSING TASKS / METHODS

A
  1. Aggregation
  2. Sampling
  3. Dimensionality Reduction
  4. Feature Subset Selection
  5. Feature Creation
  6. Discretization and Binarization
  7. Attribute Transformation
2
Q

combining two or more attributes into a single attribute.

A

Aggregation

3
Q

3 PURPOSE OF AGGREGATION

A
  • Data Reduction
  • Change of Scale
  • More Stable Data
4
Q

is the main technique employed for data selection.

A

Sampling

5
Q

4 TYPES OF SAMPLING

A
  • Sampling without replacement
  • Sampling with replacement
  • Simple Random Sampling
  • Stratified Sampling
6
Q

a type of sampling where, as each item is selected, it is removed from the population.

A

Sampling without replacement

7
Q

a type of sampling where objects are not removed from the population as they are selected.

A

Sampling with replacement

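Sampling with and without replacement can be sketched with Python's standard `random` module (a minimal illustration; the toy population is made up):

```python
import random

random.seed(42)
population = list(range(10))

# Sampling WITHOUT replacement: each selected item is removed from the
# population, so no item can appear twice in the sample.
without = random.sample(population, k=5)

# Sampling WITH replacement: selected items stay in the population,
# so the same item may be drawn more than once.
with_repl = random.choices(population, k=5)

print(without)    # 5 distinct items
print(with_repl)  # may contain duplicates
```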
8
Q

a type of sampling where there is an equal probability of selecting any particular item.

A

Simple Random Sampling

9
Q

a type of sampling that splits the data into several partitions, then draws random samples from each partition.

A

Stratified Sampling

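Stratified sampling as defined above can be sketched in a few lines of stdlib Python (the labelled toy data set is made up; the labels play the role of strata):

```python
import random
from collections import defaultdict

random.seed(0)

# Toy labelled data: 50 items of class "a", 10 of class "b".
data = [("a", i) for i in range(50)] + [("b", i) for i in range(10)]

# Split the data into partitions (strata) by label ...
strata = defaultdict(list)
for label, value in data:
    strata[label].append((label, value))

# ... then draw a random sample from each partition,
# proportional to the partition's size.
fraction = 0.2
sample = []
for label, items in strata.items():
    k = max(1, round(len(items) * fraction))
    sample.extend(random.sample(items, k))

print(len(sample))  # 10 from "a" + 2 from "b"
```

Unlike simple random sampling, this guarantees the small "b" stratum is represented in the sample.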
10
Q

is the number of observations included in a sample.

A

SAMPLE SIZE

11
Q

2 SAMPLE SIZE DETERMINATION

A
  • Statistics
  • Machine Learning
12
Q

a sample size determination approach based on the desired confidence interval for a parameter estimate, or the desired statistical power of a test.

A

Statistics

13
Q

a sample size determination approach where more data is often better, as judged by cross-validated accuracy.

A

Machine Learning

14
Q

when dimensionality increases, the size of the data space grows exponentially.

A

Curse of Dimensionality

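One way to see the exponential growth numerically: the fraction of the unit cube's volume occupied by its largest inscribed ball collapses as the dimension grows, so uniformly spread points end up sparse, mostly in the "corners". A minimal stdlib sketch:

```python
import math

def inscribed_ball_fraction(d: int) -> float:
    """Volume of the largest ball inscribed in the unit cube [0, 1]^d.

    Uses the d-dimensional ball volume formula pi^(d/2) r^d / Gamma(d/2 + 1)
    with radius r = 1/2; the cube itself has volume 1.
    """
    r = 0.5
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

# The fraction shrinks rapidly toward zero as d increases.
for d in (2, 5, 10, 20):
    print(d, inscribed_ball_fraction(d))
```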
15
Q

its purpose is to avoid the curse of dimensionality.

A

Dimensionality Reduction

16
Q

Reduces the amount of time and memory required by data mining algorithms.

A

Dimensionality Reduction

17
Q

3 TECHNIQUES FOR DIMENSION REDUCTION

A
  1. Principal Component Analysis
  2. ISOMAP
  3. Low Dimensional Embedding
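Of the techniques listed above, Principal Component Analysis is the most common. A minimal sketch via SVD of the mean-centred data, using NumPy (the toy data set is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 3-D that mostly vary along one direction,
# plus a little isotropic noise.
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.1 * rng.normal(size=(100, 3))

# PCA: centre the data, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first principal component -> 1-D representation.
X_reduced = Xc @ Vt[0]

print(X_reduced.shape)                 # (100,)
print(S[0] ** 2 / (S ** 2).sum())      # fraction of variance retained
```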
18
Q

is another way to reduce the dimensionality of data.

A

Feature Subset Selection

19
Q

2 TYPES OF FEATURES

A
  1. Redundant Features
  2. Irrelevant Features
20
Q

is a type of feature that duplicates much or all of the information contained in one or more other attributes.

A

Redundant Features

21
Q

is a type of feature that contains no information useful for the data mining task at hand.

A

Irrelevant Features

22
Q

4 APPROACHES IN FEATURE SUBSET SELECTION

A
  1. Embedded Approach
  2. Filter Approach
  3. Brute-force Approach
  4. Wrapper Approach
23
Q

feature selection occurs naturally as part of the data mining algorithm.

A

Embedded Approach

24
Q

features are selected before the data mining algorithm is run.

A

Filter Approach

25
Q

try all possible feature subsets as input to the data mining algorithm and choose the best.

A

Brute-force Approach

26
Q

use the data mining algorithm as a black box to find the best subsets of attributes.

A

Wrapper Approach

27
Q

creates new attributes that can capture the important information in a data set much more efficiently than the original attributes.

A

Feature Creation

28
Q

3 GENERAL METHODOLOGIES FOR FEATURE CREATION

A
  1. Feature Extraction
  2. Feature Construction / Feature Engineering
  3. Mapping Data to New Space

29
Q

is a feature creation methodology that is domain-specific.

A

Feature Extraction

30
Q

is a feature creation methodology that combines existing features.

A

Feature Construction / Feature Engineering

31
Q

2 WAYS OF MAPPING DATA TO NEW SPACE

A
  • Fourier Transform
  • Wavelet Transform

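Mapping data to frequency space can be sketched with a naive discrete Fourier transform, written out to show the mapping itself (in practice one would use `numpy.fft.fft`; the pure 3-cycle sine test signal is made up):

```python
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued signal."""
    n = len(signal)
    return [
        sum(x * complex(math.cos(-2 * math.pi * k * t / n),
                        math.sin(-2 * math.pi * k * t / n))
            for t, x in enumerate(signal))
        for k in range(n)
    ]

# A sine wave completing 3 cycles over the window.
n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]

# In the new (frequency) space, the periodic structure concentrates into a
# single peak at index 3 (and its mirror image at n - 3).
magnitudes = [abs(c) for c in dft(signal)]
print(magnitudes[3])  # dominant component, magnitude about n/2
```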
32
Q

a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.

A

Attribute Transformation

33
Q

a numerical measure of how alike two data objects are.

A

Similarity

34
Q

a numerical measure of how different two data objects are.

A

Dissimilarity

35
Q

refers to either a similarity or a dissimilarity.

A

Proximity

36
Q

6 METHODS TO MEASURE SIMILARITY OR DISSIMILARITY

A
  • Euclidean Distance
  • Minkowski Distance
  • Mahalanobis Distance
  • Cosine Similarity
  • Correlation
  • Rank Correlation

37
Q

is the generalization of Euclidean distance.

A

Minkowski Distance

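The generalization is a single formula with a parameter p; Euclidean distance is the special case p = 2, and Manhattan (city-block) distance is p = 1. A minimal stdlib sketch:

```python
# Minkowski distance of order p between two points x and y:
# ( sum_i |x_i - y_i|^p )^(1/p)
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # p = 1: Manhattan distance, 7.0
print(minkowski(x, y, 2))  # p = 2: Euclidean distance, 5.0
```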
38
Q

measures the linear relationship between two variables.

A

Correlation

39
Q

measures the degree of similarity between two rankings.

A

Rank Correlation

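The contrast between correlation and rank correlation can be sketched in stdlib Python: Pearson correlation measures the linear relationship, while Spearman rank correlation is simply the Pearson correlation of the rank vectors, so it measures agreement between two rankings (the toy data is made up; ties are ignored in this sketch):

```python
import statistics

def pearson(x, y):
    # Pearson correlation: covariance divided by the product of
    # standard deviations (linear relationship).
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

def ranks(x):
    # 1-based rank of each value (ties not handled in this toy sketch).
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Rank correlation = Pearson correlation of the rank vectors.
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotone in x, but not linear

print(pearson(x, y))   # strong but not perfect linear correlation
print(spearman(x, y))  # the rankings agree perfectly
```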
40
Q

describes the likelihood of a random variable taking on a given value.

A

Probability Density (Function)

41
Q

is a non-parametric way to estimate the probability density function of a random variable.

A

Kernel Density Estimation

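Kernel density estimation can be sketched with a Gaussian kernel: the estimated density at a point is the average of kernel "bumps" centred on the observed samples (a minimal stdlib sketch; the sample values and bandwidth are made up):

```python
import math

def gaussian_kernel(u):
    # Standard normal density, used as the smoothing kernel.
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, bandwidth):
    # Average of bandwidth-scaled kernels centred on each sample.
    return sum(gaussian_kernel((x - s) / bandwidth) for s in samples) \
           / (len(samples) * bandwidth)

samples = [1.0, 1.2, 0.8, 4.0, 4.1]

# The estimate is higher near the cluster around 1.0 than in the
# empty gap around 2.5 -- no parametric model is assumed.
print(kde(1.0, samples, bandwidth=0.5))
print(kde(2.5, samples, bandwidth=0.5))
```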
42
Q

the simplest approach: divide the region into a number of rectangular cells of equal volume, and define density as the number of points each cell contains.

A

Euclidean Density - Cell-Based

43
Q

the Euclidean density of a point is the number of points within a specified radius of that point.

A

Euclidean Density - Center-Based
