Lecture 4 - Data Understanding II Flashcards

1
Q

What does data preparation do with the information provided by data understanding?

A
  • Selects attributes
  • Reduces the dimension of the data set
  • Selects records
  • Treats missing values
  • Treats outliers
  • Improves data quality
  • Unifies and transforms data
2
Q

What is feature extraction?

A

The construction of new features from the given attributes

For example, instead of tasks finished, hours worked, and the number of hours usually needed per task -> create a new attribute "efficiency" (sketched below)
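
A minimal pandas sketch of this example; the column names and numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical project data; column names are assumptions, not from the lecture.
df = pd.DataFrame({
    "tasks_finished": [12, 7, 20],
    "hours_worked": [40, 35, 38],
    "usual_hours_per_task": [3.0, 4.5, 2.0],
})

# New feature: work accomplished (in "usual hours") relative to hours spent.
df["efficiency"] = df["tasks_finished"] * df["usual_hours_per_task"] / df["hours_worked"]
print(df)
```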

3
Q

What can be used for feature extraction for simple models?

A

Non-linear functions such as x^p, 1/x, log(x), sin(x), etc.

4
Q

How to predict y from x?

A

Prior knowledge (is y dependent on x?), visualization, and trial and error

5
Q

What’s the disadvantage of methods like PCA for feature extraction?

A

Dimensionality reduction techniques like PCA lead to features that can no longer be interpreted in a meaningful way: how do you understand a feature that is a linear combination of 10 attributes?

6
Q

What are some complex data type feature extractions?

A

Text data analysis -> frequency of keywords (sketched below)
Time series data analysis -> Fourier or wavelet coefficients
Graph data analysis -> number of vertices, edges
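
For the text case, a minimal sketch of keyword-frequency extraction; the text and keyword set are hypothetical:

```python
from collections import Counter

text = "data mining turns raw data into features; good features make mining easier"
keywords = {"data", "mining", "features"}  # hypothetical keyword list

# Count word occurrences, ignoring case and trailing punctuation.
counts = Counter(word.strip(";,.") for word in text.lower().split())
freq = {k: counts[k] for k in keywords}
print(freq)  # {'data': 2, 'mining': 2, 'features': 2}
```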

7
Q

What does feature selection refer to?

A

Techniques used to choose a subset of the features that is as small as possible and sufficient for the data analysis

8
Q

What are the reasons for feature selection?

A
  • Prior knowledge: we know something is irrelevant
  • Quality control: the majority of values are missing or bad
  • Non-informative: e.g. all values are the same
  • Redundancy: identical or correlated attributes (the last two cases are sketched below)
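
A minimal pandas sketch of the last two reasons, dropping a constant column and one column of a highly correlated pair; the 0.95 threshold is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],   # non-informative: all values the same
    "c": [2, 4, 6, 8],   # redundant: perfectly correlated with "a"
    "d": [5, 1, 4, 2],
})

# Non-informative: drop columns with only one distinct value.
df = df.loc[:, df.nunique() > 1]

# Redundancy: drop one column of every pair with |correlation| > 0.95.
corr = df.corr().abs()
to_drop = {c2 for i, c1 in enumerate(corr.columns)
           for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.95}
df = df.drop(columns=to_drop)
print(df.columns.tolist())  # ['a', 'd']
```
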
9
Q

What does record selection refer to? Why is it done?

A

Selecting only some rows of the data.
  • Timeliness: older data might be outdated
  • Representativeness: the sample in the database might not be representative of the whole population
  • Rare events: useful for rare but important events like stock market crashes

10
Q

How to choose records for rare events?

A
  • Artificially increase the proportion of the rare events by adding copies (sketched below)
  • Choose only a subset of the data
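
A minimal pandas sketch of the first option; the column names and the number of copies are assumptions:

```python
import pandas as pd

# 10 records, of which only one is the rare event ("crash").
df = pd.DataFrame({"x": range(10), "crash": [0] * 9 + [1]})

# Oversample: append extra copies of the rare-event rows.
rare = df[df["crash"] == 1]
df_balanced = pd.concat([df] + [rare] * 4, ignore_index=True)

print(df_balanced["crash"].value_counts())  # 9 normal rows vs. 5 crash rows
```
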
11
Q

What does data cleansing refer to?

A

Detecting and correcting/removing inaccurate, incorrect or incomplete records from the data set

12
Q

How to improve data quality?

A
  • Convert all characters to the same case
  • Remove superfluous spaces etc.
  • Fix the format of numbers
  • Split combined fields: "Chocolate, 100g" -> "chocolate", "100.0" (sketched below)
  • Normalize the writing
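
A minimal sketch of these steps on the "Chocolate, 100g" example; the exact cleanup rules are assumptions:

```python
import re

raw = "  Chocolate, 100g "

value = raw.strip().lower()                    # same case, no surrounding spaces
name, amount = value.split(",")                # split the combined field
grams = float(re.sub(r"[^0-9.]", "", amount))  # fix the number format

print(name.strip(), grams)  # chocolate 100.0
```
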
13
Q

What are the four discretization methods?

A
  1. Equi-width discretization: splits the range into intervals of the same length, e.g. [0-20, 20-40, 40-60]
  2. Equi-frequency discretization: splits the range into intervals containing roughly the same number of records, e.g. [4, 4, 4, 4]
  3. V-optimal discretization: minimizes the sum of n·V over the intervals, where n is the number of data objects in an interval and V is the sample variance within it
  4. Minimal entropy discretization: minimizes the uncertainty (entropy)
The first two are sketched below.
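
A minimal pandas sketch of the first two schemes; the sample data and the number of bins are assumptions:

```python
import pandas as pd

x = pd.Series([1, 3, 4, 7, 8, 12, 15, 21, 40, 60])

# Equi-width: three intervals of equal length.
equi_width = pd.cut(x, bins=3)

# Equi-frequency: three intervals with roughly the same number of records.
equi_freq = pd.qcut(x, q=3)

print(equi_width.value_counts().sort_index())
print(equi_freq.value_counts().sort_index())
```
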
14
Q

Why should data sometimes be normalized?

A

To guarantee impartiality for models that use distances; otherwise attributes with large value ranges dominate the distance.

15
Q

What is min-max normalization?

A

All the values are scaled into [0, 1]; outliers affect it a lot.
x' = (x - min_x) / (max_x - min_x)

16
Q

What is z-score standardization?

A

Scales the data to have a mean of 0 and a standard deviation of 1.
x' = (x - mean(x)) / std(x)

17
Q

What is robust z-score standardization?

A

x' = (x - median(x)) / IQR(x)

18
Q

What is decimal scaling?

A

For attribute X, take the smallest integer s larger than log_10(max(|x|)):

x' = x / 10^s

19
Q

What does centering the data matrix mean?

A

Subtracting the column means from all the rows of the matrix X; it moves the data so that it is centered at the origin.
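
A minimal NumPy sketch of the formulas from cards 15-19; the sample data is made up:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Min-max normalization: scales into [0, 1]; sensitive to outliers.
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1.
zscore = (x - x.mean()) / x.std()

# Robust z-score: median and interquartile range instead.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Decimal scaling: divide by the smallest power of 10 at least as large as max(|x|).
s = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / 10**s

# Centering a data matrix: subtract the column means from every row.
X = np.column_stack([x, x**2])
X_centered = X - X.mean(axis=0)
```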

20
Q

What is the number of possible 2D scatter plots for m attributes?

A

m(m-1), counting both axis orders; for m = 50 that gives 50*49 = 2450

21
Q

Why do we want to transform data to a lower-dimensional space?

A

There could be hundreds of thousands of attributes; to include them all in a plot, we need to define a measure that evaluates lower-dimensional plots of the data in terms of how well the plot preserves the original structure.

22
Q

What are parallel coordinates?

A

They draw the coordinate axes parallel to each other, so that there is no limitation for the number of axes to be displayed.

I.e. each record becomes a polyline across the axes, roughly like \/_/_
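
A minimal sketch using pandas' built-in parallel_coordinates plot; the data and class column are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "width":  [3.5, 3.0, 3.3, 2.7],
    "height": [1.4, 1.4, 6.0, 5.1],
    "class":  ["A", "A", "B", "B"],  # hypothetical class labels
})

# One parallel axis per attribute, one polyline per record.
parallel_coordinates(df, class_column="class")
plt.show()
```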

23
Q

What is the basic idea for dimensionality reduction?

A

Change the data from n-dimensional space to q-dimensional space (q = 2 or 3):

R^n -> R^q

24
Q

What is a linear map?

A

New attributes are linear combinations of old ones.

new_feature = 0.5·feature_1 + 0.3·feature_2
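
The same idea as a matrix product, in a minimal NumPy sketch; the weights are the assumed 0.5 and 0.3 from above:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # rows = records, columns = old features

W = np.array([[0.5],
              [0.3]])        # one new feature: 0.5*feature_1 + 0.3*feature_2

new_features = X @ W         # linear map R^2 -> R^1
print(new_features)          # [[1.1], [2.7]]
```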

25
Q

How does PCA work?

A

PCA uses the variance in the data as the structure preservation criterion. It tries to preserve as much of the original variance of the data as possible when projecting to a lower-dimensional space. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

26
Q

How do the principal components get determined?

A

The first PC has the largest possible variance, and each succeeding component in turn has the highest variance under the constraint that it is orthogonal to the preceding components.

27
Q

Is PCA sensitive to the relative scaling of the original variables?

A

Yes, so the data is usually z-score standardized first.

28
Q

What is an eigenvector?

A

A vector v whose direction is unchanged by a matrix A, i.e. Av = λv for some scalar λ (the eigenvalue). In PCA, the eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the variance along them (sketched below).
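
A minimal NumPy sketch tying cards 25-28 together, on made-up data: z-score standardize, then take the eigenvectors of the covariance matrix as principal components, sorted by eigenvalue (variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # a correlated attribute

# PCA is sensitive to scaling, so z-score standardize first.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvectors of the covariance matrix = principal component directions;
# eigenvalues = variance along each direction.
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # largest variance first
components = eigvecs[:, order]

# Orthogonal projection onto the first two components: R^3 -> R^2.
X2 = Xs @ components[:, :2]
```
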
29
Q

What is t-SNE?

A

t-distributed stochastic neighbor embedding, a non-linear dimensionality reduction method.

30
Q

How does t-SNE work?

A

Similar items end up at nearby points and dissimilar items at distant points. Beware: it can generate apparent clusters even when the data doesn't support them.

31
Q

What are the two stages of t-SNE?

A

1. A probability distribution over pairs of high-dimensional objects is constructed so that similar objects receive a high probability while dissimilar points receive a low probability.
2. A similar probability distribution is defined over the points in the low-dimensional map, and the embedding is chosen so that the two distributions match as closely as possible (sketched below).
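
A minimal scikit-learn sketch; the data and parameter values (perplexity, etc.) are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # hypothetical high-dimensional data

# Non-linear embedding into 2D; perplexity controls the neighborhood size.
X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X2.shape)  # (200, 2)
```
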
32
Q

What are some dimensionality reduction methods?

A

1. PCA
2. t-SNE
3. Kernel PCA (non-linear)
4. Linear discriminant analysis (used in classification; finds a low-dimensional representation of the data that separates the classes well)