Lecture 4 - Data Understanding II Flashcards

1
Q

What does data preparation do with the information provided by data understanding?

A
  • Selects attributes
  • Reduces the dimension of the data set
  • Selects records
  • Treats missing values
  • Treats outliers
  • Improves data quality
  • Unifies and transforms data
2
Q

What is feature extraction?

A

The construction of new features from the given attributes

For example, instead of tasks finished, hours worked, and the number of hours usually needed per task -> create a new attribute "efficiency" (sketched below)
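
A minimal pandas sketch of this example; the column names and numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical project data; column names are assumptions, not from the lecture.
df = pd.DataFrame({
    "tasks_finished": [12, 7, 20],
    "hours_worked": [40, 35, 38],
    "usual_hours_per_task": [3.0, 4.5, 2.0],
})

# New feature: work accomplished (in "usual hours") relative to hours spent.
df["efficiency"] = df["tasks_finished"] * df["usual_hours_per_task"] / df["hours_worked"]
print(df)
```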

3
Q

What can be used for feature extraction for simple models?

A

Non-linear functions such as x^p, 1/x, log(x), sin(x), etc.

4
Q

How to predict y from x?

A

Prior knowledge (is y dependent on x?), visualization, and trial and error

5
Q

What’s the disadvantage of methods like PCA for feature extraction?

A

Dimensionality reduction techniques like PCA lead to features that can no longer be interpreted in a meaningful way: how do you understand a feature that is a linear combination of 10 attributes?

6
Q

What are some complex data type feature extractions?

A

Text data analysis -> frequency of keywords (sketched below)
Time series data analysis -> Fourier or wavelet coefficients
Graph data analysis -> number of vertices, edges
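
For the text case, a minimal sketch of keyword-frequency extraction; the text and keyword set are hypothetical:

```python
from collections import Counter

text = "data mining turns raw data into features; good features make mining easier"
keywords = {"data", "mining", "features"}  # hypothetical keyword list

# Count word occurrences, ignoring case and trailing punctuation.
counts = Counter(word.strip(";,.") for word in text.lower().split())
freq = {k: counts[k] for k in keywords}
print(freq)  # {'data': 2, 'mining': 2, 'features': 2}
```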

7
Q

What does feature selection refer to?

A

Techniques used to choose a subset of the features that is as small as possible and sufficient for the data analysis

8
Q

What are the reasons for feature selection?

A
  • Prior knowledge: we know something is irrelevant
  • Quality control: the majority of values are missing or bad
  • Non-informative: e.g. all values are the same
  • Redundancy: identical or correlated attributes (the last two cases are sketched below)
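
A minimal pandas sketch of the last two reasons, dropping a constant column and one column of a highly correlated pair; the 0.95 threshold is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],   # non-informative: all values the same
    "c": [2, 4, 6, 8],   # redundant: perfectly correlated with "a"
    "d": [5, 1, 4, 2],
})

# Non-informative: drop columns with only one distinct value.
df = df.loc[:, df.nunique() > 1]

# Redundancy: drop one column of every pair with |correlation| > 0.95.
corr = df.corr().abs()
to_drop = {c2 for i, c1 in enumerate(corr.columns)
           for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.95}
df = df.drop(columns=to_drop)
print(df.columns.tolist())  # ['a', 'd']
```
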
9
Q

What does record selection refer to? Why is it done?

A

Selecting only some rows of the data.
  • Timeliness: older data might be outdated
  • Representativeness: the sample in the database might not be representative of the whole population
  • Rare events: useful for rare but important events like stock market crashes

10
Q

How to choose records for rare events?

A
  • Artificially increase the proportion of the rare events by adding copies (sketched below)
  • Choose only a subset of the data
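
A minimal pandas sketch of the first option; the column names and the number of copies are assumptions:

```python
import pandas as pd

# 10 records, of which only one is the rare event ("crash").
df = pd.DataFrame({"x": range(10), "crash": [0] * 9 + [1]})

# Oversample: append extra copies of the rare-event rows.
rare = df[df["crash"] == 1]
df_balanced = pd.concat([df] + [rare] * 4, ignore_index=True)

print(df_balanced["crash"].value_counts())  # 9 normal rows vs. 5 crash rows
```
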
11
Q

What does data cleansing refer to?

A

Detecting and correcting/removing inaccurate, incorrect or incomplete records from the data set

12
Q

How to improve data quality?

A
  • Convert all characters to the same case
  • Remove superfluous spaces etc.
  • Fix the format of numbers
  • Split combined fields: "Chocolate, 100g" -> "chocolate", "100.0" (sketched below)
  • Normalize the writing
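
A minimal sketch of these steps on the "Chocolate, 100g" example; the exact cleanup rules are assumptions:

```python
import re

raw = "  Chocolate, 100g "

value = raw.strip().lower()                    # same case, no surrounding spaces
name, amount = value.split(",")                # split the combined field
grams = float(re.sub(r"[^0-9.]", "", amount))  # fix the number format

print(name.strip(), grams)  # chocolate 100.0
```
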
13
Q

What are the four discretization methods?

A
  1. Equi-width discretization: splits the range into intervals of the same length, e.g. [0-20, 20-40, 40-60]
  2. Equi-frequency discretization: splits the range into intervals containing roughly the same number of records, e.g. [4, 4, 4, 4]
  3. V-optimal discretization: minimizes the sum of n·V over the intervals, where n is the number of data objects in an interval and V is the sample variance within it
  4. Minimal entropy discretization: minimizes the uncertainty (entropy)
The first two are sketched below.
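
A minimal pandas sketch of the first two schemes; the sample data and the number of bins are assumptions:

```python
import pandas as pd

x = pd.Series([1, 3, 4, 7, 8, 12, 15, 21, 40, 60])

# Equi-width: three intervals of equal length.
equi_width = pd.cut(x, bins=3)

# Equi-frequency: three intervals with roughly the same number of records.
equi_freq = pd.qcut(x, q=3)

print(equi_width.value_counts().sort_index())
print(equi_freq.value_counts().sort_index())
```
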
14
Q

Why should data sometimes be normalized?

A

To guarantee impartiality for models that use distances; otherwise attributes with large value ranges dominate the distance.

15
Q

What is min-max normalization?

A

All the values are scaled into [0, 1]; outliers affect it a lot.
x' = (x - min_x) / (max_x - min_x)

16
Q

What is z-score standardization?

A

Scales the data to have a mean of 0 and a standard deviation of 1.
x' = (x - mean(x)) / std(x)

17
Q

What is robust z-score standardization?

A

x' = (x - median(x)) / IQR(x)

18
Q

What is decimal scaling?

A

For attribute X, take the smallest integer s larger than log_10(max(|x|)):

x' = x / 10^s

19
Q

What does centering the data matrix mean?

A

Subtracting the column means from all the rows of the matrix X; it moves the data so that it is centered at the origin.
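
A minimal NumPy sketch of the formulas from cards 15-19; the sample data is made up:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Min-max normalization: scales into [0, 1]; sensitive to outliers.
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1.
zscore = (x - x.mean()) / x.std()

# Robust z-score: median and interquartile range instead.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Decimal scaling: divide by the smallest power of 10 at least as large as max(|x|).
s = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / 10**s

# Centering a data matrix: subtract the column means from every row.
X = np.column_stack([x, x**2])
X_centered = X - X.mean(axis=0)
```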

20
Q

What is the number of possible 2D scatter plots for m attributes?

A

m(m-1), counting both axis orders; for m = 50 that gives 50*49 = 2450

21
Q

Why do we want to transform data to a lower-dimensional space?

A

There could be hundreds of thousands of attributes; to include them all in a plot, we need to define a measure that evaluates lower-dimensional plots of the data in terms of how well the plot preserves the original structure.

22
Q

What are parallel coordinates?

A

They draw the coordinate axes parallel to each other, so that there is no limitation for the number of axes to be displayed.

I.e. each record becomes a polyline across the axes, roughly like \/_/_
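
A minimal sketch using pandas' built-in parallel_coordinates plot; the data and class column are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "width":  [3.5, 3.0, 3.3, 2.7],
    "height": [1.4, 1.4, 6.0, 5.1],
    "class":  ["A", "A", "B", "B"],  # hypothetical class labels
})

# One parallel axis per attribute, one polyline per record.
parallel_coordinates(df, class_column="class")
plt.show()
```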

23
Q

What is the basic idea for dimensionality reduction?

A

Change the data from n-dimensional space to q-dimensional space (q = 2 or 3):

R^n -> R^q

24
Q

What is a linear map?

A

New attributes are linear combinations of old ones.

new_feature = 0.5·feature_1 + 0.3·feature_2
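
The same idea as a matrix product, in a minimal NumPy sketch; the weights are the assumed 0.5 and 0.3 from above:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # rows = records, columns = old features

W = np.array([[0.5],
              [0.3]])        # one new feature: 0.5*feature_1 + 0.3*feature_2

new_features = X @ W         # linear map R^2 -> R^1
print(new_features)          # [[1.1], [2.7]]
```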

25
Q

How does PCA work?

A

PCA uses the variance in the data as the structure preservation criterion. It tries to preserve as much of the original variance of the data as possible when projecting to a lower-dimensional space. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

26
Q

How do the principal components get determined?

A

The first PC has the largest possible variance, and each succeeding component in turn has the highest variance under the constraint that it is orthogonal to the preceding components.

27
Q

Is PCA sensitive to the relative scaling of the original variables?

A

Yes, so the data is usually z-score standardized first.

28
Q

What is an eigenvector?

A

A vector v whose direction is unchanged by a matrix A, i.e. Av = λv for some scalar λ (the eigenvalue). In PCA, the eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the variance along them (sketched below).
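
A minimal NumPy sketch tying cards 25-28 together, on made-up data: z-score standardize, then take the eigenvectors of the covariance matrix as principal components, sorted by eigenvalue (variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # a correlated attribute

# PCA is sensitive to scaling, so z-score standardize first.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvectors of the covariance matrix = principal component directions;
# eigenvalues = variance along each direction.
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # largest variance first
components = eigvecs[:, order]

# Orthogonal projection onto the first two components: R^3 -> R^2.
X2 = Xs @ components[:, :2]
```
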
29
Q

What is t-SNE?

A

t-distributed stochastic neighbor embedding, a non-linear dimensionality reduction method.

30
Q

How does t-SNE work?

A

Similar items end up at nearby points and dissimilar items at distant points. Beware: it can generate apparent clusters even when the data doesn't support them.

31
Q

What are the two stages of t-SNE?

A

1. A probability distribution over pairs of high-dimensional objects is constructed so that similar objects receive a high probability while dissimilar points receive a low probability.
2. A similar probability distribution is defined over the points in the low-dimensional map, and the embedding is chosen so that the two distributions match as closely as possible (sketched below).
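
A minimal scikit-learn sketch; the data and parameter values (perplexity, etc.) are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # hypothetical high-dimensional data

# Non-linear embedding into 2D; perplexity controls the neighborhood size.
X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X2.shape)  # (200, 2)
```
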
32
Q

What are some dimensionality reduction methods?

A

1. PCA
2. t-SNE
3. Kernel PCA (non-linear)
4. Linear discriminant analysis (used in classification; finds a low-dimensional representation of the data that separates the classes well)