Data pre-processing Flashcards

1
Q

Possible data pre-processing procedures

A

Input preprocessing:

1) input centering
2) input normalization
3) input whitening
4) data cleaning

PCA can also be considered a data pre-processing procedure.

2
Q

Why is data pre-processing used?

A

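Data should be pre-processed to make them suited for learning and to achieve better results.
Pre-processing operations must be done without snooping the data: the parameters of the transform (e.g. mean, variance) must be computed from the training data only, never from the test data.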

3
Q

Input preprocessing: general procedure

A

Given the input data matrix X ∈ R^(N×d),
input pre-processing consists in finding a standardization transform Φ that gives
zn = Φ(xn) for any n = 1,...,N

The final hypothesis will be
h(x) = h~(Φ(x)),
where h~ is the hypothesis learned on the transformed data zn.
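
As an illustration of the procedure above, the following minimal sketch fits Φ on the training inputs only, learns a hypothesis on the transformed data, and composes the two. The choice of centering as Φ, the least-squares learner, the synthetic data, and the names `fit_phi`, `phi`, `h_tilde` and `h` are all assumptions made for this example, not part of the original card.

```python
import numpy as np

def fit_phi(X_train):
    """Estimate the parameters of Φ (here: the feature means) from training inputs only."""
    return X_train.mean(axis=0)

def phi(x, mean):
    """Apply Φ to a single input or to a whole matrix of inputs."""
    return x - mean

# Synthetic, noise-free training set (assumption for the example).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 3.0

# 1) Fit Φ on the training data and transform it.
mean = fit_phi(X_train)
Z_train = phi(X_train, mean)

# 2) Learn h~ on the transformed data (least squares with an intercept).
Z_aug = np.column_stack([np.ones(len(Z_train)), Z_train])
w, *_ = np.linalg.lstsq(Z_aug, y_train, rcond=None)

def h_tilde(z):
    """Hypothesis learned in the transformed space."""
    return w[0] + z @ w[1:]

def h(x):
    """Final hypothesis: the same Φ is applied to any new input before h~."""
    return h_tilde(phi(x, mean))

x_new = np.array([4.0, 6.0, 5.0])
prediction = h(x_new)
```

The point of the design is that `mean` is computed once from `X_train` and then reused unchanged for every new input, which is one way to avoid data snooping.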

4
Q

Input centering

A

Goal: to remove any bias from the input.
zn = xn - x̄

x̄ = (1/N) Σ_{n=1..N} xn (mean of the data)
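
A minimal NumPy sketch of centering, assuming the rows of `X` are the samples xn; the small matrix is made up for the example.

```python
import numpy as np

# Toy data: N = 3 samples, d = 2 features (assumption for the example).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

x_bar = X.mean(axis=0)   # per-feature mean over the N samples
Z = X - x_bar            # broadcasting subtracts x_bar from every row
```

After this step every column of `Z` has zero mean.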

5
Q

Input normalization

A
Goal: to scale each input feature by its standard deviation.
Assuming the input is centered:
zn = [zn1 … znd]' = [xn1/σ1 … xnd/σd]'
where
σi^2 = (1/N) Σ_{n=1..N} xni^2 , i = 1,...,d
(variance of each feature)
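
A minimal NumPy sketch of normalization, assuming the rows of `X` are the samples and using the 1/N variance convention of the formula above; the toy matrix is made up for the example.

```python
import numpy as np

# Toy data: N = 3 samples, d = 2 features (assumption for the example).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

Xc = X - X.mean(axis=0)                    # center first
sigma = np.sqrt((Xc ** 2).mean(axis=0))    # per-feature standard deviation (1/N)
Z = Xc / sigma                             # each feature now has unit variance
```

A feature with zero variance would cause a division by zero here, so in practice such constant features need to be handled separately.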
6
Q

Input whitening

A

Goal: to decorrelate the input features, if it is known that they are correlated.
zn = A^(-1/2) xn
where A is the covariance matrix of the (centered) x's
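
A minimal NumPy sketch of whitening, assuming centered data and a full-rank covariance matrix. Computing A^(-1/2) through an eigendecomposition is one possible choice made for this example, and the synthetic data and the mixing matrix `M` are assumptions.

```python
import numpy as np

# Synthetic correlated data: N = 500 samples, d = 2 features.
rng = np.random.default_rng(0)
M = np.array([[2.0, 0.0],
              [1.5, 0.5]])                 # mixing matrix that introduces correlation
X = rng.normal(size=(500, 2)) @ M.T

Xc = X - X.mean(axis=0)                    # center first
A = Xc.T @ Xc / len(Xc)                    # covariance matrix (1/N convention)

# A is symmetric positive definite, so A^(-1/2) = V diag(λ^(-1/2)) V'.
eigvals, eigvecs = np.linalg.eigh(A)
A_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Row form of zn = A^(-1/2) xn (A_inv_sqrt is symmetric).
Z = Xc @ A_inv_sqrt
cov_Z = Z.T @ Z / len(Z)                   # should be close to the identity
```

If some eigenvalue of `A` is zero or very small, the inverse square root is not defined or is numerically unstable, so this sketch only applies to the full-rank case.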

7
Q

Data cleaning

A

Goal: to remove outliers

  • use simple models
  • compute a validation-based leverage score
  • look at the data
8
Q

Causes of outliers

A
  • stochastic output noise
  • system complexity not modelled (deterministic noise)
