Chapter 3 Flashcards

1
Q

what are 3 types of approaches to reduce noise in the data?

A
  1. detect and remove outliers from the data
  2. impute missing values in our data (including the outliers we removed)
  3. transform our data in order to identify the most important parts of it
2
Q

definition of an outlier?

A

an outlier is an observation point that is distant from other observations

3
Q

what are the two causes of outliers?

A

causes for outliers:

  1. measurement error
  2. variability of the phenomenon that we observe
4
Q

what is a risk when applying domain knowledge to remove outliers?

A

in some situations, the very existence of outliers carries information, which would be filtered out by applying domain knowledge

> e.g. a heart rate of 220 is unlikely, but it might reflect extreme physical stress that caused the chest strap to malfunction

5
Q

what is a general problem of the domain knowledge approach?

A

we simply do not always have that information

6
Q

what is a risk of non-domain knowledge outlier removal methods?

> solutions?

A

if we don't have domain knowledge, this is an unsupervised learning task

> high risk of removing points that are not measurement errors

solutions:

  1. visual inspection
  2. monitor machine learning performance with and without outliers
7
Q

what are distribution based outlier models?

A

distribution based approaches are based on the probability distribution of the data: we assume that the data follows a certain distribution and remove the datapoints that fall outside certain bounds of that distribution

> mainly target single attributes

8
Q

explain chauvenet's criterion

A
  1. assume the data follows a single (normal) distribution
  2. for each datapoint, compute the probability of observing it under that distribution
  3. reject each measurement that has a probability lower than 1/(2N)

> the 2 can be replaced by a parameter c, which specifies the degree of certainty for the identification of outliers

> a higher c corresponds to a higher chance that identified outliers are truly outliers
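
> a minimal sketch in Python; the toy data and the use of a two-tailed normal probability are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def chauvenet_outliers(x, c=2):
    """Flag outliers with Chauvenet's criterion.

    Assumes x follows a single normal distribution. A point is rejected
    when the two-tailed probability of a deviation at least as large is
    below 1 / (c * N); c=2 gives the classic 1/(2N) threshold.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # probability of observing a deviation at least this extreme
    prob = 2 * (1 - norm.cdf(np.abs(x - x.mean()) / x.std()))
    return prob < 1.0 / (c * n)

values = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
print(chauvenet_outliers(values))  # only the 25.0 is flagged
```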

9
Q

explain mixture models

A
  1. assume K distributions that describe the data
  2. find parameters for the distributions that maximize the likelihood of observing our attributes
  3. points with the lowest probability of being observed given the distributions are candidates for outliers

> exact criterion depends on the data
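
> a sketch using scikit-learn's GaussianMixture; the 0.5% quantile cutoff is an illustrative assumption, since the exact criterion depends on the data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two clusters of normal data plus one far-away point
X = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(8, 1, 200),
                    [30.0]]).reshape(-1, 1)

# steps 1 and 2: fit K = 2 distributions by maximum likelihood (EM)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# step 3: the lowest-likelihood points are outlier candidates
log_likelihood = gmm.score_samples(X)
cutoff = np.quantile(log_likelihood, 0.005)  # illustrative cutoff
print(X[log_likelihood < cutoff].ravel())    # should include the 30.0
```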

10
Q

what are distance based outlier models?

A

distance based models:

consider the distance between a point and other points in the dataset

> this is possible for individual attributes, but also for multiple attributes at once

11
Q

explain simple distance based approach

A

simple distance based approach: global view towards the data

  1. consider the distance of a point to all other points
  2. define a minimum distance dmin within which we consider a point to be close to another point
  3. compute the fraction of points in the dataset at a distance of more than dmin
  4. if that fraction is greater than fmin, the point is an outlier
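
> a minimal sketch; the values of dmin and fmin (and the toy data) are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def simple_distance_outliers(X, d_min, f_min):
    """Flag points for which more than a fraction f_min of all other
    points lie farther away than d_min (the global, simple approach)."""
    dists = cdist(X, X)  # pairwise distances; the diagonal is zero
    n = len(X)
    # fraction of *other* points at distance greater than d_min
    far_fraction = (dists > d_min).sum(axis=1) / (n - 1)
    return far_fraction > f_min

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), [[10.0, 10.0]]])
print(np.where(simple_distance_outliers(X, d_min=3.0, f_min=0.9))[0])  # [50]
```
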
12
Q

explain local outlier factor

A

local outlier factor: take local density into account

  1. define kdist for each point: the largest distance among the distances to its k closest points
  2. define the reachability distance of point x with respect to another point: it is the real distance if x is not among the k nearest neighbours of that point, otherwise it is that point's kdist
  3. consider the local reachability density around point x: 1 / (the average reachability distance of x with respect to its k neighbours)
  4. compare the local reachability density of point x to that of each of its k neighbouring points

> the higher the local reachability density of point x compared to its neighbours, the lower its local outlier factor becomes
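
> scikit-learn ships an implementation; n_neighbors=20 and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# a dense cluster plus one point in a sparse region
X = np.vstack([rng.normal(0, 0.5, (100, 2)), [[5.0, 5.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers

# negative_outlier_factor_ holds -LOF: the more negative, the more anomalous
print(np.where(labels == -1)[0])
print(lof.negative_outlier_factor_[-1])
```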

13
Q

advantage of median imputation over mean imputation?

A

median is not as sensitive to extreme values
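
> a quick illustration with pandas (the values are made up):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, None, 11.0, 300.0])  # 300.0 is an extreme value
print(s.fillna(s.mean()))    # mean imputation: the gap becomes 83.25
print(s.fillna(s.median()))  # median imputation: the gap becomes 11.5
```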

14
Q

explain the kalman filter

A

kalman filter: identify outliers and impute them in one go

> for each observation, distinguish between a latent state that we don't directly observe and a measurement

> analyse the deviation between measurement and state

> if deviation is too big, assume measurement is noise and impute with latent state value, otherwise use original value
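
> a minimal 1-D sketch with a random-walk state model; the noise variances q and r and the deviation threshold max_dev are illustrative assumptions, not a canonical tuning:

```python
import numpy as np

def kalman_impute(measurements, q=1e-4, r=0.5, max_dev=2.0):
    """Identify outliers and impute them in one go with a scalar
    Kalman filter (the state is assumed to follow a random walk)."""
    x, p = measurements[0], 1.0      # latent state and its variance
    out = []
    for z in measurements:
        p = p + q                    # predict: uncertainty grows over time
        innovation = z - x           # deviation between measurement and state
        if abs(innovation) > max_dev * np.sqrt(p + r):
            out.append(x)            # deviation too big: impute latent state
        else:
            k = p / (p + r)          # Kalman gain
            x = x + k * innovation   # update the latent state
            p = (1 - k) * p
            out.append(z)            # plausible: keep the original value
    return out

print(kalman_impute([1.0, 1.1, 0.9, 8.0, 1.0, 1.05]))  # the 8.0 is replaced
```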

15
Q

explain lowpass filtering

> parameter n?

A

lowpass filter: remove irrelevant frequencies from data

> remove data originating from a form of periodicity above a certain frequency and leave data at lower frequencies untouched

> use a transfer function: the higher the frequency, the lower the magnitude of the transfer function

> parameter n: the order of the filter; the higher the order, the more steeply the magnitude of frequencies above the cutoff drops
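
> a sketch using a Butterworth lowpass filter from SciPy; the sampling rate, cutoff and order are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 100.0    # sampling rate in Hz (assumed)
cutoff = 1.5  # cutoff frequency in Hz (assumed)
n = 4         # order of the filter: higher n gives a steeper roll-off

t = np.arange(0, 5, 1 / fs)
# a slow 0.5 Hz signal with 20 Hz high-frequency noise on top
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)

b, a = butter(n, cutoff, btype="low", fs=fs)  # Butterworth transfer function
filtered = filtfilt(b, a, signal)             # zero-phase lowpass filtering
```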

16
Q

explain PCA

A

PCA:

> find lines (or hyperplanes if dimension p > 2) and order them in terms of how much variance they explain

> choose n <= p eigenvectors and express our data as its projection onto those eigenvectors

> this reduces the dimensionality of the data, and we lose information (although that could be noise)
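
> a sketch with scikit-learn (the toy data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # p = 5 attributes of toy data

pca = PCA(n_components=2)          # keep n = 2 components (n <= p)
X_reduced = pca.fit_transform(X)   # project onto the top eigenvectors

print(pca.explained_variance_ratio_)  # variance explained per component
print(X_reduced.shape)                # (200, 2): dimensionality reduced
```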