what are 3 types of approaches to reduce noise in the data?

1. detect and remove outliers from the data

2. impute missing values in our data (including the outliers we removed)

3. transform our data in order to identify the most important parts of it

definition of an outlier?

an outlier is an observation point that is distant from other observations

what are the two causes for outliers?

causes for outliers:

1. measurement error

2. variability of the phenomenon that we observe

what is a risk when applying domain knowledge to remove outliers?

in some situations the mere existence of an outlier carries information, and that information is lost if we filter the outlier out based on domain knowledge

> e.g. a heart rate of 220 is unlikely, but it may reflect extreme physical stress that caused the chest strap to malfunction

what is a general problem of the domain knowledge approach?

we simply do not always have that information

what is a risk of non-domain knowledge outlier removal methods?

> solutions?

if we don't have domain knowledge, this is an unsupervised learning task

> high risk of removing points that are not measurement errors

solutions:

1. visual inspection

2. monitor machine learning performance with and without outliers

what are distribution based outlier models?

distribution based approaches are based on the probability distribution of the data. we assume that the data follows a certain distribution and remove those datapoints outside of certain bounds of the distribution

> mainly target single attributes

explain chauvenet's criterion

1. assume data follows a single distribution (normal)

2. for each datapoint compute the probability of observation under that distribution

3. reject each measurement that has a probability lower than 1/(2N), where N is the number of datapoints

> the 2 can be replaced by a parameter c, which specifies the degree of certainty for the identification of outliers, giving the threshold 1/(cN)

> higher c corresponds to a higher chance that identified outliers are truly outliers
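
the steps above can be sketched as follows. this is a minimal sketch, assuming a normal distribution fitted with the sample mean and population standard deviation, and assuming "probability of observation" means the two-sided tail probability of a deviation at least as large as the observed one. the function name `chauvenet_outliers` and the example values are made up for illustration.

```python
import math
import statistics


def chauvenet_outliers(values, c=2.0):
    """Return one boolean per value: True where the value is flagged as an outlier."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return [False] * n
    threshold = 1.0 / (c * n)
    flags = []
    for v in values:
        z = abs(v - mean) / std
        # Assumption: two-sided tail probability under the fitted normal distribution.
        prob = math.erfc(z / math.sqrt(2))
        flags.append(prob < threshold)
    return flags


# Hypothetical heart-rate readings with one extreme value.
heart_rates = [72, 75, 71, 74, 73, 76, 70, 74, 220]
flags = chauvenet_outliers(heart_rates)
strict_flags = chauvenet_outliers(heart_rates, c=100)
```

with the default c = 2 only the reading of 220 is flagged; with c = 100 the threshold becomes so small that nothing is flagged, which illustrates how a higher c makes the criterion more conservative.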

explain mixture models

1. assume K distributions that describe data

2. find parameters for the distributions that maximize the likelihood to observe our attributes

3. points with the lowest probability of being observed given the distributions are candidates for outliers

> exact criterion depends on the data
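
a rough sketch of the idea for a single attribute, assuming K = 2 normal distributions fitted with a plain expectation-maximization loop. the initialisation, the fixed number of iterations, the variance floor, and the names `fit_mixture` and `mixture_density` are all assumptions made for this example, not part of the notes.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(values, k=2, iterations=50):
    """Fit a 1D mixture of k normal distributions with a basic EM loop."""
    lo, hi = min(values), max(values)
    # Assumption: spread the initial means evenly over the data range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    variances = [((hi - lo) / k) ** 2] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances


def mixture_density(x, weights, means, variances):
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical data: two clusters and one value in between.
values = [0.0, 0.5, 1.0, 1.5, 2.0, 5.0, 8.0, 8.5, 9.0, 9.5, 10.0]
weights, means, variances = fit_mixture(values)
candidate = min(values, key=lambda x: mixture_density(x, weights, means, variances))
```

here `candidate` is the value with the lowest density under the fitted mixture; whether it is actually removed still depends on a data-specific criterion.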

what are distance based outliers models?

distance based models:

consider the distance between a point and other points in the dataset

> this is possible for individual, but also for multiple attributes

explain simple distance based approach

simple distance based approach: global view towards data

1. consider the distance of a point to all other points

2. define minimum distance dmin within which we consider a point to be close to another point

3. for each point, compute the fraction of points in the dataset that lie at a distance of more than dmin

4. if that fraction is more than fmin, the point is an outlier
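
the four steps can be sketched as follows, assuming euclidean distance and at least two points. the function name `distance_outliers` and the example values for dmin and fmin are made up for illustration.

```python
import math


def distance_outliers(points, d_min, f_min):
    """Flag a point when more than f_min of the other points lie farther away than d_min."""
    n = len(points)
    flags = []
    for i, p in enumerate(points):
        # Count the other points that are NOT close to p.
        far = sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) > d_min)
        flags.append(far / (n - 1) > f_min)
    return flags


# Hypothetical data: a tight square of points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
flags = distance_outliers(points, d_min=3.0, f_min=0.9)
```

only the point at (10, 10) is flagged: all other points are farther than dmin from it, while each point in the square is far from just one of its four peers.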

explain local outlier factor

local outlier factor: take local density into account

1. define kdist for each point: the largest distance among the distances to its k closest points

2. define the reachability distance of a point x from another point y: the real distance d(x, y) if x is not among the k nearest neighbours of y, otherwise kdist(y); in short, max(kdist(y), d(x, y))

3. consider the local reachability density around point x: 1 / (average reachability distance of x from each of its k nearest neighbours)

4. compare the local reachability density of point x to that of each of its k nearest neighbours: the local outlier factor is the average neighbour density divided by the density of x

> the higher the local reachability density of point x compared to its neighbours, the lower its local outlier factor becomes; a factor well above 1 suggests an outlier
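
a compact sketch of the computation, assuming euclidean distance, no duplicate points, and exactly k neighbours per point (ties are broken by input order, which is a simplification). the reachability distance of a point from a neighbour is taken as the maximum of the neighbour's kdist and the real distance. the function name `local_outlier_factors` is made up for this example.

```python
import math


def local_outlier_factors(points, k):
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th closest point

    def reach_dist(i, j):
        # Reachability distance of point i from its neighbour j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the average reachability distance.
    lrd = []
    for i in range(n):
        avg = sum(reach_dist(i, j) for j in neighbours[i]) / k
        lrd.append(1.0 / avg)

    # Local outlier factor: average neighbour density relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / k / lrd[i] for i in range(n)]


# Hypothetical data: a tight square of points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(points, k=2)
```

the four points in the square get a factor of 1, because their density matches that of their neighbours; the distant point gets a factor far above 1.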

advantage of median imputation over mean imputation?

median is not as sensitive to extreme values
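
a small illustration, assuming missing values are marked with `None`. the helper `impute` and the series are made up for this example.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries by the median (default) or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical heart-rate series with one missing value and one extreme value.
series = [70, 72, None, 71, 73, 220]
median_filled = impute(series)
mean_filled = impute(series, strategy="mean")
```

the median fills the gap with 72, which is in line with the typical readings, whereas the mean is pulled up to 101.2 by the single extreme value of 220.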

explain the kalman filter

kalman filter: identify outliers and impute them in one go

> for each observation, distinguish between a latent state that we don't directly observe and a measurement

> analyse the deviation between measurement and state

> if deviation is too big, assume measurement is noise and impute with latent state value, otherwise use original value
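
a minimal one-dimensional sketch of this idea, assuming the latent state follows a random walk, the first measurement is trusted, and "too big" means the deviation exceeds `gate` standard deviations of the expected deviation. the variances, the gate, and the name `kalman_impute` are assumptions chosen for illustration.

```python
def kalman_impute(measurements, process_var=1.0, measurement_var=4.0, gate=3.0):
    """Return the series with suspected noise replaced by the latent state estimate."""
    x = measurements[0]   # latent state estimate (assumption: first value is trusted)
    p = measurement_var   # uncertainty of the state estimate
    cleaned = [measurements[0]]
    for z in measurements[1:]:
        # Predict: the random-walk state stays put, its uncertainty grows.
        p_pred = p + process_var
        deviation = z - x
        s = p_pred + measurement_var  # expected variance of the deviation
        if abs(deviation) > gate * s ** 0.5:
            # Deviation too big: treat as noise, impute with the latent state.
            cleaned.append(x)
            p = p_pred
        else:
            # Accept the measurement and update the latent state with it.
            k = p_pred / s
            x = x + k * deviation
            p = (1 - k) * p_pred
            cleaned.append(z)
    return cleaned


# Hypothetical heart-rate readings with one spike.
readings = [72.0, 73.0, 74.0, 220.0, 75.0, 74.0]
cleaned = kalman_impute(readings)
```

the spike of 220 is replaced by the current state estimate (roughly 73), while all other readings are kept as measured.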

explain lowpass filtering

> parameter n?

lowpass filter: remove irrelevant frequencies from data

> remove data originating from a form of periodicity above a certain cutoff frequency and leave data at lower frequencies untouched

> uses a transfer function: the higher the frequency, the lower the magnitude of the transfer function

> parameter n: order of the filter; the higher the order, the more steeply the magnitude of the frequencies above the cutoff drops
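
the effect of the order n can be illustrated with the magnitude of the transfer function. this sketch assumes a Butterworth filter, which the notes do not name explicitly; its magnitude is 1 / sqrt(1 + (f / fc)^(2n)) for frequency f and cutoff fc. the function name `butterworth_gain` and the frequencies are made up for illustration.

```python
import math


def butterworth_gain(frequency, cutoff, order):
    """Magnitude of a Butterworth lowpass transfer function at the given frequency."""
    return 1.0 / math.sqrt(1.0 + (frequency / cutoff) ** (2 * order))


cutoff = 1.5  # hypothetical cutoff frequency in Hz
for order in (1, 2, 4):
    below = butterworth_gain(0.5 * cutoff, cutoff, order)
    above = butterworth_gain(2.0 * cutoff, cutoff, order)
    print(f"order {order}: gain below cutoff {below:.3f}, above cutoff {above:.3f}")
```

as the order increases, the gain below the cutoff moves closer to 1 and the gain above the cutoff moves closer to 0, so the transition around the cutoff becomes steeper.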

explain PCA

PCA:

> find lines (or hyperplanes if dimension p > 2) and order them in terms of how much variance they explain

> keep the n <= p eigenvectors that explain the most variance and express our data by its projection onto those eigenvectors

> reduces the dimensionality of the data, and we lose information (although that could be noise)
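
a small sketch for the special case p = 2 and n = 1, where the eigenvectors of the covariance matrix have a closed form. it assumes at least two points with non-zero total variance; the function name `pca_2d` and the data are made up for illustration, and for larger p one would normally use a linear algebra library.

```python
import math


def pca_2d(points):
    """Project 2D points onto their first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]

    # Covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (mid + spread, mid - spread)

    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # One coordinate per point instead of two.
    scores = [x * direction[0] + y * direction[1] for x, y in centred]
    explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
    return direction, scores, explained


# Hypothetical data lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, scores, explained = pca_2d(points)
```

for this data the first component points roughly along the diagonal and explains more than 99% of the variance, so the dropped second component mostly contains the small deviations from the line.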