Chapter 3 Flashcards
(16 cards)
what are 3 types of approaches to reduce noise in the data?
- detect and remove outliers from the data
- impute missing values in our data (including the outliers we removed)
- transform our data in order to identify the most important parts of it
definition of an outlier?
an outlier is an observation point that is distant from other observations
what are the two causes for outliers?
causes for outliers:
- measurement error
- variability of the phenomenon that we observe
what is a risk when applying domain knowledge to remove outliers?
in some situations the outliers themselves carry information, and that information could be filtered out by applying domain knowledge
> e.g. a heart rate of 220 is unlikely, but it could reflect extreme physical stress that causes the chest strap to malfunction
what is a general problem of the domain knowledge approach?
we simply do not always have that information
what is a risk of non-domain knowledge outlier removal methods?
> solutions?
if we don't have domain knowledge, this is an unsupervised learning task
> high risk of removing points that are not measurement errors
solutions:
- visual inspection
- monitor machine learning performance with and without outliers
what are distribution based outlier models?
distribution based approaches are based on the probability distribution of the data. we assume that the data follows a certain distribution and remove those datapoints outside of certain bounds of the distribution
> mainly target single attributes
explain Chauvenet's criterion
- assume data follows a single distribution (normal)
- for each datapoint compute the probability of observation under that distribution
- reject each measurement that has a probability lower than 1/(2N), where N is the number of measurements
> 2 can be replaced by parameter c, which specifies the degree of certainty for the identification of outliers
> higher c corresponds to a higher chance that identified outliers are truly outliers (see the sketch below)
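a minimal sketch (not from the card) of how this could look in Python for a single attribute, assuming a normal distribution fitted to the data; the two-sided tail probability and c = 2 are illustrative choices:

```python
import numpy as np
from scipy import stats

def chauvenet_outliers(values, c=2.0):
    """Flag points whose probability under a fitted normal is below 1/(c*N)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std()
    # two-sided probability of a deviation at least this extreme
    deviation = np.abs(values - mean) / std
    prob = 2 * (1 - stats.norm.cdf(deviation))
    return prob < 1.0 / (c * n)

# example: the single extreme heart-rate value gets flagged
heart_rate = [72, 75, 71, 74, 73, 70, 76, 220]
print(chauvenet_outliers(heart_rate))
```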
explain mixture models
- assume K distributions that describe data
- find parameters for the distributions that maximize the likelihood to observe our attributes
- points with the lowest probability of being observed given the distributions are candidates for outliers
> exact criterion depends on the data
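a hedged sketch using scikit-learn's GaussianMixture on a single made-up attribute; K = 3 and the 1% quantile cutoff are illustrative assumptions, since the exact criterion depends on the data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_outlier_scores(values, k=3):
    """Fit a mixture of k normal distributions and return the per-point log-likelihood."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    return gmm.score_samples(X)  # lower = less likely = more outlier-like

values = np.concatenate([np.random.normal(70, 3, 200),
                         np.random.normal(120, 5, 100),
                         [220.0]])
scores = mixture_outlier_scores(values)
threshold = np.quantile(scores, 0.01)  # e.g. flag the 1% least likely points
print(values[scores < threshold])
```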
what are distance based outliers models?
distance based models:
consider the distance between a point and other points in the dataset
> this is possible for individual attributes, but also for multiple attributes at once
explain simple distance based approach
simple distance based approach: global view towards data
- consider the distance of a point to all other points
- define minimum distance dmin within which we consider a point to be close to another point
- compute the fraction of points in the dataset at a distance of more than dmin from the point
- if that fraction is more than fmin, the point is an outlier (example below)
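a small illustrative sketch of this criterion with numpy/scipy; the toy data and the dmin/fmin settings are arbitrary examples:

```python
import numpy as np
from scipy.spatial.distance import cdist

def simple_distance_outliers(X, d_min, f_min):
    """A point is an outlier if more than a fraction f_min of the other points
    lie further away than d_min."""
    X = np.asarray(X, dtype=float)
    dist = cdist(X, X)                        # pairwise distances
    n = len(X)
    far_fraction = (dist > d_min).sum(axis=1) / (n - 1)
    return far_fraction > f_min

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(simple_distance_outliers(X, d_min=1.0, f_min=0.9))  # only [5, 5] is flagged
```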
explain local outlier factor
local outlier factor: take local density into account
- define kdist for each point: largest distance among the distances of the k closest points
- define the reachability distance of point x from another point x': this is the true distance d(x, x') if x is not among the k nearest neighbours of x', otherwise it is the kdist of x' (i.e. max(kdist(x'), d(x, x')))
- consider local reachability density around point x: 1/average reachability distance for all k neighbours of x
- compare the local reachability density of point x to that of its k neighbouring points
> the higher the local reachability density of x compared to its neighbours, the lower the local outlier factor becomes (snippet below)
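for practice, scikit-learn ships an implementation of this algorithm; a minimal illustrative snippet (the toy data and n_neighbors = 3 are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.1], [5.0, 5.0]])
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 marks points flagged as outliers
scores = -lof.negative_outlier_factor_  # LOF values: higher = more outlying
print(labels, scores.round(2))
```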
advantage of median imputation over mean imputation?
median is not as sensitive to extreme values
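a tiny illustrative pandas example (made-up values) showing how a single extreme value drags the mean but not the median:

```python
import numpy as np
import pandas as pd

s = pd.Series([70, 72, 75, np.nan, 74, 220])  # 220 is an extreme value
print(s.fillna(s.mean()))    # mean imputation fills with ~102.2, pulled up by 220
print(s.fillna(s.median()))  # median imputation fills with 74, close to typical values
```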
explain the kalman filter
kalman filter: identify outliers and impute them in one go
> for each observation, distinguish between a latent state that we don't directly observe and a measurement
> analyse the deviation between measurement and state
> if deviation is too big, assume measurement is noise and impute with latent state value, otherwise use original value
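a minimal scalar sketch, assuming a random-walk latent state; the noise variances and the 3-sigma deviation threshold are illustrative choices, not part of the card:

```python
import numpy as np

def kalman_impute(measurements, process_var=1e-3, meas_var=1.0, n_sigmas=3.0):
    """Minimal 1-D Kalman filter: track a latent state and replace a measurement
    with the predicted state when it deviates too much from it."""
    x, p = measurements[0], 1.0      # initial state estimate and its variance
    cleaned = []
    for z in measurements:
        p = p + process_var          # predict: state stays, uncertainty grows
        if abs(z - x) > n_sigmas * np.sqrt(p + meas_var):
            z = x                    # deviation too big: impute with latent state
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # update state with (possibly imputed) value
        p = (1 - k) * p
        cleaned.append(x)
    return np.array(cleaned)

hr = np.array([72, 73, 71, 250, 74, 72], dtype=float)  # 250 is a spike
print(kalman_impute(hr).round(1))
```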
explain lowpass filtering
> parameter n?
lowpass filter: remove irrelevant frequencies from data
> remove data originating from a form of periodicity above a certain frequency and leave data at lower frequencies untouched
> use a transfer function: the higher the frequency, the lower the magnitude of the transfer function
> parameter n: order of the filter, the higher the order, the more steeply the magnitude drops for frequencies above the cutoff (see the example below)
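a small sketch using scipy's Butterworth lowpass filter; the cutoff frequency, sample rate and order are illustrative:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, cutoff_hz, sample_rate_hz, order=5):
    """Attenuate frequencies above cutoff_hz; order controls how steeply they drop."""
    nyquist = 0.5 * sample_rate_hz
    b, a = butter(order, cutoff_hz / nyquist, btype='low')
    return filtfilt(b, a, signal)  # zero-phase filtering

# 1 Hz signal plus a 20 Hz component, sampled at 100 Hz; keep only the 1 Hz part
t = np.arange(0, 5, 0.01)
x = np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
x_filtered = lowpass(x, cutoff_hz=2.0, sample_rate_hz=100.0)
```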
explain PCA
PCA:
> find lines (or hyperplanes if dimension p > 2) and order them in terms of how much variance they explain
> choose n <= p eigenvectors (principal components) and express the data as its projection onto those eigenvectors
> this reduces the dimensionality of the data, and we lose information, although that could be noise (sketch below)
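a minimal sketch with scikit-learn's PCA on made-up correlated data; n_components = 2 is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 observations with p = 5 correlated attributes (illustrative data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)               # keep n <= p components
X_reduced = pca.fit_transform(X)        # data projected onto the eigenvectors
print(pca.explained_variance_ratio_)    # variance explained by each component
```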