Outlier Identification and Removal Flashcards

1
Q

WHEN CAN WE USE STD OF SAMPLE AS CUT-OFF FOR IDENTIFYING OUTLIERS? P73

A

When the distribution is Gaussian or Gaussian-like, so that the 68-95-99.7 rule applies.

2
Q

WHAT ARE THE CUT-OFF VALUES FOR OUTLIERS IN GAUSSIAN/GAUSSIAN-LIKE DISTRIBUTION? P73

A

Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples, a cut-off of 2 standard deviations (about 95 percent of the data) can be used, and for larger samples, a cut-off of 4 standard deviations (about 99.99 percent) can be used.

3
Q

HOW CAN WE COMPUTE CUT-OFF FOR OUTLIERS IN GAUSSIAN/GAUSSIAN-LIKE DISTRIBUTION? (code) P74

A

from numpy import mean, std
data_mean, data_std = mean(data), std(data)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
Values below lower or above upper are outliers.
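
As a minimal sketch of the standard-deviation cut-off applied end to end, the snippet below builds a synthetic Gaussian sample, injects two extreme values, and splits the data into outliers and kept values. The helper name std_outlier_bounds, the random seed, and the sample parameters are illustrative assumptions, not taken from the source.

```python
import numpy as np

def std_outlier_bounds(data, k=3.0):
    """Return (lower, upper) limits k standard deviations from the mean."""
    data_mean, data_std = np.mean(data), np.std(data)
    cut_off = data_std * k
    return data_mean - cut_off, data_mean + cut_off

# Synthetic Gaussian sample (illustrative only) plus two injected extreme values.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [120.0, -30.0]])

lower, upper = std_outlier_bounds(data)

# Values outside [lower, upper] are treated as outliers; the rest are kept.
outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]

print(f"lower={lower:.2f}, upper={upper:.2f}")
print(f"outliers found: {len(outliers)}, rows kept: {len(cleaned)}")
```

Passing k=2 or k=4 gives the smaller-sample and larger-sample variants of the same rule.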

4
Q

WHAT IS A GOOD WAY FOR HANDLING OUTLIERS IN NON-GAUSSIAN DISTRIBUTED DATA MANUALLY? AND HOW IS IT CALCULATED?

A

Use the interquartile range (IQR), which is the difference between the 75th and 25th percentiles. Outliers are sample values that fall more than a factor k of the IQR below the 25th percentile or above the 75th percentile. A common value for k is 1.5.

5
Q

HOW TO FIND 25TH AND 75TH PERCENTILE? AND CALCULATE CUTOFF? CODE P76

A

from numpy import percentile
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
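
To make the IQR rule concrete, here is a small sketch on a hand-made sample with one obvious extreme value. The helper name iqr_outlier_bounds and the sample values are assumptions chosen for illustration; the percentiles use NumPy's default linear interpolation.

```python
import numpy as np

def iqr_outlier_bounds(data, k=1.5):
    """Return (lower, upper) limits k * IQR outside the 25th/75th percentiles."""
    q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
    iqr = q75 - q25
    cut_off = iqr * k
    return q25 - cut_off, q75 + cut_off

# Small illustrative sample; 100 is far from the rest of the values.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lower, upper = iqr_outlier_bounds(data)  # lower = -3.5, upper = 14.5

# Values outside [lower, upper] are flagged as outliers.
outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]

print(lower, upper)
print(outliers)  # only the value 100 is flagged
```

Because the limits depend on percentiles rather than the mean and standard deviation, a single extreme value barely shifts them.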

6
Q

WITH WHAT CLASS IN SKLEARN CAN WE AUTOMATICALLY DETECT OUTLIERS? WHAT IS ITS WEAKNESS? HOW DOES IT WORK? P77 (WORKED EXAMPLE P78 P79)

A

LocalOutlierFactor (in sklearn.neighbors).
It can work well for feature spaces with low dimensionality (few features), but it can become less reliable as the number of features increases, a problem referred to as the curse of dimensionality.
Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. The fit_predict() method marks each row in the training dataset as normal (1) or an outlier (-1).
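
The following sketch shows one way the 1 / -1 labels might be used to drop flagged rows. The synthetic two-feature dataset, the injected far-away point, and the seed are assumptions for illustration; LocalOutlierFactor is used with its default settings, which may need tuning on real data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-feature dataset (illustrative only): one dense cluster
# plus a single injected point far away from it in the last row.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])

# Default settings are assumed here (n_neighbors=20, contamination='auto').
lof = LocalOutlierFactor()
labels = lof.fit_predict(X)  # 1 = normal, -1 = outlier

# Keep only the rows that were not flagged as outliers.
mask = labels != -1
X_clean = X[mask]

print(f"rows before: {X.shape[0]}, rows after: {X_clean.shape[0]}")
print(f"injected point label: {labels[-1]}")
```

When a supervised model follows, the same mask would also be applied to the target array so that rows and labels stay aligned.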
