Outliers Flashcards
(32 cards)
An _ is a data point that differs significantly from other values in a dataset.
Outlier
A single data point that significantly deviates from the rest of the dataset.
Point Outlier
This type of outlier is an isolated data point that is far away from the main body of the data.
Global outlier
It is often easy to identify and remove.
This type of outlier is a data point that is unusual in a specific context but may not be an outlier in a different context.
Contextual outlier
It is often more difficult to identify and may require additional information or domain knowledge to determine its significance.
What causes outliers?
- variability in data
- measurement errors
- novel phenomena
Why is it important to identify and handle outliers?
Outliers can skew results and affect the performance of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.
The analysis of outlier data is referred to as _.
Outlier Analysis or Outlier Mining
_ plays a crucial role in ensuring the quality and accuracy of machine learning models.
Outlier detection
What are some techniques to detect outliers?
- statistical tests
- Z-Score
- Interquartile Range (IQR)
- visualization techniques
1. box plots
2. histograms
- machine learning algorithms
What are some ways to handle outliers?
- removing outliers
- transforming data
- using models that are robust to outliers
A _, such as the median, is relatively unaffected by extreme values.
resistant statistic
A statistic is resistant if it is relatively unaffected by extreme values.
Which statistic is resistant, the mean or the median?
The median (middle value) is resistant while the mean (average) is not.
df.cgpa.mean()
Example: World Gross (in millions)
With Harry Potter
Mean = $150,742,300
Median = $76,658,500
Without Harry Potter
Mean = $141,889,900
Median = $75,009,000
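A small sketch of the same effect, using made-up numbers rather than the box-office figures above: adding one extreme value moves the mean a lot and the median only slightly.

```python
from statistics import mean, median

# Hypothetical values chosen for illustration only.
values = [10, 12, 13, 15, 16]
with_outlier = values + [200]  # one extreme value added

# Without the extreme value
print(mean(values), median(values))

# With the extreme value: the mean jumps, the median barely moves
print(mean(with_outlier), median(with_outlier))
```

Comparing the two printed lines is also a simple way to run an analysis once with and once without a suspected outlier.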
What should you do if an outlier is not a mistake?
Run the analysis twice: once with the outlier and once without, to assess its impact.
This measures the spread of the data from the mean.
Standard deviation
df.cgpa.std()
Sample standard deviation: s
Population standard deviation: σ (“sigma”)
What does a larger standard deviation indicate?
A larger standard deviation indicates more variability and that the data are more spread out.
For a bell-shaped distribution, about _ of the data falls within two standard deviations of the mean.
95%
For a population, about 95% of the data will be between µ – 2σ and µ + 2σ
σ = sigma symbol
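The rule for bell-shaped data can be checked on simulated values. This is only a sketch: the sample is drawn from a normal distribution, and the mean of 50, standard deviation of 10, sample size, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)  # simulated bell-shaped data

mu = sample.mean()
sigma = sample.std()  # NumPy's default (ddof=0) is the population formula

# Fraction of values strictly within two standard deviations of the mean
within_two = np.mean((sample > mu - 2 * sigma) & (sample < mu + 2 * sigma))
print(f"{within_two:.3f}")
```

The printed fraction should be close to 0.95 for data like this. For skewed data it can be noticeably different.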
A _ indicates how many standard deviations a value is from the mean.
Z-score
z = (x - x̄) / s, where x̄ is the sample mean and s is the sample standard deviation
For a population, x̄ is replaced with µ and s is replaced with σ
σ = sigma symbol
Remove outliers using z-score
from scipy import stats
df['cgpa_zscore'] = stats.zscore(df.cgpa)
# keep only the rows whose z-score lies strictly between -3 and 3
df[(df.cgpa_zscore > -3) & (df.cgpa_zscore < 3)]
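The same filter can be sketched with pandas alone, which makes the z-score formula explicit. The data below are hypothetical; only the column name cgpa comes from the card. One detail worth knowing: scipy.stats.zscore uses ddof=0 by default, while pandas .std() uses ddof=1, so ddof=0 is passed here to match.

```python
import pandas as pd

# Hypothetical CGPA values with one extreme entry (2.0) at the end.
cgpa = [6.6, 6.7, 6.8, 6.9, 7.0, 7.0, 7.1, 7.1, 7.2, 7.2,
        7.3, 7.3, 7.4, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 2.0]
df = pd.DataFrame({"cgpa": cgpa})

mean = df["cgpa"].mean()
std = df["cgpa"].std(ddof=0)  # ddof=0 matches scipy.stats.zscore's default

# z = (x - mean) / standard deviation, computed for every row
df["cgpa_zscore"] = (df["cgpa"] - mean) / std

# Keep rows whose z-score lies strictly between -3 and 3
filtered = df[df["cgpa_zscore"].abs() < 3]
print(len(df), len(filtered))
```

In a very small sample no value can reach a z-score of 3, because the extreme value itself inflates the standard deviation, so a cutoff of 3 only makes sense with enough rows.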
Advantages of Z-score
- Straightforward
- Easy to use
- Useful for normally distributed data
- Quantifies deviation
Disadvantages of Z-score
- Sensitive to non-uniform or skewed data distributions
- Not effective in datasets with many outliers, because the outliers themselves inflate the mean and standard deviation
- Assumes data follows a normal distribution, making it less reliable for non-normal distributions
What is considered an extreme z-score?
A z-score below -2 or above 2
The _ divide data into four equal parts.
Quartiles
Q1 is the median of the lower half, Q3 is the median of the upper half.
What is a five-number summary?
minimum (Min) = smallest data value
Q1 = median of the values below m
median (m) = middle data value
Q3 = median of the values above m
maximum (Max) = largest data value
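The definitions above can be turned into a short function. This sketch assumes one common convention: when the number of values is odd, the median itself is left out of both halves. The function name and the sample data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                               # values below the median
    upper = data[half + 1:] if n % 2 else data[half:]  # values above the median
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data with nine values
print(five_number_summary([1, 3, 4, 7, 8, 10, 12, 15, 20]))
```

Library routines such as pandas quantile() interpolate between values by default, so their Q1 and Q3 can differ slightly from this hand calculation.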
The _ is the value that is greater than P% of the data.
Pth percentile
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.
We could also have used percentiles:
ACT score of 28: 91st percentile
SAT score of 2100: 97th percentile
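To show how a percentile is read off a dataset, here is a minimal sketch that counts the share of values strictly below a given score. The function name and the list of scores are hypothetical and are not the real ACT or SAT distributions.

```python
def percentile_rank(values, score):
    """Percentage of values that are strictly less than score."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

# Hypothetical test scores, for illustration only
scores = [12, 15, 18, 19, 21, 22, 24, 25, 28, 31]

# 8 of the 10 scores are below 28
print(percentile_rank(scores, 28))
```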
The _ is Q3 - Q1, representing the middle 50% of the data.
Interquartile Range (IQR)
Remove outliers using IQR
# the column name contains hyphens, so it needs bracket access
q1 = df['placement-exam-marks'].quantile(0.25)
q3 = df['placement-exam-marks'].quantile(0.75)
iqr = q3 - q1
iqr
upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)
df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]
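The IQR steps can be tried end to end on a tiny table. The marks below are hypothetical; only the column name placement-exam-marks comes from the card. The fences are set at 1.5 times the IQR beyond Q1 and Q3, as above.

```python
import pandas as pd

# Hypothetical marks with one unusually high value (95)
df = pd.DataFrame({"placement-exam-marks": [20, 22, 25, 27, 30, 31, 33, 35, 38, 95]})
marks = df["placement-exam-marks"]

q1 = marks.quantile(0.25)  # first quartile (pandas interpolates linearly by default)
q3 = marks.quantile(0.75)  # third quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # lower fence
upper = q3 + 1.5 * iqr  # upper fence

# Keep only the rows strictly inside the fences
kept = df[(marks > lower) & (marks < upper)]
print(lower, upper, len(kept))
```

Unlike the z-score, this approach does not depend on the mean or standard deviation, so a single extreme value does not shift the fences much.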