Lecture 6 Flashcards

1
Q

6.1 Explain the importance of finding outliers.

A

• Outliers are different from noise
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Left in the data, outliers can skew results

2
Q

6.1 Give concrete examples where finding outliers would be useful.

A

Applications:
– Credit card fraud detection (change in behaviour)
– Telecom fraud detection
– Medical analysis (unusual test results)
– Sports (identifying exceptional talent)

3
Q

6.2 What is an outlier?

A

A data object that deviates significantly from the normal objects as if it were generated by a different mechanism

4
Q

6.3 What is the difference between a global and a contextual outlier?

A

Global outlier (or point anomaly)
• An object that deviates significantly from the rest of the data set
– E.g., intrusion detection in computer networks
• Issue: finding an appropriate measurement of deviation
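One possible measurement of deviation is the z-score: how many standard deviations an object lies from the mean of the whole data set. The sketch below is only an illustration. The function name `global_outliers`, the threshold of 2.5, and the sample readings are assumptions, not part of the lecture.

```python
from statistics import mean, pstdev

def global_outliers(values, threshold=2.5):
    """Return values whose absolute z-score exceeds the threshold.

    The threshold is an assumed cut-off; a suitable value depends on the
    application and on the sample size.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up readings: nine similar values and one far away.
readings = [10, 11, 9, 10, 12, 11, 10, 9, 11, 50]
print(global_outliers(readings))  # prints [50]
```

Note that the outlier itself inflates the mean and the standard deviation, which is one reason the choice of deviation measure is listed as an issue.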

Contextual outlier (or conditional outlier)
• An object that deviates significantly with respect to a selected context
– E.g., is 5 degrees in Melbourne an outlier? It depends on whether it is summer or winter.
• Attributes of the data should be divided into two groups
– Contextual attributes: define the context, e.g., time & location
– Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Issue: how to define or formulate a meaningful context?

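The split into contextual and behavioral attributes can be sketched as follows: group the objects by the contextual attribute (season), then evaluate the behavioral attribute (temperature) only against other objects in the same group. The function name `contextual_outliers`, the threshold of 2.0, and the temperatures are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=2.0):
    """records: iterable of (context, value) pairs.

    Returns the (context, value) pairs whose value has an absolute z-score
    above the (assumed) threshold within its own context group.
    """
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    flagged = []
    for context, values in groups.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # no variation in this context
        flagged.extend(
            (context, v) for v in values if abs(v - mu) / sigma > threshold
        )
    return flagged

# Hypothetical temperatures in degrees Celsius.
summer = [26, 28, 30, 27, 29, 31, 25, 5]
winter = [6, 5, 7, 4, 8, 6, 5, 7]
records = [("summer", t) for t in summer] + [("winter", t) for t in winter]
print(contextual_outliers(records))  # prints [('summer', 5)]
```

The same reading of 5 degrees is flagged in the summer group but not in the winter group.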
5
Q

6.6 How can histograms be used for outlier detection?

A

• Divide the data into bins and count the objects in each bin. Objects that fall into bins with very low counts (very short bars) can be treated as outliers.
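A minimal sketch of this idea, assuming equal-width bins and a fixed count cut-off. The names `histogram_outliers`, `num_bins`, and `min_count` are hypothetical, and the data is made up.

```python
def histogram_outliers(values, num_bins=5, min_count=1):
    """Flag values that fall into bins holding at most min_count objects."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        return []  # all values identical, a single bin holds everything

    def bin_index(v):
        # Clamp so that the maximum value lands in the last bin.
        return min(int((v - lo) / width), num_bins - 1)

    counts = [0] * num_bins
    for v in values:
        counts[bin_index(v)] += 1

    return [v for v in values if counts[bin_index(v)] <= min_count]

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 50]
print(histogram_outliers(data))  # prints [50]
```

Here nine values share the first bin and the value 50 sits alone in the last one, so only 50 is reported.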

6
Q

6.7 What are some of the challenges of outlier detection?

A

• Modeling normal objects and outliers: the border between normal and outlier objects is often a grey area
• Application dependence: the choice of distance measure and of the model of relationships among objects is often application-dependent
– E.g., in clinical data a small deviation could be an outlier, while in marketing analysis larger fluctuations are normal
• Handling noise: noise may blur the distinction between normal objects and outliers, which can hide outliers and reduce the effectiveness of outlier detection
• Understandability: justify why the detected objects are outliers, and specify the degree to which each one is an outlier

7
Q

6.6 What are the advantages and disadvantages of using histograms for outlier detection?

A

Advantages:
• No prior information or domain knowledge is needed
• Makes fewer assumptions about the data, and is therefore applicable in more scenarios

Disadvantages:
• It is hard to choose an appropriate bin size for the histogram
– Bin size too small → normal objects end up in empty/rare bins (false positives)
– Bin size too big → outliers end up in frequent bins (false negatives)
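The effect of bin size can be seen on a small made-up data set. The sketch below assumes fixed-width bins starting at 0 and flags objects in bins that hold at most one object. The function name `rare_bin_members` and its parameters are hypothetical.

```python
from collections import Counter

def rare_bin_members(values, bin_width, min_count=1):
    """Return the values whose bin holds at most min_count objects."""
    bins = Counter(int(v // bin_width) for v in values)
    return [v for v in values if bins[int(v // bin_width)] <= min_count]

# Ten normal values (10..19) and one distant value.
data = list(range(10, 20)) + [45]

# Bin size too small: every object sits alone in its bin, so all 11 are flagged.
print(len(rare_bin_members(data, bin_width=1)))  # prints 11

# A moderate bin size isolates only the distant value.
print(rare_bin_members(data, bin_width=10))  # prints [45]

# Bin size too big: the distant value shares a bin with the rest and is missed.
print(rare_bin_members(data, bin_width=50))  # prints []
```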
