Lecture 6 Flashcards
(7 cards)
6.1 Explain the importance of finding outliers.
• Outliers are different from noise in the data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Undetected outliers can skew results
6.1 Give concrete examples where finding outliers would be useful.
Applications:
– Credit card fraud detection (change in behaviour)
– Telecom fraud detection
– Medical analysis (unusual test results)
– Sports (identifying exceptional talent)
6.2 What is an outlier?
A data object that deviates significantly from the normal objects as if it were generated by a different mechanism
6.3 What is the difference between a global and a contextual outlier?
Global outlier (or point anomaly)
• An object is a global outlier if it deviates significantly from the rest of the data set
– E.g., intrusion detection in computer networks
• Issue: finding an appropriate measurement of deviation
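Sketch (not from the lecture): one possible measurement of deviation is the z-score, i.e. the distance from the mean in standard deviations. The function name, the readings, and the threshold of 2.5 are all assumptions chosen for illustration; other deviation measures may suit a given application better.

```python
import statistics

def global_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds `threshold`.

    The z-score is only one possible deviation measure, and the
    threshold is an assumed, application-dependent cut-off.
    """
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up readings: nine ordinary values and one extreme one.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
flagged = global_outliers(readings)
print(flagged)
```

Note that the extreme value inflates the mean and standard deviation it is compared against, which is one reason the choice of deviation measure is an issue.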
Contextual outlier (or conditional outlier)
• An object is a contextual outlier if it deviates significantly based on a selected context
– E.g., is 5 degrees in Melbourne an outlier? (It depends on whether it is summer or winter.)
• Attributes of data should be divided into two groups
– Contextual attributes: define the context, e.g., time & location
– Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Issue: how to define or formulate a meaningful context?
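Sketch (not from the lecture): the Melbourne example can be illustrated by grouping records on the contextual attribute (season) and evaluating the behavioral attribute (temperature) only within its own group. The temperatures, the z-score test, and the threshold of 2.0 are assumptions made up for this illustration.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs that deviate within their own context.

    `records` is a list of (contextual attribute, behavioral attribute)
    pairs. The z-score test and threshold are illustrative assumptions.
    """
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    flagged = []
    for context, value in records:
        peers = groups[context]
        mean = statistics.mean(peers)
        stdev = statistics.pstdev(peers)
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append((context, value))
    return flagged

# Made-up temperatures: a 5-degree day appears in both seasons.
summer = [("summer", t) for t in [26, 28, 30, 27, 29, 31, 28, 5]]
winter = [("winter", t) for t in [5, 7, 6, 8, 4, 6, 7, 5]]
flagged = contextual_outliers(summer + winter)
print(flagged)
```

Only the summer reading of 5 is flagged; the same value in winter is normal for its context.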
6.6 How can histograms be used for outlier detection?
• Divide the data into bins and count the objects in each bin. Objects that fall into bins with a very low count (short bars) can be treated as outliers.
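Sketch (not from the lecture): a minimal equal-width histogram detector. The function name, the number of bins, and the `min_count` cut-off for a "rare" bin are assumptions chosen for illustration.

```python
def histogram_outliers(values, num_bins=5, min_count=1):
    """Return the values that fall into bins holding at most `min_count` objects.

    Uses equal-width bins over [min(values), max(values)]; both
    parameters are assumed and would need tuning on real data.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return []  # a single distinct value cannot be binned meaningfully
    width = (hi - lo) / num_bins

    def bin_index(v):
        # Clamp so that the maximum value lands in the last bin.
        return min(int((v - lo) / width), num_bins - 1)

    counts = [0] * num_bins
    for v in values:
        counts[bin_index(v)] += 1
    return [v for v in values if counts[bin_index(v)] <= min_count]

# Made-up data: twelve values between 20 and 25, plus one far away.
values = [20, 21, 22, 23, 24, 25, 21, 22, 23, 24, 22, 23, 95]
flagged = histogram_outliers(values)
print(flagged)
```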
6.7 What are some of the challenges of outlier detection?
– The border between normal and outlier objects is often a grey area
– The choice of distance measure and of the model of relationships among objects is often application-dependent
– E.g., in clinical data a small deviation could be an outlier, while in marketing analysis larger fluctuations are normal
• Handling noise in outlier detection: noise may blur the distinction between normal objects and outliers, which can hide outliers and reduce the effectiveness of detection
• Understandability: justify why the detected objects are outliers
– Specifying the degree of an outlier
6.6 What are the advantages and disadvantages for using histograms for outlier detection?
Advantages:
• No prior information or domain knowledge is needed
• Makes fewer assumptions about the data, and thus can be applied in more scenarios
Disadvantages:
• It is hard to choose an appropriate bin size for the histogram
– Bin size too small → normal objects fall in empty/rare bins (false positives)
– Bin size too big → outliers fall in frequent bins (false negatives)
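Sketch (not from the lecture): the bin-size trade-off shown on made-up data with two clusters and one object (30) lying between them. The helper name, the data, and the three bin counts are assumptions chosen to make the effect visible.

```python
def flag_rare_bins(values, num_bins, min_count=1):
    """Return values in equal-width bins holding at most `min_count` objects."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return []
    width = (hi - lo) / num_bins

    def bin_index(v):
        # Clamp so that the maximum value lands in the last bin.
        return min(int((v - lo) / width), num_bins - 1)

    counts = [0] * num_bins
    for v in values:
        counts[bin_index(v)] += 1
    return [v for v in values if counts[bin_index(v)] <= min_count]

# Made-up data: two clusters and a single object in between.
values = [10, 11, 12, 13, 14, 30, 50, 51, 52, 53, 54]

too_coarse = flag_rare_bins(values, num_bins=2)   # wide bins
just_right = flag_rare_bins(values, num_bins=5)   # moderate bins
too_fine = flag_rare_bins(values, num_bins=44)    # bins of width 1

print("2 bins: ", too_coarse)
print("5 bins: ", just_right)
print("44 bins:", too_fine)
```

With 2 bins the object at 30 shares a frequent bin with the first cluster and is missed (false negative); with 44 bins most normal objects sit alone in their bins and are flagged as well (false positives).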