Lecture 9 Flashcards
(12 cards)
9.1 Why can it be useful to normalise each feature into the range [0, 1] before computing Euclidean distance between vectors?
- It makes distance computations weight the contribution of each feature more evenly, so features with large numeric ranges do not dominate.
- Useful for nearest-neighbour methods such as k-NN prediction and k-means clustering, and for methods such as the matrix-factorisation approach to imputation.
- Also useful when plotting heat maps to show the intensity of objects across different features.
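A minimal NumPy sketch of min-max normalisation (the function name `min_max_normalise` and the toy age/income data are illustrative, not from the lecture):

```python
import numpy as np

def min_max_normalise(X):
    """Rescale each feature (column) into [0, 1] so no feature dominates."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant features
    return (X - col_min) / col_range

# Example: without normalisation, 'income' swamps 'age' in Euclidean distance.
X = np.array([[25, 30_000],
              [40, 90_000],
              [33, 45_000]])
print(min_max_normalise(X))
```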
9.2 Understand the following methods for outlier detection:
i) Distance from the centre of the data, and its relative advantages/disadvantages
• Compute the Euclidean distance of each object from the “centre” of the data. The further an object is from the centre (nearer the edge), the more likely it is to be an outlier.
– The outlier score of an instance should be relative to its locality, not to the whole dataset, for more accurate results
+ Simple
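A short sketch of this scoring scheme, assuming NumPy (the function name `centre_distance_scores` and the toy data are illustrative):

```python
import numpy as np

def centre_distance_scores(X):
    """Outlier score = Euclidean distance from the mean ('centre') of the data."""
    X = np.asarray(X, dtype=float)
    centre = X.mean(axis=0)
    return np.linalg.norm(X - centre, axis=1)

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [8.0, 8.0]])
scores = centre_distance_scores(X)
print(scores)             # the last point scores far higher than the rest
print(np.argmax(scores))  # index of the most outlying object
```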
9.3 Why is performing clustering on a dataset useful?
• Useful not just for outlier detection, for example:
– Market segmentation
– Image analysis
– Search engine result presentation
9.4 What are the steps of the k-means algorithm?
• Given parameter k, the k-means algorithm is implemented in four steps:
- Select k seed points as the initial cluster centres
- Assign each object to the cluster with the nearest seed point
- Compute new seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
- Go back to Step 2, stop when the assignment does not change
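The four steps map directly onto code. A minimal NumPy sketch, assuming no cluster ever becomes empty (the function name `k_means` and the random-seed choice are illustrative):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: select k objects at random as the initial seed points.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the centroid (mean point) of its
        # cluster (assumes no cluster becomes empty, which can happen in practice).
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centres (and hence the assignment) no longer change.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```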
9.5 Identify scenarios where the k-means algorithm may perform poorly.
- The clusters are expected to be of similar size; k-means does not do well when that is not the case
- It works well on some data sets while failing on others, so its performance is not consistent
- Suboptimal splits may be generated, especially with randomly chosen seed points
9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
i) Single linkage
and the advantages/ disadvantages of each.
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters (determined by one pair of points, i.e., by one link in the proximity graph)
+ Can handle non-elliptical shapes
– Sensitive to noise and outliers
What are the advantages of clustering algorithms in general?
+ Work for many types of data
+ Clusters can be regarded as summaries of the data
+ Once the clusters are obtained, need only compare any object against the clusters to determine whether it is an outlier (fast)
9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
ii) Complete linkage
and the advantages/ disadvantages of each.
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
(Determined by all pairs of points in the two clusters)
+ Less susceptible to noise and outliers
– Tends to break large clusters
– Biased towards globular clusters
9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
iii) Average linkage
and the advantages/ disadvantages of each.
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters.
• Need to use average connectivity for scalability since total proximity favors large clusters
• Compromise between Single and Complete Link
+ Less susceptible to noise and outliers
– Biased towards globular clusters
– Once a decision is made to combine two clusters, it cannot be undone
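All three linkage criteria from the 9.6 cards (single, complete, average) can be tried with SciPy's hierarchical-clustering routines. A sketch under assumed toy data, cutting the dendrogram into two clusters for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2]])

# Agglomerative clustering: start with every point in its own cluster,
# then repeatedly merge the two closest clusters, where "closest" is
# defined by the chosen linkage criterion.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 flat clusters
    print(method, labels)
```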
9.2 Understand the following methods for outlier detection:
ii) Clustering-based outlier detection, and its relative advantages/disadvantages
• Each instance is associated with exactly one cluster and its outlier score is equal to the distance from its cluster centre.
– Need an automatic algorithm that computes the cluster centroids and assigns each object to exactly one cluster.
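A sketch using scikit-learn's KMeans as the automatic clustering algorithm (the toy data and the choice of two clusters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [2.5, 9.0]])

# k-means assigns each object to exactly one cluster; the outlier score is
# the distance from each object to its own cluster centre.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
own_centres = km.cluster_centers_[km.labels_]
scores = np.linalg.norm(X - own_centres, axis=1)
print(scores)  # the point at (2.5, 9.0) gets the largest score
```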
9.2 Understand the following methods for outlier detection:
iii) k-nearest-neighbour-based outlier detection, and its relative advantages/disadvantages
• The outlier score of an object is the distance to its k-th nearest neighbour (its k-NN distance)
• Given an outlier score associated with each object, sort the objects in order of score (highest to lowest) and select the n objects with the highest outlier scores
– Hard to determine the best value of k
+ Gives good results
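A minimal NumPy sketch of k-NN outlier scoring using brute-force pairwise distances, which is fine for small data sets (`knn_outlier_scores` and the toy data are illustrative):

```python
import numpy as np

def knn_outlier_scores(X, k):
    """Outlier score of each object = distance to its k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # column 0 is the zero distance to itself

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [4.0, 4.0]])
scores = knn_outlier_scores(X, k=2)
top_n = np.argsort(scores)[::-1][:1]  # the n (= 1) highest-scoring objects
print(scores, top_n)
```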
What are the disadvantages of clustering algorithms in general?
– Clustering is expensive and does not scale up well for large data sets
– Sensitivity to noise and outliers
– Difficulty handling clusters of different sizes and non-convex shapes
– Breaking large clusters