Lecture 9 Flashcards

1
Q

9.1 Why can it be useful to normalise each feature into the range [0, 1] before computing the Euclidean distance between vectors?

A
  • Makes distance computations weight the contribution of each feature more evenly.
  • Useful for nearest-neighbour methods such as k-NN prediction and k-means clustering, and for methods such as the matrix factorisation approach to imputation.
  • Also helpful when plotting heat maps to show the intensity of objects across different features.
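
For illustration, a minimal NumPy sketch of min-max normalisation applied before a Euclidean distance computation (the data values and names here are made up):

import numpy as np

def min_max_normalise(X):
    """Rescale each column (feature) of X into the range [0, 1]."""
    mins = X.min(axis=0)
    ranges = X.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # avoid division by zero for constant features
    return (X - mins) / ranges

# Two features on wildly different scales, e.g. age vs. salary.
X = np.array([[21.0, 50000.0],
              [30.0, 52000.0],
              [25.0, 51000.0]])

Xn = min_max_normalise(X)
# Before normalisation, salary dominates the distance; afterwards both
# features contribute on the same [0, 1] scale.
print(np.linalg.norm(Xn[0] - Xn[1]))
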
2
Q
9.2 Understand the following methods for outlier detection:

i) Distance from the centre of the data, and its relative advantages/disadvantages

A

• Compute the (Euclidean) distance of each object from the “centre” of the data. The further an object is from the centre (nearer the edge), the more likely it is to be an outlier.
– The outlier score of an instance should be relative to its locality, not to the whole dataset, for more accurate results.
+ Simple.
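
A minimal sketch of this method, assuming the “centre” is taken to be the feature-wise mean (the data values are illustrative):

import numpy as np

def centre_distance_scores(X):
    centre = X.mean(axis=0)                    # "centre" of the data
    return np.linalg.norm(X - centre, axis=1)  # Euclidean distance per object

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [8.0, 9.0]])
scores = centre_distance_scores(X)
print(scores.argmax())  # index of the most outlying object (here, 3)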

3
Q

9.3 Why is performing clustering on a dataset useful?

A

• Clustering is useful not just for outlier detection, but also for:
– Market segmentation
– Image analysis
– Search engine result presentation

4
Q

9.4 What are the steps of the k-means algorithm?

A

• Given the parameter k, the k-means algorithm is implemented in four steps (sketched in code after the list):

  1. Select k seed points as the initial cluster centres
  2. Assign each object to the cluster with the nearest seed point
  3. Compute new seed points as the centroids of the clusters of the current partitioning (the centroid is the centre, i.e., the mean point, of the cluster)
  4. Go back to Step 2; stop when the assignments no longer change
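
A minimal NumPy sketch of these four steps (not a production implementation; it does not handle empty clusters, for instance):

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k seed points as the initial cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each object to the cluster with the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new = dists.argmin(axis=1)
        # Step 4: stop when the assignments no longer change.
        if assignments is not None and np.array_equal(new, assignments):
            break
        assignments = new
        # Step 3: recompute each centre as the centroid (mean) of its cluster.
        centres = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
    return centres, assignments

# Example usage: centres, labels = k_means(data, k=3)
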
5
Q

9.5 Identify scenarios where the k-means algorithm may perform poorly.

A
  1. It assumes clusters of similar size, so it performs poorly when cluster sizes differ substantially
  2. It works well on some data sets while failing on others, so its behaviour is not consistent
  3. Suboptimal splits may be generated, especially with randomly chosen seed points (see the sketch below)
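
Point 3 can be illustrated by running k-means from a single random initialisation under different seeds and comparing the within-cluster sum of squares; a sketch assuming scikit-learn is available (the data set is synthetic):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of very different sizes (200, 20 and 20 points).
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(5, 0.5, (20, 2)),
               rng.normal(10, 0.5, (20, 2))])

for seed in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    # Different seeds can converge to different (suboptimal) splits,
    # visible as different inertia values.
    print(seed, km.inertia_)
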
6
Q

9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
i) Single linkage
and the advantages/disadvantages of each.

A

• Similarity of two clusters is based on the two most similar (closest) points in the different clusters (determined by one pair of points, i.e., by one link in the proximity graph)
+ Can handle non-elliptical shapes
– Sensitive to noise and outliers
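
A brief sketch using SciPy's agglomerative clustering, where the method argument selects the linkage criterion ('complete' and 'average', covered in the next cards, are chosen the same way):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
Z = linkage(X, method='single')                  # merge order and distances
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)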

7
Q

What are the advantages of clustering algorithms in general?

A

+ Work for many types of data
+ Clusters can be regarded as summaries of the data
+ Once the clusters are obtained, any object need only be compared against the clusters to determine whether it is an outlier (fast)

8
Q

9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
ii) Complete linkage
and the advantages/disadvantages of each.

A

• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
(Determined by all pairs of points in the two clusters)
+ Less susceptible to noise and outliers
– Tends to break large clusters
– Biased towards globular clusters

9
Q

9.6 Explain the steps of (agglomerative) hierarchical clustering, using:
iii) Average linkage
and the advantages/disadvantages of each.

A

• Proximity of two clusters is the average of pairwise proximity between points in the two clusters (restated in code below).
• Need to use average connectivity for scalability, since total proximity favours large clusters.
• A compromise between single and complete link.
+ Less susceptible to noise and outliers
– Biased towards globular clusters
– Once a decision is made to combine two clusters, it cannot be undone
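
The definition in the first bullet, restated as a small NumPy sketch (A and B are illustrative arrays of points from the two clusters):

import numpy as np

def average_link_proximity(A, B):
    # Average of all pairwise Euclidean distances between the two clusters.
    diffs = A[:, None, :] - B[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

print(average_link_proximity(np.array([[0.0, 0.0], [1.0, 0.0]]),
                             np.array([[4.0, 0.0], [5.0, 0.0]])))  # 4.0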

10
Q
9.2 Understand the following methods for outlier detection:

ii) Clustering-based outlier detection, and its relative advantages/disadvantages

A

• Each instance is associated with exactly one cluster and its outlier score is equal to the distance from its cluster centre.
– Need an automatic algorithm that computes the cluster centroids and assigns each object to exactly one cluster.
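
A minimal sketch of this scheme, using k-means (an arbitrary choice here) as the automatic clustering algorithm:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),      # two normal blobs...
               rng.normal(5, 0.5, (100, 2)),
               [[2.5, 10.0]]])                    # ...plus one planted outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Outlier score = distance from each object to its own cluster centre.
scores = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(scores.argmax())  # expected: 200, the planted outlier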

11
Q
9.2 Understand the following methods for outlier detection:

iii) k-nearest-neighbour-based outlier detection, and its relative advantages/disadvantages

A

• The outlier score of an object is the distance to its k-th nearest neighbour (its k-NN distance).
• Given an outlier score for each object, sort the objects by score (highest to lowest) and select the n objects with the highest scores as outliers.
– Hard to determine the best value of k
+ Gives good results
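
A minimal sketch assuming scikit-learn's NearestNeighbors; the values of k and n here are arbitrary:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               [[6.0, 6.0]]])  # one planted outlier

k, n = 5, 3
# Ask for k+1 neighbours because each point's nearest neighbour is itself.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
scores = dists[:, k]                  # distance to the k-th nearest neighbour
top_n = np.argsort(scores)[::-1][:n]  # the n objects with the highest scores
print(top_n)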

12
Q

What are the disadvantages of clustering algorithms in general?

A

– Clustering is expensive and does not scale well to large data sets
– Sensitive to noise and outliers
– Difficulty handling clusters of different sizes and non-convex shapes
– Tends to break large clusters
