L14 - K Means, GMM, Clustering Measures Flashcards

Q

Hierarchical clustering is a deterministic method. What does this mean?

A

Repeating the method on the same data always produces the same result, i.e. the same dendrogram.

Q

What is the issue with Hierarchical clustering?

A

Slow -> worst-case time complexity can reach O(N^3)

Q

What are ‘partition clustering methods’?

A
  1. Partitions the dimensional space, as opposed to clustering the points directly
  2. Breaks the space into K non-overlapping partitions
  3. Each of the N data points can then be compared against the K clusters
Q

What is K-means clustering?

A

A partition clustering method in which each cluster is represented by its centroid (mean).

Q

What is the process of K-means clustering?

A
  1. Input: N data points and a number of clusters K
  2. Choose K random data points to be the initial cluster centroids
  3. Assign every point in N to its nearest centroid
  4. Recompute the centroid of each cluster as the mean of its assigned points
  5. Check for convergence or termination criteria
  6. If not converged, repeat steps 3 to 5
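The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation, and the two-blob dataset at the bottom is made up for the demo:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: terminate once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: K-means should split them cleanly
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans(data, k=2)
```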
Q

What are some options for the termination criteria?

A
  1. Few or no re-assignments of data points to different clusters
  2. Few or no changes to the centroids
  3. Minimal change in the sum of squared errors (SSE)
Q

What are the pros and cons of using K-means?

A
  1. Pros:
    1. Easy to understand and implement
    2. Efficient -> O(ndkt), where n = points, d = dimensions, k = clusters, t = iterations -> linear in n
  2. Cons:
    1. Requires a centroid to be calculable -> not possible with categorical data
    2. Effectiveness depends on the selection of K
    3. Sensitive to outliers
Q

What are the limitations of K-means?

A
  1. Struggles with clusters of different sizes
  2. Struggles with clusters of different shapes
  3. Struggles with clusters of different densities
  4. Results depend on the starting centroids -> the same data can be clustered completely differently with different initial centroids
Q

How do we overcome the limitations of K-means?

A

Pre-processing -> eliminate outliers, and standardise or normalise the data.

Post-processing -> merge nearby clusters with low SSE, split loose clusters, and eliminate small clusters that may represent outliers.

Q

What method can we use to choose K?

A

Elbow method -> plot the SSE for multiple values of K and pick the K at the "elbow", where the curve stops dropping sharply.
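The elbow method can be sketched in NumPy as follows. This is a minimal illustration: the three-blob dataset is synthetic and made up for the demo, and the inner K-means is a bare-bones version of the algorithm described earlier.

```python
import numpy as np

def kmeans_sse(points, k, max_iters=50, seed=0):
    """Run a basic K-means and return the final sum of squared errors (SSE)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((points - centroids[labels]) ** 2).sum())

# Three well-separated synthetic blobs: the SSE curve should
# drop sharply up to K=3 and flatten afterwards (the "elbow")
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])
sse = {k: kmeans_sse(data, k) for k in range(1, 7)}
```

In practice the `sse` values would be plotted against K and the elbow read off visually.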

Q

Since K-means is sensitive to outliers, what clustering method can be used as an alternative?

A

GMM -> Gaussian Mixture Models

Q

What is GMM? What issues of K-means does it solve?

A
  1. A clustering method that assigns points to clusters based on probabilities
  2. E.g. point A is 30% cluster blue and 70% cluster green
  3. Solves K-means' sensitivity to outliers
Q

What are the pros and cons of GMM compared to K-means?

A
  1. Pros:
    1. Quicker convergence
    2. Deals well with outliers
  2. Cons:
    1. More complex
    2. More computationally expensive
Q

Give the main differences between K-means and GMM.

A
  1. K-means:
    1. Minimises the sum of squared distances
    2. Hard assignment of points to clusters
    3. Assumes spherical clusters
  2. GMM:
    1. Maximises the log-likelihood
    2. Soft assignment of points to clusters
    3. Assumes elliptical clusters
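The hard vs. soft assignment difference can be illustrated with a tiny 1-D example. The component means, variances, and mixing weights below are made-up values for the demo: GMM's E-step computes a responsibility P(component | point) for every component, whereas K-means would keep only the single nearest assignment.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical mixture: two 1-D components with equal mixing weights
means = np.array([0.0, 4.0])
variances = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

x = 1.0  # a point lying between the component means, nearer the first

# GMM E-step: soft assignment via Bayes' rule
likelihoods = weights * gaussian_pdf(x, means, variances)
responsibilities = likelihoods / likelihoods.sum()  # roughly [0.98, 0.02]

# K-means-style hard assignment keeps only the most likely component
hard_label = int(responsibilities.argmax())
```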
Q

How can we tell if our clustering is correct?

A
  1. External validation -> Measure against externally supplied labels -> Use distance or incidence matrix
  2. Internal validation -> Measure whether points that should be close/far from each other are actually so -> Use silhouette coefficient
  3. Relative validation -> Compare one clustering method to another and see if they agree
Q

What is the silhouette coefficient? What is the best and worst values for it?

A
  1. A coefficient used in internal validation of clustering, i.e. checking that distances between points are as expected
  2. Best value = 1, worst value = -1
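For a point i, the silhouette coefficient is s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to the other points in i's own cluster and b_i is the smallest mean distance to the points of any other cluster. A minimal NumPy sketch (the four points and their labels are made up for the demo):

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient: s(i) = (b_i - a_i) / max(a_i, b_i)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(n):
        own = int(labels[i])
        # a_i: mean distance to the other points in the same cluster
        mask = (labels == own) & (np.arange(n) != i)
        a = dists[i, mask].mean()
        # b_i: smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == c].mean() for c in set(labels.tolist()) - {own})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# Two tight, well-separated clusters -> silhouette close to +1
good = mean_silhouette(pts, np.array([0, 0, 1, 1]))
# Deliberately scrambled labels -> poor (negative) silhouette
bad = mean_silhouette(pts, np.array([0, 1, 0, 1]))
```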