L14 - K Means, GMM, Clustering Measures Flashcards

Q

Hierarchical clustering is a deterministic method. What does this mean?

A

Repeating the method on the same data always produces the same result, i.e. the same dendrogram.

Q

What is the issue with Hierarchical clustering?

A

Slow -> worst-case time complexity can reach O(N^3)

Q

What are ‘partition clustering methods’?

A
  1. Partitions the dimensional space, as opposed to clustering the points directly
  2. Breaks the space into K non-overlapping partitions
  3. Each of the N data points can then be compared against the K clusters
Q

What is K-means clustering?

A

A partition clustering method in which each cluster is represented by its centroid (mean).

Q

What is the process of K-means clustering?

A
  1. Input: N data points and a number of clusters K
  2. Choose K random data points to be the initial cluster centroids
  3. Assign every point in N to its nearest centroid
  4. Recompute the centroid of each cluster as the mean of its assigned points
  5. Check for convergence or termination criteria
  6. If not converged, repeat steps 3 to 5
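The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation, and the two-blob dataset at the bottom is made up for the demo:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: terminate once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: K-means should split them cleanly
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centroids, labels = kmeans(data, k=2)
```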
Q

What are some options for the termination criteria?

A
  1. Few or no re-assignments of data points to different clusters
  2. Few or no changes to the centroids
  3. Minimal change in the sum of squared errors (SSE)
Q

What are the pros and cons of using K-means?

A
  1. Pros:
    1. Easy to understand and implement
    2. Efficient -> O(ndkt), where n = points, d = dimensions, k = clusters, t = iterations -> linear in n
  2. Cons:
    1. Requires a centroid to be calculable -> not possible with categorical data
    2. Effectiveness depends on the selection of K
    3. Sensitive to outliers
Q

What are the limitations of K-means?

A
  1. Struggles with clusters of different sizes
  2. Struggles with clusters of different shapes
  3. Struggles with clusters of different densities
  4. Results depend on the starting centroids -> the same data can be clustered completely differently with different initial centroids
Q

How do we overcome the limitations of K-means?

A

Pre-processing -> eliminate outliers, and standardise or normalise the data.

Post-processing -> merge nearby clusters with low SSE, split loose clusters, and eliminate small clusters that may represent outliers.

Q

What method can we use to choose K?

A

Elbow method -> plot the SSE for multiple values of K and pick the K at the "elbow", where the curve stops dropping sharply.
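The elbow method can be sketched in NumPy as follows. This is a minimal illustration: the three-blob dataset is synthetic and made up for the demo, and the inner K-means is a bare-bones version of the algorithm described earlier.

```python
import numpy as np

def kmeans_sse(points, k, max_iters=50, seed=0):
    """Run a basic K-means and return the final sum of squared errors (SSE)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((points - centroids[labels]) ** 2).sum())

# Three well-separated synthetic blobs: the SSE curve should
# drop sharply up to K=3 and flatten afterwards (the "elbow")
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])
sse = {k: kmeans_sse(data, k) for k in range(1, 7)}
```

In practice the `sse` values would be plotted against K and the elbow read off visually.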

Q

Since K-means is sensitive to outliers, what clustering method can be used as an alternative?

A

GMM -> Gaussian Mixture Models

Q

What is GMM? What issues of K-means does it solve?

A
  1. A clustering method that assigns points to clusters based on probabilities
  2. E.g. point A is 30% cluster blue and 70% cluster green
  3. Solves K-means' sensitivity to outliers
Q

What are the pros and cons of GMM compared to K-means?

A
  1. Pros:
    1. Quicker convergence
    2. Deals well with outliers
  2. Cons:
    1. More complex
    2. More computationally expensive
Q

Give the main differences between K-means and GMM.

A
  1. K-means:
    1. Minimises the sum of squared distances
    2. Hard assignment of points to clusters
    3. Assumes spherical clusters
  2. GMM:
    1. Maximises the log-likelihood
    2. Soft assignment of points to clusters
    3. Assumes elliptical clusters
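The hard vs. soft assignment difference can be illustrated with a tiny 1-D example. The component means, variances, and mixing weights below are made-up values for the demo: GMM's E-step computes a responsibility P(component | point) for every component, whereas K-means would keep only the single nearest assignment.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical mixture: two 1-D components with equal mixing weights
means = np.array([0.0, 4.0])
variances = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

x = 1.0  # a point lying between the component means, nearer the first

# GMM E-step: soft assignment via Bayes' rule
likelihoods = weights * gaussian_pdf(x, means, variances)
responsibilities = likelihoods / likelihoods.sum()  # roughly [0.98, 0.02]

# K-means-style hard assignment keeps only the most likely component
hard_label = int(responsibilities.argmax())
```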
Q

How can we tell if our clustering is correct?

A
  1. External validation -> Measure against externally supplied labels -> Use distance or incidence matrix
  2. Internal validation -> Measure whether points that should be close/far from each other are actually so -> Use silhouette coefficient
  3. Relative validation -> Compare one clustering method to another and see if they agree
Q

What is the silhouette coefficient? What is the best and worst values for it?

A
  1. A coefficient used in internal validation of clustering, i.e. checking that distances between points are as expected
  2. Best value = 1, worst value = -1
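For a point i, the silhouette coefficient is s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to the other points in i's own cluster and b_i is the smallest mean distance to the points of any other cluster. A minimal NumPy sketch (the four points and their labels are made up for the demo):

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient: s(i) = (b_i - a_i) / max(a_i, b_i)."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(n):
        own = int(labels[i])
        # a_i: mean distance to the other points in the same cluster
        mask = (labels == own) & (np.arange(n) != i)
        a = dists[i, mask].mean()
        # b_i: smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == c].mean() for c in set(labels.tolist()) - {own})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# Two tight, well-separated clusters -> silhouette close to +1
good = mean_silhouette(pts, np.array([0, 0, 1, 1]))
# Deliberately scrambled labels -> poor (negative) silhouette
bad = mean_silhouette(pts, np.array([0, 1, 0, 1]))
```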