K-Means, GMM, Cluster Validation Flashcards

1
Q

What is an issue with K-Means that GMM solves?

A

K-Means is hard clustering: each point is assigned to exactly one cluster, which makes it inflexible when clusters overlap.

GMM is soft clustering: it assigns each point a probability of belonging to each cluster.

2
Q

Give the steps for K-Means

A
  1. Choose K initial centroids in the data space
  2. Assign each data point to its nearest centroid
  3. Recompute each centroid as the centre (mean) of its cluster
  4. Calculate the variance of each cluster
  5. Repeat steps 2-4 until a termination criterion is met.
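
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the random choice of initial centroids from the data, the `seed` and `max_iters` parameters, and the use of total squared error as the variance measure in step 4 are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-Means on a list of equal-length tuples.

    Returns (centroids, labels, sse).
    """
    rng = random.Random(seed)
    # Step 1: choose K centroids (assumed here: K distinct data points at random).
    centroids = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Termination: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Step 4: total within-cluster squared error (computed once at the end here).
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, sse = kmeans(data, k=2)
print(labels, round(sse, 3))
```

The result depends on the initial centroids, so different seeds can give different clusterings on less cleanly separated data.
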
3
Q

What are the termination criteria for K-Means?

A
  • No more data reallocations
  • No change in centroid position
  • No change in squared error
  • N iterations have been conducted (a preset maximum)
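
As a rough sketch, the four criteria can be checked between iterations with a helper like the one below. The name `should_stop`, its argument list, and the tolerance `tol` are hypothetical, chosen only for illustration.

```python
import math

def should_stop(old_labels, new_labels, old_centroids, new_centroids,
                old_sse, new_sse, iteration, max_iters, tol=1e-9):
    """Return the name of the first termination criterion met, or None."""
    # No more data reallocations: every point kept its cluster.
    if old_labels is not None and old_labels == new_labels:
        return "no reallocations"
    # No change in centroid position (within a small tolerance).
    if old_centroids is not None and all(
        math.dist(a, b) <= tol for a, b in zip(old_centroids, new_centroids)
    ):
        return "centroids unchanged"
    # No change in squared error (within a small tolerance).
    if old_sse is not None and abs(old_sse - new_sse) <= tol:
        return "squared error unchanged"
    # Iteration limit reached.
    if iteration >= max_iters:
        return "iteration limit reached"
    return None
```

In practice an implementation usually checks only one or two of these, most often the iteration limit plus one convergence test.
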
4
Q

What is the time complexity of K-Means?

A

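O(NK) per iteration, where N is the number of data points and K is the number of clusters. When K is a small constant, this simplifies to O(N).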

5
Q

What are some cons of using K-Means?

A
  • Doesn’t handle outliers well.
  • Affected by differences in the shape, size and density of clusters.
  • Hard clustering -> Points can only be in one cluster.
6
Q

Explain how GMM works…

A
  • A soft clustering algorithm that gives the probability of each point belonging to each cluster.
  • Each cluster is modelled by its own Gaussian function.
  • A data point X is evaluated under the Gaussian function of each cluster to establish the probability that it belongs to that cluster.
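
The soft assignment can be sketched for one-dimensional data as follows. The component parameters (weight, mean, standard deviation) are made up for the example; in a real GMM they would be fitted to the data, typically with the EM algorithm. The function names `gaussian_pdf` and `responsibilities` are also illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Probability that x belongs to each (weight, mean, std) component."""
    # Weight each cluster's Gaussian density by its mixing weight...
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    # ...then normalise so the probabilities sum to 1.
    total = sum(weighted)
    return [v / total for v in weighted]

# Two assumed clusters centred at 0 and 4, equal weight and spread.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
print(responsibilities(2.0, components))  # midpoint -> [0.5, 0.5]
print(responsibilities(0.0, components))  # almost entirely the first cluster
```

Unlike K-Means, the point at 2.0 is not forced into one cluster; it is reported as equally likely to belong to either.
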
7
Q

Explain what the distance matrix is in cluster validation…

A

A matrix that holds the Euclidean distance between every pair of data points.
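
A small sketch of building such a matrix, assuming the data points are tuples of coordinates (the function name `distance_matrix` is illustrative):

```python
import math

def distance_matrix(points):
    """Entry [i][j] is the Euclidean distance between points i and j."""
    return [[math.dist(p, q) for q in points] for p in points]

for row in distance_matrix([(0, 0), (3, 4), (6, 8)]):
    print(row)
```

The matrix is symmetric and has zeros on the diagonal, since each point is at distance 0 from itself.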

8
Q

Explain what the incidence matrix is in cluster validation…

A

A matrix that marks which data points are clustered together. An entry is 1 if the two data points belong to the same cluster and 0 otherwise.
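
As an illustration, assuming the clustering is given as one cluster label per data point, the matrix can be built like this (the function name `incidence_matrix` is illustrative):

```python
def incidence_matrix(labels):
    """Entry [i][j] is 1 if points i and j share a cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]

for row in incidence_matrix([0, 0, 1]):
    print(row)
# [1, 1, 0]
# [1, 1, 0]
# [0, 0, 1]
```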

9
Q

What are the 3 types of validation? Define each…

A

External: Compares the clustering to externally supplied, labeled clusters. Uses the distance and incidence matrices.

Internal: Conducts assessment internally via cluster cohesion and separation metrics.

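Relative: Compares two or more clusterings against each other, for example results from different algorithms or different values of K.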

10
Q

Define the 2 measures of internal validation…

A

Cohesion: How closely related the data points within a cluster are.

Separation: How far apart the separate clusters are.
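
One common way to put numbers on these two ideas, assumed here for illustration, is the within-cluster sum of squares (cohesion, lower is tighter) and the between-cluster sum of squares (separation, higher is further apart). The function names below are hypothetical.

```python
import math

def mean_point(points):
    """Coordinate-wise mean of a list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def group_by_label(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters

def cohesion_sse(points, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    clusters = group_by_label(points, labels)
    return sum(math.dist(p, mean_point(members)) ** 2
               for members in clusters.values() for p in members)

def separation_bss(points, labels):
    """Size-weighted squared distance of each cluster centroid to the overall mean."""
    overall = mean_point(points)
    clusters = group_by_label(points, labels)
    return sum(len(members) * math.dist(mean_point(members), overall) ** 2
               for members in clusters.values())

pts = [(0, 0), (2, 0), (10, 0), (12, 0)]
labs = [0, 0, 1, 1]
print(cohesion_sse(pts, labs), separation_bss(pts, labs))  # 4.0 100.0
```

With these definitions, cohesion plus separation equals the total sum of squares around the overall mean (104 in this example), so improving one improves the other for a fixed data set.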

11
Q

What does the silhouette score measure? What is the best score?

A

How similar an object is to its own cluster compared with the other clusters. Measure it for every object in a cluster and plot the values on a bar chart to assess cluster cohesion and separation.

Best = 1, worst = -1
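
A minimal sketch of the per-object calculation, using the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the mean distance to the members of the nearest other cluster. The function name `silhouette_values` is illustrative, and the sketch assumes at least two clusters and no clusters made up entirely of identical points.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value in [-1, 1] for every point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values

pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(silhouette_values(pts, [0, 0, 1, 1]))  # good clustering: values near 1
print(silhouette_values(pts, [0, 1, 0, 1]))  # poor clustering: negative values
```

Averaging the values over all objects gives a single score for the whole clustering.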
