Clustering - k-means Flashcards

1
Q

Clustering Definition

A

Unsupervised learning method

Goal is to group similar observations in the same cluster and dissimilar observations in different clusters. There are many ways how dissimilarity can be defined, where Euclidean distance is a common one.

The original variables should be scaled prior to clustering (k-means) to prevent distortion from difference in scale

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

k-means Clustering - definition and steps

A

Algorithm seeks to minimize the total within-cluster variation, which is the pairwise dissimilarity between every pair of observations in each cluster. However, due to the initial cluster assignments being random, the algorithm can only guarantee a local minimum from one run. Repeated runs can help search for the global minimum.

Steps:
1. choose k, the number of clusters
2. Randomly assign a cluster to each observation
3. calculate the centroid of each cluster
4. for each observation, identify the closest centroid and reassign to that cluster
5. Repeat steps 3 and 4 until the cluster assignments stop changing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

k-means Clustering - notes

A

The elbow method can help determine an optimal value for k (# of clusters). It involves plotting the proportion of variance explained as a function of k and identifying where the plot resembles an elbow. The proportion is the ratio of the between sum of squares to the total sum of squares.
–When the marginal increase in this ratio is large, adding another cluster puts observations with materially different characteristics into different clusters. On the other hand, when the marginal increase in the ratio is small, adding another cluster could result in two or more clusters of observations with similar characteristics.

The total sum of squares refers to the total within-cluster variation when k=1.

The between sum of squares refers to the differences between the total sum of squares and the total within-cluster variation (at any k).

When select k = 2 -> we are saying that if we replace the two variables with a factor, we prefer a factor with 2 levels.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Scaled data example

A

Since the plot presents unscaled LONGITUDE and LATITUDE, which do not contribute very much to determining the clusters, the groupings look random. In plot B, all three variables are on the same scale, so LONGITUDE and LATITUDE are considered meaningfully in determining clusters. Therefore, we can see a clear pattern between LONGITUDE and LATITUDE in plot B.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly