Data Clustering Flashcards
(27 cards)
What is K-means Clustering?
A clustering method in which we partition the observations into a pre-specified number of clusters, K
What does good K-means clustering look like?
A partition in which the total within-cluster variation, summed over all K clusters, is as small as possible
How do we define the total within-cluster variation?
The sum of all the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster
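In symbols (a standard way to write this; here C_k is the set of indices of the observations in the kth cluster, |C_k| its size, and p the number of features):

```latex
W(C_k) \;=\; \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \bigl(x_{ij} - x_{i'j}\bigr)^2,
\qquad
\text{and K-means seeks}\quad
\min_{C_1, \ldots, C_K} \; \sum_{k=1}^{K} W(C_k).
```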
What is Hierarchical Clustering?
An approach used when we don't know in advance how many clusters we want; it produces a tree-like figure called a dendrogram, which lets us view at once the clusterings obtained for every possible number of clusters, from 1 to n
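A minimal sketch of building and viewing a dendrogram with SciPy (the toy two-group data and the choice of complete linkage are illustrative assumptions, not part of the card):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Toy data: two loose groups of 2-D observations
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Build the full merge tree (here with complete linkage)
Z = linkage(X, method="complete", metric="euclidean")

# Each leaf is one observation; fusion height = dissimilarity at which clusters merge
dendrogram(Z)
plt.ylabel("Dissimilarity (fusion height)")
plt.show()
```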
What is a dendrogram?
A tree-like diagram that starts from the leaves (the individual observations) and fuses clusters together moving up to the trunk
What does each leaf in a dendrogram represent?
One of the observations
What does the height of a fusion of two observations indicate?
The dissimilarity between two observations
Will observations fused at the bottom be similar or different?
Similar
Will observations fused at the top be similar or different?
Different
What does the height of the whole dendrogram represent?
The full height of the dendrogram is the dissimilarity at which all observations fuse into a single cluster; cutting anywhere along that height yields between 1 and n clusters, so the vertical range spans every possible number of clusters
What does the cut in the dendrogram represent?
The number of clusters chosen
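A small sketch of making that cut with SciPy's `fcluster` (asking for a fixed number of clusters; cutting at a height threshold with criterion="distance" is the other common option). The data and the choice of 2 clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
Z = linkage(X, method="complete")

# "Cut" the dendrogram so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label (1 or 2) for each observation
```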
What is correlation-based distance?
A distance measure that considers two observations to be similar if their features are highly correlated, even if their Euclidean distance is large. It focuses on the shapes of the observation profiles rather than their magnitudes.
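A small sketch of the idea, using one minus the Pearson correlation between two observations' feature profiles as the distance (the toy vectors are made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Observations with the same "shape" but very different magnitudes
X = np.array([
    [1.0, 2.0, 3.0, 4.0],      # observation A
    [10.0, 20.0, 30.0, 40.0],  # observation B: A scaled by 10
    [4.0, 3.0, 2.0, 1.0],      # observation C: reversed shape
])

eucl = squareform(pdist(X, metric="euclidean"))
corr = squareform(pdist(X, metric="correlation"))  # 1 - Pearson correlation

print(eucl[0, 1], corr[0, 1])  # A vs B: far in Euclidean terms, correlation distance 0
print(eucl[0, 2], corr[0, 2])  # A vs C: closer in Euclidean terms, correlation distance 2
```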
When can hierarchical clustering be worse than K-means?
When the true clusters are not nested within one another; hierarchical clustering assumes a nested structure, so imposing it on such data can yield unrealistic and less accurate results. For example, if the best split into two groups is by gender but the best split into three groups is by nationality, the three groups are not nested within the two.
Difference between K-means and hierarchical clustering?
K-means requires a pre-specified number of clusters, whereas hierarchical clustering builds a dendrogram and we choose the number of clusters by where we cut it
Steps of K-means Clustering Algorithm:
- Randomly assign a number, from 1 to K, to each of the observations; these serve as the initial cluster assignments
- Iterate until the cluster assignments stop changing:
  - For each of the K clusters, compute the cluster centroid
  - Assign each observation to the cluster whose centroid is the closest (where closest is defined using Euclidean distance)
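A minimal NumPy sketch of these steps (the random data and K=3 are illustrative assumptions; tie-breaking and empty clusters are ignored for brevity):

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), max_iter=100):
    n = X.shape[0]
    # Step 1: randomly assign each observation a cluster label (0..K-1)
    labels = rng.integers(0, K, size=n)
    for _ in range(max_iter):
        # Step 2a: compute the centroid (feature-wise mean) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

X = np.random.default_rng(1).normal(size=(60, 2))
labels, centroids = kmeans(X, K=3)
```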
What is the purpose of the elbow method and how do we identify?
To determine a good number of clusters: we run K-means for a range of values of K, compute the total within-cluster sum of squares for each, and identify the "elbow" by picking the K at the major change in slope, beyond which adding more clusters gives little improvement.
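A sketch of the elbow plot using scikit-learn's KMeans, whose `inertia_` attribute is the total within-cluster sum of squares (the three-cluster toy data and the range of K are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # 3 true clusters

ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Total within-cluster sum of squares")
plt.show()  # look for the K where the slope changes most sharply (the "elbow")
```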
Difference between a dendrogram and decision tree
Dendrograms are unsupervised while decision trees are supervised
What is complete linkage?
Maximal inter-cluster dissimilarity: we compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the largest of these dissimilarities
What is single linkage?
Minimal inter-cluster dissimilarity: we compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the smallest of these dissimilarities
What is average linkage?
Mean inter-cluster dissimilarity: we compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the average of these dissimilarities
What is centroid linkage?
The dissimilarity between the centroids of the two clusters
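In SciPy's hierarchical clustering these four rules correspond to the `method` argument of `linkage` (a small comparative sketch; the toy data are an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))

# Same data, four different rules for measuring cluster-to-cluster dissimilarity
Z_complete = linkage(X, method="complete")  # largest pairwise dissimilarity
Z_single   = linkage(X, method="single")    # smallest pairwise dissimilarity
Z_average  = linkage(X, method="average")   # mean pairwise dissimilarity
Z_centroid = linkage(X, method="centroid")  # distance between cluster centroids
```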
What two linkages are the most preferred and why?
Complete and average linkage because they yield more balanced dendrograms
What two linkages are the least preferred and why?
- Single because it can result in extended trailing clusters in which single observations are fused one at a time
- Centroid because it can result in undesirable inversions, in which two clusters are fused at a height below either of the individual clusters in the dendrogram
Describe the 3 main decisions that need to be made when performing hierarchical clustering:
- The dissimilarity measure used to determine the dissimilarity between observations
- The linkage used to extend the dissimilarity measure from individual observations to clusters of observations
- Where to place the horizontal cut in the dendrogram, which determines the number of clusters