Data Clustering pt. 1 Flashcards
(11 cards)
What are two unsupervised learning methods?
Data clustering and principal components analysis (PCA)
What is the goal of data clustering?
To partition the observations into distinct groups so that observations within each group are similar to one another and observations in different groups are different from each other
K-Means Clustering
Partitions the observations into a pre-specified number of clusters, K
Hierarchical Clustering
Used when we don’t know in advance how many clusters we want. It produces a tree-like figure of the observations (a dendrogram), which lets us see the clusters obtained for each possible number of clusters, from 1 to n
Properties of observations in K-Means Clustering
- Each observation belongs to at least one of the K clusters
- The clusters are non-overlapping: no observation belongs to more than one cluster
What does good clustering look like?
A good clustering is one in which the total within-cluster variation, summed over all K clusters, is as small as possible
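As a rough illustration of "total within-cluster variation", the sketch below scores a clustering by the sum of squared Euclidean distances from each observation to its cluster's centroid. That particular measure, the function name `within_cluster_ss`, and the toy data are assumptions made for this example; the card itself does not say which measure of variation is used.

```python
def within_cluster_ss(points, labels):
    """Sum, over all clusters, of squared distances to the cluster centroid."""
    # Group the observations by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        total += sum((p[j] - centroid[j]) ** 2 for p in members for j in range(dim))
    return total


# Hypothetical toy data: two tight pairs that sit far apart.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
good = within_cluster_ss(points, [0, 0, 1, 1])  # neighbours grouped together
bad = within_cluster_ss(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

The grouping that keeps neighbours together gets the much smaller score, which is what "good clustering" means on this card.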
What method is used to find the optimal number of clusters for K-means?
The “elbow” method
Explain the elbow method
- Run K-means clustering for each value of K from 1 up to some large number
- For each K, calculate the total within-cluster sum of squares and plot it against K
- Look for the point where the slope changes from steep to shallow (the “elbow”); that K is taken as the optimal number of clusters
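The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`sq_dist`, `farthest_first`, `kmeans_wcss`), the toy data, and the deterministic farthest-first starting centroids are all assumptions chosen to keep the example short and reproducible. In practice K-means is usually run from several random starts, and the values are plotted rather than printed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Assumed start: begin at the first point, then keep adding the point
    that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans_wcss(points, k, max_iter=100):
    """Run a basic K-means and return the total within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(sq_dist(p, centroids[i]) for i, g in enumerate(groups) for p in g)


# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.4, 0.4),
    (5.0, 5.0), (5.3, 4.8), (4.9, 5.4), (5.2, 5.1),
    (10.0, 0.0), (10.4, 0.3), (9.8, 0.5), (10.1, 0.1),
]

# Steps 1 and 2: run K-means for K = 1..6 and record the sum of squares.
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 7)}

# Step 3: inspect how quickly the value falls as K grows.
for k, wcss in wcss_by_k.items():
    print(f"K={k}: within-cluster sum of squares = {wcss:.2f}")
```

On data like this the value should fall steeply up to K=3 and only slightly afterwards, so the elbow would be read off at K=3.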
How do we interpret a dendrogram?
- Each leaf of the dendrogram represents one of the observations
- The height at which two observations fuse (measured on the vertical axis) indicates how different the two observations are
- Observations that fuse near the bottom of the tree are similar to each other, while observations that fuse near the top are quite different
How do we identify clusters on the basis of a dendrogram?
Make a horizontal cut across the dendrogram; the distinct sets of observations beneath the cut can be interpreted as clusters. Look for clusters with the longest branches
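Cutting a dendrogram can be mimicked without drawing it: fuse the closest clusters from the bottom up and stop as soon as the next fusion would happen above the cut height. The sketch below assumes complete linkage (the largest pairwise distance between two clusters) and Euclidean distance, neither of which is specified on the card, and the names `complete_linkage` and `cut_dendrogram` are made up for this example.

```python
def complete_linkage(a, b):
    """Dissimilarity between two clusters: the largest pairwise Euclidean distance."""
    return max(
        sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
        for p in a
        for q in b
    )


def cut_dendrogram(points, height):
    """Return the clusters found beneath a horizontal cut at `height`."""
    clusters = [[p] for p in points]  # every leaf starts as its own cluster
    while len(clusters) > 1:
        # Find the pair of clusters with the lowest fusion height.
        h, i, j = min(
            (complete_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if h > height:
            break  # this fusion (and every later one) lies above the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical one-dimensional observations.
points = [(1.0,), (1.5,), (2.0,), (8.0,), (8.4,), (15.0,)]

low_cut = cut_dendrogram(points, height=3.0)
high_cut = cut_dendrogram(points, height=8.0)
print(len(low_cut), low_cut)
print(len(high_cut), high_cut)
```

Raising the cut height merges more of the tree, so the same dendrogram gives three clusters at the lower cut and two at the higher one.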
How can hierarchical clustering sometimes yield worse results?
When the groups in the data are not truly nested. Hierarchical clustering assumes that the clusters obtained by cutting at one height are nested within the clusters obtained by cutting at a greater height, which can be unrealistic