Cluster Analysis Flashcards
Is cluster analysis a supervised or an unsupervised learning method?
Unsupervised learning method
Contrast cluster analysis with PCA
PCA: looks for a low dimensional representation that explains most of the variance
Cluster Analysis: tries to group the observations into a small number of groups of similar observations
Give examples of application of cluster analysis
. Marketing
. Medical
. Actuarial modeling
Give two clustering methods
. K-means
. Hierarchical
Define K-means clustering
Group the observations in a data set into K disjoint clusters in which the observations are relatively homogenous.
We decide the number of cluster in advance, they are exhaustive (every observation belongs to one cluster) and mutually exclusive.
The clusters are selected to minimize the total dissimilarities between points within the cluster
What is the centroid of a cluster and how can you compute it?
Point whose coordinates are the means of the coordinates of the cluster
Formula: ( (sum(xi)/n ; sum(yi)/n )
What is the algorithm for K-means clustering?
- Split the observations arbitrarily into K clusters
- Calculate the centroids for each cluster
- Create new clusters by associating each points with the nearest centroid
- Repeat steps 2 and 3 until the clusters don’t change
What is the dissimilarity function for k-means clustering?
Twice the Euclidean distance of points from the centroid.
Assigning points to the closest centroid can only reduce the total dissimilarity.
Define hierarchical clustering
Consists of a series of fusions of observations results in bigger clusters containing smaller clusters containing smaller clusters.
Does not require to specify the number of clusters in advance
Describe bottom-up / agglomerative clustering
The algorithm starts with n clusters (1 for each observation) and iteratively fuses the two most similar clusters together until all points are in one cluster.
How do you decide how many clusters should be used in bottom-up clustering?
A dendrogram is produced. The number of clusters is determined by deciding at what height to cut the graph.
What is a dissimilarity measure?
A formula that measures how different two points are.
What is the most common dissimilarity measure?
Euclidean distance (square root of the sum of square differences between coordinates)
In what type of cluster analysis do we square the Euclidean distance?
K-means only
Why can’t we square the Euclidean distance
in hierarchical clustering, a dendrogram in which the height of each fusion is the dissimilarity measure. We don’t want to spoil the scale of the graph by squaring the distance.
How does the fusion algorithm work?
At each iteration, we compare every paire of clusters. If there is K clusters, we male k(k-1)/2 comparaisons.
We select the paire of clusters with the smallest dissimilarity measure and fuse them
A dissimilarity between two points is defined by the Euclidean distance (or the square of the Euclidean distance). How is the dissimilarity between two cluster defined?
The dissimilarity between two clusters is called linkage.
What are the different type of linkage?
- Complete
- Single
- Average
- Centroid
Define complete linkage
Calculate the dissimilarity between every points of cluster A and every points of cluster B (is there are a points in cluster A and b points in cluster b, do axb calculations). The dissimilarity is the maximum of these numbers.
Define single linkage
Calculate the dissimilarity between every points of cluster A and every points of cluster B (is there are a points in cluster A and b points in cluster b, do axb calculations). The dissimilarity is the minimum of these numbers.
This linkage leads to trailing clusters (clusters in which one point at a time is fused to a single cluster)
Define average linkage
Calculate the dissimilarity between every points of cluster A and every points of cluster B (is there are a points in cluster A and b points in cluster b, do axb calculations). The dissimilarity is the average of these numbers.
Define centroid linkage
Calculate the centroid of each clusters and use the dissimilarity between the centroids.
What is the disadvantage of centroid linkage?
Can lead to inversion : a later fusion occurs at a height lower than an earlier fusion - the dissimilarity of a later fusion is lower than the dissimilarity of an earlier fusion involving the same points.
True or false : the first link is the same regardless of linkage
True