Cluster Analysis Flashcards by Camille Leduc-Brunet

Is cluster analysis a supervised or an unsupervised learning method?

Unsupervised learning method

How well did you know this?

Not at all

Perfectly

Contrast cluster analysis with PCA

PCA: looks for a low dimensional representation that explains most of the variance
Cluster Analysis: tries to group the observations into a small number of groups of similar observations

How well did you know this?

Not at all

Perfectly

Give examples of application of cluster analysis

. Marketing
. Medical
. Actuarial modeling

How well did you know this?

Not at all

Perfectly

Give two clustering methods

. K-means

. Hierarchical

How well did you know this?

Not at all

Perfectly

Define K-means clustering

Group the observations in a data set into K disjoint clusters in which the observations are relatively homogenous.
We decide the number of cluster in advance, they are exhaustive (every observation belongs to one cluster) and mutually exclusive.
The clusters are selected to minimize the total dissimilarities between points within the cluster

How well did you know this?

Not at all

Perfectly

What is the centroid of a cluster and how can you compute it?

Point whose coordinates are the means of the coordinates of the cluster
Formula: ( (sum(xi)/n ; sum(yi)/n )

How well did you know this?

Not at all

Perfectly

What is the algorithm for K-means clustering?

Split the observations arbitrarily into K clusters
Calculate the centroids for each cluster
Create new clusters by associating each points with the nearest centroid
Repeat steps 2 and 3 until the clusters don’t change

How well did you know this?

Not at all

Perfectly

What is the dissimilarity function for k-means clustering?

Twice the Euclidean distance of points from the centroid.

Assigning points to the closest centroid can only reduce the total dissimilarity.

How well did you know this?

Not at all

Perfectly

Define hierarchical clustering

Consists of a series of fusions of observations results in bigger clusters containing smaller clusters containing smaller clusters.
Does not require to specify the number of clusters in advance

How well did you know this?

Not at all

Perfectly

Describe bottom-up / agglomerative clustering

The algorithm starts with n clusters (1 for each observation) and iteratively fuses the two most similar clusters together until all points are in one cluster.

How well did you know this?

Not at all

Perfectly

How do you decide how many clusters should be used in bottom-up clustering?

A dendrogram is produced. The number of clusters is determined by deciding at what height to cut the graph.

How well did you know this?

Not at all

Perfectly

What is a dissimilarity measure?

A formula that measures how different two points are.

How well did you know this?

Not at all

Perfectly

What is the most common dissimilarity measure?

Euclidean distance (square root of the sum of square differences between coordinates)

How well did you know this?

Not at all

Perfectly

In what type of cluster analysis do we square the Euclidean distance?

K-means only

How well did you know this?

Not at all

Perfectly

Why can’t we square the Euclidean distance

in hierarchical clustering, a dendrogram in which the height of each fusion is the dissimilarity measure. We don’t want to spoil the scale of the graph by squaring the distance.

How well did you know this?

Not at all

Perfectly

How does the fusion algorithm work?

Study These Flashcards

At each iteration, we compare every paire of clusters. If there is K clusters, we male k(k-1)/2 comparaisons.
We select the paire of clusters with the smallest dissimilarity measure and fuse them

A dissimilarity between two points is defined by the Euclidean distance (or the square of the Euclidean distance). How is the dissimilarity between two cluster defined?

Study These Flashcards

The dissimilarity between two clusters is called linkage.

What are the different type of linkage?

Study These Flashcards

Complete
Single
Average
Centroid

Define complete linkage

Study These Flashcards

Define single linkage

Study These Flashcards

Calculate the dissimilarity between every points of cluster A and every points of cluster B (is there are a points in cluster A and b points in cluster b, do axb calculations). The dissimilarity is the minimum of these numbers.
This linkage leads to trailing clusters (clusters in which one point at a time is fused to a single cluster)

Define average linkage

Study These Flashcards

Define centroid linkage

Study These Flashcards

Calculate the centroid of each clusters and use the dissimilarity between the centroids.

What is the disadvantage of centroid linkage?

Study These Flashcards

Can lead to inversion : a later fusion occurs at a height lower than an earlier fusion - the dissimilarity of a later fusion is lower than the dissimilarity of an earlier fusion involving the same points.

True or false : the first link is the same regardless of linkage

Study These Flashcards

True

Why does complete linkage usually prefer to fuse smaller cluster together first?

The more points, the higher the maximum tends to be and the fusion is based on the minimum of the measure used.

Why does single linkage usually prefer to fuse groups with greater number of observations first?

Opposite to complete linkage: the more observation there is in a cluster, the lower the minimum tends to be.

Dose average linkage have a bias towards linking smaller of bigger groups first?

What is the issue in hierarchical clustering?

It assumes that there is a hierarchy in the variables. If they are unrelated, this may not be a valid assumption. K-means clustering will be superior in those cases.

What is another dissimilarity measure that can be used in hierarchical clustering?

Correlation

Why does it take at least 3 features to use correlation as a dissimilarity measure?

Any 2 numbers are perfectly correlated with any other 2 numbers.

Scale of variable is important. What action can be taken to ensure scale is considered?

One may standardize all variables to have standard deviation of 1

Is clustering robust?

No. A small deviation in data can result in different clusters.

True or false: one may perform clustering analysis on PCA score vectors instead of on the original data.

True

Cluster Analysis Flashcards

(33 cards)