Data Clustering Flashcards

(27 cards)

1
Q

What is K-means Clustering?

A

A method that seeks to partition the observations into a pre-specified number of clusters, K, with each observation assigned to exactly one cluster

2
Q

What does good K-means clustering look like?

A

When the observations are partitioned such that the total within-cluster variation, summed over all K clusters, is as small as possible

3
Q

How do we define the total within-cluster variation?

A

The sum of all the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster
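Written out (standard notation: x_ij is the value of feature j for observation i, p is the number of features, and |C_k| is the number of observations in cluster C_k), this definition reads:

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2
```

K-means then seeks the partition C_1, ..., C_K that minimizes the sum of W(C_k) over all K clusters.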

4
Q

What is Hierarchical Clustering?

A

An approach used when we don’t know in advance how many clusters we want; it produces a tree-like figure called a dendrogram, which lets us view at once the clusterings obtained for every possible number of clusters, from 1 to “n”

5
Q

What is a dendrogram?

A

A tree-like diagram built from the leaves up to the trunk: each observation starts as its own leaf, and the most similar clusters are fused step by step until everything joins at the trunk

6
Q

What does each leaf in a dendrogram represent?

A

One of the observations

7
Q

What does the height of a fusion of two observations indicate?

A

The dissimilarity between two observations

8
Q

Will observations fused at the bottom be similar or different?

A

Similar

9
Q

Will observations fused at the top be similar or different?

A

Different

10
Q

What does the height of the whole dendrogram represent?

A

The full height of the dendrogram spans every possible clustering: cutting it at heights from the top down to the bottom yields anywhere from 1 to n clusters

11
Q

What does the cut in the dendrogram represent?

A

The number of clusters chosen: each distinct branch below the horizontal cut becomes its own cluster

12
Q

What is correlation-based distance?

A

A measure that considers two observations to be similar if their features are highly correlated, even if their Euclidean distance is large. It focuses on the shapes of the observation profiles rather than their magnitudes.
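A small sketch of the contrast (NumPy; the vectors are hypothetical): two perfectly correlated profiles have a correlation-based distance near 0 even though their Euclidean distance is large.

```python
import numpy as np

# Two observation profiles with the same shape but different magnitudes
# (hypothetical values for illustration).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Correlation-based distance: 1 minus the Pearson correlation of the
# two feature profiles.
corr_dist = 1.0 - np.corrcoef(a, b)[0, 1]

# Euclidean distance for comparison.
eucl_dist = np.linalg.norm(a - b)
```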

13
Q

When can hierarchical clustering be worse than K-means?

A

When the true clusters are not nested within each other. Hierarchical clustering assumes a nested structure, so forcing that assumption on non-nested data can yield unrealistic and inaccurate results.

14
Q

Difference between K-means and hierarchical clustering?

A

For K-means we pre-specify the number of clusters; for hierarchical clustering we build a dendrogram and then decide how many clusters we want by choosing where to cut it

15
Q

Steps of K-means Clustering Algorithm:

A
  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments
  2. Iterate until the cluster assignments stop changing
    - For each of the K clusters, compute the cluster centroid
    - Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
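The steps above can be sketched in plain NumPy (a minimal illustration; the data and function name are hypothetical):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal sketch of the K-means algorithm described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation a cluster from 1 to K
    # (stored here as 0..K-1).
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Step 2a: compute each cluster's centroid (vector of feature means).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to the nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Two well-separated blobs of points (hypothetical data).
X = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
               np.random.default_rng(2).normal(5, 0.2, (20, 2))])
labels, centroids = kmeans(X, K=2)
```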
16
Q

What is the purpose of the elbow method and how do we identify?

A

To determine the optimal number of clusters: plot the total within-cluster sum of squares for a range of candidate values of K, and identify the elbow by picking the K at the major change in slope, after which adding more clusters gives little improvement.
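A sketch of the idea using SciPy's `scipy.cluster.vq.kmeans`, whose second return value is the mean distortion (the data are hypothetical); plotting the distortions against K and looking for the bend gives the elbow:

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)
# Three well-separated blobs (hypothetical data).
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 5.0, 10.0)])

# Mean distortion for each candidate K; the elbow is the K after which
# the curve flattens out (here the data were built with 3 true clusters).
distortions = [kmeans(X, k)[1] for k in range(1, 7)]
```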

17
Q

Difference between a dendrogram and decision tree

A

Dendrograms are unsupervised while decision trees are supervised

18
Q

What is complete linkage?

A

Maximal inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the largest of these dissimilarities

19
Q

What is single linkage?

A

Minimal inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the smallest of these dissimilarities

20
Q

What is average linkage?

A

Mean inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the average of these dissimilarities

21
Q

What is centroid linkage?

A

The dissimilarity between the centroids (mean vectors) of the two clusters
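All four linkages are available in SciPy's hierarchical clustering through the `method` argument of `scipy.cluster.hierarchy.linkage` (a sketch with hypothetical data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two small, well-separated groups of observations (hypothetical data).
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

# `method` selects how dissimilarity is extended from observations to clusters.
Z_complete = linkage(X, method="complete")  # largest pairwise distance
Z_single = linkage(X, method="single")      # smallest pairwise distance
Z_average = linkage(X, method="average")    # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centroids

# Cutting the complete-linkage dendrogram into two clusters
# recovers the two groups.
labels = fcluster(Z_complete, t=2, criterion="maxclust")
```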

22
Q

What two linkages are the most preferred and why?

A

Complete and average linkage because they yield more balanced dendrograms

23
Q

What two linkages are the least preferred and why?

A
  • Single because it can result in extended trailing clusters in which single observations are fused one at a time
  • Centroid because it can result in undesirable inversions which means the two clusters are fused at a height below either of the individual clusters in the dendrogram
24
Q

Describe the 3 main decisions that need to be made when performing hierarchical clustering:

A
  • The dissimilarity measure being used to determine the dissimilarity between observations
  • The linkage being used to expand the application of dissimilarity from individual observations to clusters of them
  • Placement of the horizontal cut in the dendrogram to determine the number of clusters
25
Q

Why is scaling important?

A

One variable may have a much larger effect on the dissimilarities than the others, which can make the EDA less informative overall. Scaling each variable by its standard deviation gives every variable equal weight, so no single variable has a skewed effect on the clustering.
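For example (NumPy, hypothetical values), dividing each column by its standard deviation puts the variables on a common scale:

```python
import numpy as np

# Hypothetical data: the second variable is on a much larger scale,
# so it would dominate Euclidean distances.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# Scale each variable by its standard deviation so all variables
# contribute comparably to the dissimilarity measure.
X_scaled = X / X.std(axis=0)
```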
26
Q

Why are clustering methods not always appropriate?

A

They do not consider or adjust for the presence of outliers, so the clusters obtained may be heavily distorted
27
Q

What are two ways of using clustering well?

A
  • Perform clustering with different sets of parameters, to track which patterns consistently emerge
  • Cluster subsets of the data to get a sense of the robustness of the clusters obtained