Data Clustering Flashcards

(27 cards)

1
Q

What is K-means Clustering?

A

A method that seeks to partition the observations into a pre-specified number of clusters, K, with each observation assigned to exactly one cluster

2
Q

What does good K-means clustering look like?

A

When the observations are partitioned such that the total within-cluster variation, summed over all K clusters, is as small as possible

3
Q

How do we define the total within-cluster variation?

A

The sum of all the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster
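Written out (standard notation: x_ij is the value of feature j for observation i, p is the number of features, and |C_k| is the number of observations in cluster C_k), this definition reads:

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2
```

K-means then seeks the partition C_1, ..., C_K that minimizes the sum of W(C_k) over all K clusters.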

4
Q

What is Hierarchical Clustering?

A

An approach used when we don’t know in advance how many clusters we want; it produces a tree-like figure called a dendrogram, which lets us view at once the clusterings obtained for every possible number of clusters, from 1 to “n”

5
Q

What is a dendrogram?

A

A tree-like diagram built from the leaves up to the trunk: each observation starts as its own leaf, and the most similar clusters are fused step by step until everything joins at the trunk

6
Q

What does each leaf in a dendrogram represent?

A

One of the observations

7
Q

What does the height of a fusion of two observations indicate?

A

The dissimilarity between two observations

8
Q

Will observations fused at the bottom be similar or different?

A

Similar

9
Q

Will observations fused at the top be similar or different?

A

Different

10
Q

What does the height of the whole dendrogram represent?

A

The full height of the dendrogram spans every possible clustering: cutting it at heights from the top down to the bottom yields anywhere from 1 to n clusters

11
Q

What does the cut in the dendrogram represent?

A

The number of clusters chosen: each distinct branch below the horizontal cut becomes its own cluster

12
Q

What is correlation-based distance?

A

A measure that considers two observations to be similar if their features are highly correlated, even if their Euclidean distance is large. It focuses on the shapes of the observation profiles rather than their magnitudes.
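A small sketch of the contrast (NumPy; the vectors are hypothetical): two perfectly correlated profiles have a correlation-based distance near 0 even though their Euclidean distance is large.

```python
import numpy as np

# Two observation profiles with the same shape but different magnitudes
# (hypothetical values for illustration).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Correlation-based distance: 1 minus the Pearson correlation of the
# two feature profiles.
corr_dist = 1.0 - np.corrcoef(a, b)[0, 1]

# Euclidean distance for comparison.
eucl_dist = np.linalg.norm(a - b)
```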

13
Q

When can hierarchical clustering be worse than K-means?

A

When the true clusters are not nested within each other. Hierarchical clustering assumes a nested structure, so forcing that assumption on non-nested data can yield unrealistic and inaccurate results.

14
Q

Difference between K-means and hierarchical clustering?

A

For K-means we pre-specify the number of clusters; for hierarchical clustering we build a dendrogram and then decide how many clusters we want by choosing where to cut it

15
Q

Steps of K-means Clustering Algorithm:

A
  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments
  2. Iterate until the cluster assignments stop changing
    - For each of the K clusters, compute the cluster centroid
    - Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
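The steps above can be sketched in plain NumPy (a minimal illustration; the data and function name are hypothetical):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal sketch of the K-means algorithm described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation a cluster from 1 to K
    # (stored here as 0..K-1).
    labels = rng.integers(0, K, size=len(X))
    for _ in range(n_iter):
        # Step 2a: compute each cluster's centroid (vector of feature means).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to the nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Two well-separated blobs of points (hypothetical data).
X = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
               np.random.default_rng(2).normal(5, 0.2, (20, 2))])
labels, centroids = kmeans(X, K=2)
```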
16
Q

What is the purpose of the elbow method and how do we identify?

A

To determine the optimal number of clusters: plot the total within-cluster sum of squares for a range of candidate values of K, and identify the elbow by picking the K at the major change in slope, after which adding more clusters gives little improvement.
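A sketch of the idea using SciPy's `scipy.cluster.vq.kmeans`, whose second return value is the mean distortion (the data are hypothetical); plotting the distortions against K and looking for the bend gives the elbow:

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)
# Three well-separated blobs (hypothetical data).
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 5.0, 10.0)])

# Mean distortion for each candidate K; the elbow is the K after which
# the curve flattens out (here the data were built with 3 true clusters).
distortions = [kmeans(X, k)[1] for k in range(1, 7)]
```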

17
Q

Difference between a dendrogram and decision tree

A

Dendrograms are unsupervised while decision trees are supervised

18
Q

What is complete linkage?

A

Maximal inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the largest of these dissimilarities

19
Q

What is single linkage?

A

Minimal inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the smallest of these dissimilarities

20
Q

What is average linkage?

A

Mean inter-cluster dissimilarity: compute all the pairwise dissimilarities between the observations in one cluster and the observations in the other, and record the average of these dissimilarities

21
Q

What is centroid linkage?

A

The dissimilarity between the centroids (mean vectors) of the two clusters
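All four linkages are available in SciPy's hierarchical clustering through the `method` argument of `scipy.cluster.hierarchy.linkage` (a sketch with hypothetical data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two small, well-separated groups of observations (hypothetical data).
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

# `method` selects how dissimilarity is extended from observations to clusters.
Z_complete = linkage(X, method="complete")  # largest pairwise distance
Z_single = linkage(X, method="single")      # smallest pairwise distance
Z_average = linkage(X, method="average")    # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centroids

# Cutting the complete-linkage dendrogram into two clusters
# recovers the two groups.
labels = fcluster(Z_complete, t=2, criterion="maxclust")
```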

22
Q

What two linkages are the most preferred and why?

A

Complete and average linkage because they yield more balanced dendrograms

23
Q

What two linkages are the least preferred and why?

A
  • Single because it can result in extended trailing clusters in which single observations are fused one at a time
  • Centroid because it can result in undesirable inversions which means the two clusters are fused at a height below either of the individual clusters in the dendrogram
24
Q

Describe the 3 main decisions that need to be made when performing hierarchical clustering:

A
  • The dissimilarity measure being used to determine the dissimilarity between observations
  • The linkage being used to expand the application of dissimilarity from individual observations to clusters of them
  • Placement of the horizontal cut in the dendrogram to determine the number of clusters
25
Q

Why is scaling important?

A

One variable may have a much larger effect on the dissimilarities than the others, which can make the EDA less informative overall. Scaling each variable by its standard deviation gives every variable equal weight, so no single variable has a skewed effect on the clustering.
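For example (NumPy, hypothetical values), dividing each column by its standard deviation puts the variables on a common scale:

```python
import numpy as np

# Hypothetical data: the second variable is on a much larger scale,
# so it would dominate Euclidean distances.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# Scale each variable by its standard deviation so all variables
# contribute comparably to the dissimilarity measure.
X_scaled = X / X.std(axis=0)
```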
26
Q

Why are clustering methods not always appropriate?

A

They do not consider or adjust for the presence of outliers, so the clusters obtained may be heavily distorted
27
Q

What are two ways of using clustering well?

A
  • Perform clustering with different sets of parameters, to track which patterns consistently emerge
  • Cluster subsets of the data to get a sense of the robustness of the clusters obtained