Data Clustering pt. 1 Flashcards

Question 1

Q

What are two unsupervised learning methods?

Answer

A

Data Clustering and Principal Component Analysis (PCA)

Question 2

Q

What is the goal of data clustering?

Answer

A

To partition the observations into distinct groups so that observations within each group are similar to one another, while observations in different groups are dissimilar

Question 3

Q

K-Means Clustering

Answer

A

Partitions the observations into a pre-specified number of clusters, K

Question 4

Q

Hierarchical Clustering

Answer

A

We do not need to know in advance how many clusters we want. The result is a tree-like diagram (dendrogram) of the observations, which shows the clusters obtained for each possible number of clusters, from 1 to n

Question 5

Q

Properties of observations in K-Means Clustering

Answer

A

Each observation belongs to at least one of the K clusters
The clusters are non-overlapping: no observation belongs to more than one cluster

Question 6

Q

What does good clustering look like?

Answer

A

Good clustering is when the total within-cluster variation summed over all clusters is minimized
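
As a rough illustration, the sketch below scores two candidate clusterings of the same four points. It assumes one common definition of within-cluster variation, the sum of squared Euclidean distances from each observation to its cluster mean; the function name `within_cluster_ss` and the toy data are hypothetical.

```python
def within_cluster_ss(points, labels):
    """Total within-cluster sum of squared distances to each cluster mean.

    Assumption: squared Euclidean distance to the cluster mean is used as
    the measure of within-cluster variation.
    """
    # Group the observations by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Cluster mean, one coordinate at a time.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        # Add this cluster's squared deviations from its mean.
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


# Toy data (hypothetical): two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_ss(points, [0, 0, 1, 1])  # groups the nearby points together
bad = within_cluster_ss(points, [0, 1, 0, 1])   # groups the distant points together
print(good, bad)
```

The clustering that keeps nearby points together gets the smaller total, which is the quantity K-means tries to minimize.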

Question 7

Q

What method is used to find the optimal number of clusters for K-means?

Answer

A

The “ELBOW” method

Question 8

Q

Explain the ELBOW method.

Answer

A

Run K-means clustering for each value of K from 1 to some large number
Calculate the total within-cluster sum of squares for each K and plot it against K
Look for the point where the slope changes from steep to shallow (the elbow); that K is taken as the optimal number of clusters
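
The steps above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: it assumes a deterministic farthest-first seeding (real implementations usually use random restarts or k-means++), and `pick_elbow` is a crude hypothetical heuristic that stands in for reading the elbow off a plot.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    # Assumption: deterministic seeding so the sketch is reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed is the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_wcss(points, k, max_iter=100):
    """Run K-means and return the total within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def pick_elbow(wcss):
    """Hypothetical heuristic: the K where the drop in WCSS flattens the most."""
    ks = sorted(wcss)
    return max(
        ks[1:-1],
        key=lambda k: (wcss[k - 1] - wcss[k]) - (wcss[k] - wcss[k + 1]),
    )


# Toy data (hypothetical): three well-separated groups of four points each.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
elbow_k = pick_elbow(wcss)
for k in sorted(wcss):
    print(k, round(wcss[k], 2))
print("elbow at K =", elbow_k)
```

On this toy data the sum of squares falls sharply up to K = 3 and barely changes afterwards, so the heuristic picks K = 3. In practice the curve is usually plotted and the elbow judged by eye.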

Question 9

Q

How do we interpret a dendrogram?

Answer

A

Each leaf of the dendrogram represents one of the observations
The height at which two observations fuse (measured on the vertical axis) indicates how different the two observations are
Observations that fuse near the bottom of the tree are similar to each other, while observations that fuse near the top are quite different

Question 10

Q

How do we identify clusters on the basis of a dendrogram?

Answer

A

Make a horizontal cut across the dendrogram; the distinct sets of observations beneath the cut can be interpreted as clusters. A cut that crosses the longest vertical branches gives the most clearly separated clusters
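
Cutting a dendrogram can be sketched with a small bottom-up (agglomerative) clustering in plain Python. The card does not name a linkage, so complete linkage is an assumption here, and the helper names `agglomerate` and `cut` and the toy data are hypothetical.

```python
import math


def complete_linkage(a, b):
    """Assumed linkage: largest pairwise distance between two clusters."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Return the merge history as (height, cluster_a, cluster_b), bottom-up."""
    clusters = [[p] for p in points]  # every observation starts as a leaf
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = complete_linkage(clusters[i], clusters[j])
        merges.append((height, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


def cut(points, merges, height):
    """Clusters that exist just below a horizontal cut at `height`."""
    clusters = [[p] for p in points]
    for h, a, b in merges:
        if h > height:
            break  # every later fusion happens above the cut
        clusters = [c for c in clusters if c != a and c != b] + [a + b]
    return clusters


# Toy data (hypothetical): two close pairs and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0), (20.0, 0.0)]
merges = agglomerate(points)
heights = [h for h, _, _ in merges]
three = cut(points, merges, 3.0)   # cut between the fusions at heights 1 and 6
two = cut(points, merges, 10.0)    # cut between the fusions at heights 6 and 20
print(heights)
print(three)
print(two)
```

Moving the cut up or down changes the number of clusters without rerunning the clustering, which is the practical appeal of the dendrogram.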

Question 11

Q

How can hierarchical clustering yield worse results sometimes?

Answer

A

When the true groups in the data are not nested within one another. Cutting the dendrogram assumes that the clusters at one level are formed by splitting or merging the clusters at another level, and this assumption can be unrealistic

(11 cards)