Week 13: More Advanced Methods - Cluster Analysis Flashcards

1
Q

What is an exploratory data analysis tool for organizing observed data into meaningful clusters, based on combinations of variables?

A

Cluster analysis

2
Q

Examples of when to look at grouping (cluster) patterns:

A
  1. A PT practitioner would like to group patients according to their attributes in order to better treat them with personalized care plans
  2. A PT practitioner would like to classify patients based on their individual health records in order to develop specific management strategies appropriate to each patient
3
Q

Hierarchical clustering -

A
  • a set of nested clusters organized using a hierarchical tree
  • the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities
4
Q

Non-hierarchical clustering -

A
  • a grouping of individuals into non-overlapping subsets (clusters) such that each object is in exactly one cluster
  • divides a dataset of n individuals into m clusters
5
Q

What is the most commonly used non-hierarchical technique?

A

K-means clustering

6
Q

What are 3 types of clustering techniques?

A
  1. Hierarchical clustering
  2. K-means clustering
  3. Two-step clustering
7
Q

Bottom-up or agglomerative hierarchical clustering -

A

starts with each single data point as its own cluster and then merges it with others to form larger groups

8
Q

Top-down or divisive hierarchical clustering -

A

starts with all the data in one group and then partitions the data step by step using a flat clustering algorithm

9
Q

Agglomerative hierarchical clustering procedure:

A

Step 1: Assign each item to its own cluster
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer
Step 3: Compute distances (similarities) between the new cluster and each of the old clusters
Step 4: Repeat Steps 2 and 3 until all items are merged into a single cluster containing the entire original sample
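The four steps can be sketched in a few lines of Python. This is a minimal illustration run on 1-D toy data with single linkage; the data values and the linkage choice are assumptions for the example, not part of the card.

```python
# Agglomerative hierarchical clustering sketch (1-D toy data,
# single linkage); data and linkage choice are illustrative.

def single_linkage(c1, c2):
    """Cluster distance = smallest gap between any two members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points):
    """Repeatedly merge the closest pair of clusters until one remains."""
    clusters = [[p] for p in points]          # Step 1: each item is a cluster
    history = []
    while len(clusters) > 1:
        # Step 2: find the closest (most similar) pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]    # ...and merge them
        # Step 3: drop the two old clusters; distances to the new cluster
        # are recomputed on the next pass through the loop (Step 4)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append([sorted(c) for c in clusters])
    return history

history = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0])
# The first merge joins the closest pair (5.0 and 5.1); the last entry
# is a single cluster holding the entire sample.
```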

10
Q

3 Limitations of hierarchical clustering:

1. Arbitrary decisions

A
  • it is necessary to specify both the distance metric and the linkage criterion without any strong theoretical basis for such decisions
11
Q

3 Limitations of hierarchical clustering:

2. Data types

A
  • it works well only with continuous data
12
Q

3 Limitations of hierarchical clustering:

3. Misinterpretation of dendrogram

A
  • selecting the number of clusters from the dendrogram may be misleading
13
Q

K-means clustering -

A

a clustering algorithm in which data are classified into K clusters, mapping each data point into the cluster with the nearest mean; this is the most widely used clustering method

14
Q

Procedure for k-means clustering:

A

Step 1: Select K points as the initial centroids
Step 2: Assign points to different centroids based on proximity
Step 3: Re-evaluate the centroid of each group
Step 4: Repeat Steps 2 and 3 until the best solution emerges (the centers are stable)
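The steps above can be sketched in pure Python on 1-D toy data. The data points, K = 2, and the starting centroids are illustrative assumptions.

```python
# k-means sketch: alternate assignment (Step 2) and centroid
# re-evaluation (Step 3) until the centers stabilize (Step 4).
# Toy data and starting centroids are illustrative.

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until centers are stable."""
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Step 3: re-evaluate each centroid as the mean of its group
        new_centroids = [sum(g) / len(g) if g else c
                         for g, c in zip(groups, centroids)]
        if new_centroids == centroids:        # Step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, groups

centroids, groups = kmeans([1.0, 1.2, 0.8, 7.9, 8.1, 8.0],
                           centroids=[0.0, 10.0])
# Converges to centroids near 1.0 and 8.0.
```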

15
Q

Limitations of k-means clustering:

A
  • K-means is subjective:
    1. The researcher chooses the number of clusters
    2. The more clusters (larger K), the shorter the distances from the centroids
    3. As an extreme scenario: when every data point is its own centroid, the distance is zero, but such a solution is useless
    4. So what is the optimal K?
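The "more clusters, shorter distance" point can be sketched numerically: the total within-cluster distance shrinks as K grows, reaching zero when every point is its own centroid. The toy data and centroid placements below are illustrative assumptions.

```python
# Illustrating why a smaller within-cluster distance alone cannot
# pick K: the fit always improves as K grows. Toy data assumed.

def total_distance(points, centroids):
    """Sum of each point's distance to its nearest centroid."""
    return sum(min(abs(p - c) for c in centroids) for p in points)

points = [1.0, 2.0, 8.0, 9.0]
d_k2 = total_distance(points, [1.5, 8.5])  # K = 2: a sensible grouping
d_k4 = total_distance(points, points)      # K = n: every point a centroid
# d_k2 == 2.0, while d_k4 == 0.0: a perfect but useless fit.
```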
16
Q

What is two-step clustering?

A

a hybrid approach in which a pre-clustering step is run first and a hierarchical method is run second (hence the name "two-step")

17
Q

What 3 features differentiate two-step clustering from traditional clustering techniques?

A
  1. the ability to create clusters based on both categorical and continuous variables
  2. automatic selection of the number of clusters
  3. the ability to analyze large data sets efficiently
18
Q

Procedure of two-step clustering:

A

Step 1: A sequential approach is used to pre-cluster the cases by condensing the variables (pre-clustering)
Step 2: The pre-clusters are statistically merged into the desired number of clusters (clustering)

19
Q

What 2 limitations can two-step clustering overcome?

A
  1. It can take both continuous and categorical data
  2. There is no need to specify the number of clusters a priori, because it uses fit indexes (AIC or BIC) to compare each cluster solution and determine which number of clusters is best
20
Q

Cluster quality validation index :

Silhouette coefficient -

A

it measures how well each individual data point is clustered and estimates the average distance between clusters

21
Q

Cluster quality validation index :

Silhouette plot -

A

it displays a measure of how close each point in one cluster is to points in the neighboring clusters

22
Q

Interpretation with Silhouette coefficient:

Large Silhouette coefficient value of almost 1 -

A

very well clustered

23
Q

Interpretation with Silhouette coefficient:

negative Silhouette coefficient value -

A

probably placed in the wrong cluster

24
Q

Interpretation with Silhouette coefficient:

small Silhouette coefficient value of around 0 -

A

lies between two clusters

25
Q

Silhouette value of 0.5-1 =

A

Good

26
Q

Silhouette value of .2-.5 =

A

Fair

27
Q

Silhouette value of -1 to .2 =

A

Poor
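The definition and interpretation bands above can be sketched for a single point as s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster (cohesion) and b is the mean distance to the nearest other cluster (separation). The 1-D toy clusters are illustrative assumptions.

```python
# Silhouette coefficient sketch for one point: s = (b - a) / max(a, b).
# Toy 1-D clusters assumed for illustration.

def mean_distance(p, cluster):
    """Mean distance from p to the other members of a cluster."""
    others = [q for q in cluster if q != p]
    return sum(abs(p - q) for q in others) / len(others)

def silhouette(p, own_cluster, other_clusters):
    a = mean_distance(p, own_cluster)                 # cohesion
    b = min(sum(abs(p - q) for q in c) / len(c)       # separation: nearest
            for c in other_clusters)                  # neighboring cluster
    return (b - a) / max(a, b)

s = silhouette(1.0, [1.0, 1.2, 0.8], [[8.0, 8.2, 7.8]])
# a = 0.2 and b = 7.0, so s = 6.8 / 7.0, about 0.97: very well
# clustered ("Good" on the 0.5-1 band).
```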

28
Q

What does the application of cluster analysis involve?

A

grouping similar cases into homogenous groups (called clusters) when the grouping is not previously known

29
Q

With hierarchical clustering -

A

the clustering is mapped into a hierarchy basing its grouping on the inter-cluster similarities or dissimilarities

30
Q

With k-means clustering -

A

data is classified into K clusters, mapping each data point into the cluster with the nearest mean

31
Q

With two-step clustering -

A

a sequential approach is first used to pre-cluster the cases, and second the pre-clusters are statistically merged into the desired number of clusters

32
Q

Why might two-step clustering be a better choice than hierarchical or k-means clustering?

A

two-step clustering can work with categorical data, and it is not bound to an arbitrary choice of the number of clusters