W8: Cluster Analysis Flashcards

1
Q

Cluster analysis can be used to classify

A

individuals and separate them for further study

2
Q

In cluster analysis we find groups of (2)

A

similar individuals based on their covariate information
These groups are known as clusters

3
Q

Aim to extract a small number of clusters of individuals who share similar characteristics and who have

A

different characteristics than those in other clusters

4
Q

How can we measure the degree of similarity between individuals’ scores across a number of variables?

A

Using 2 measures: similarity coefficients and dissimilarity coefficients

5
Q

The correlation coefficient, r, is a measure of

A

similarity between 2 variables

6
Q

What does Pearson’s correlation coefficient tell us?

A

whether, as one variable changes, the other changes by a similar amount

7
Q

We could use the Pearson’s correlation coefficient, r, to work out the correlation between 2

A

individuals

9
Q

However, although the correlation tells us whether the pattern of responses between people is similar

It does not tell us anything about the

A

distance between 2 individual profiles
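
A small sketch of why correlation and distance can disagree. The two score profiles below are made up for illustration, and the helper names `pearson_r` and `euclidean` are hypothetical. Person B has the same pattern of responses as person A but scores 5 points higher on every variable, so the correlation is perfect while the profiles are still far apart.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def euclidean(x, y):
    """Geometric (straight-line) distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical scores on four variables (not from the course data)
person_a = [1, 2, 3, 4]
person_b = [6, 7, 8, 9]  # same shape as A, shifted up by 5 on every variable

r = pearson_r(person_a, person_b)  # identical pattern, so r is 1
d = euclidean(person_a, person_b)  # sqrt(4 * 5**2) = 10

print(f"r = {r:.2f}, distance = {d:.2f}")
```

So a high correlation alone would suggest A and B are alike, while the distance shows their profiles are well separated.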

10
Q

An alternative measure to Pearson’s correlation coefficient, r, is

A

Euclidean distance

11
Q

What is Euclidean distance?

A

Geometric distance between 2 individuals

12
Q

The Euclidean distance for individual i and j formula given below:

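A

$$d_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$

where x_ik is individual i’s score on variable k and n is the number of variables
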
13
Q

With Euclidean distance, the smaller the distance

A

the more similar the individuals

14
Q

Euclidean distances are heavily affected by

A

variables with large values (a large scale or spread)

15
Q

So if cases are being compared across variables that have different variances, then

A

Euclidean distances will be inaccurate

16
Q

If Euclidean distances would be inaccurate because the variables being compared have different variances, then we (2)

A

May standardise the scores by subtracting the mean of each variable and dividing by the SD

(value - mean)/SD
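
A rough sketch of the standardisation step and its effect on Euclidean distance. The salary and rank values are invented for illustration, `standardise` and `euclidean` are hypothetical helper names, and the sample SD (n - 1 denominator) is an assumption, since the card does not say which SD is meant.

```python
import math
import statistics


def standardise(values):
    """Convert raw scores to z-scores: (value - mean) / SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return [(v - mean) / sd for v in values]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical data for three individuals on two very different scales
salary = [20000, 30000, 40000]
rank = [1, 5, 3]

z_salary = standardise(salary)  # mean 30000, SD 10000
z_rank = standardise(rank)      # mean 3, SD 2

# Distance between individuals 0 and 1, before and after standardising
raw_01 = euclidean((salary[0], rank[0]), (salary[1], rank[1]))
std_01 = euclidean((z_salary[0], z_rank[0]), (z_salary[1], z_rank[1]))

print(f"raw distance = {raw_01:.2f}, standardised distance = {std_01:.2f}")
```

On the raw scores the distance is driven almost entirely by salary. After standardising, both variables contribute on the same scale.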

17
Q

How to calculate SD on your calculator? (8)

A

Mode
Statistics (2)
1
Input values
OPTN
1 variable (3)

18
Q

Most methods of grouping individuals based on similarity work in a hierarchical way (2)

A
  1. Begin with each individual treated as its own cluster
  2. At each subsequent stage, clusters are merged based on Euclidean distance
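
A minimal sketch of the merging procedure, assuming the agglomerative version in which every individual starts in a cluster of its own and the two closest clusters are merged at each stage. The function name `agglomerate`, the points and the choice of nearest neighbour distance between clusters are all assumptions made for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [[name] for name in points]  # stage 0: one cluster per individual
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbour distance between two clusters (one possible rule)
                d = min(
                    euclidean(points[p], points[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return merges


# Hypothetical individuals measured on two variables
points = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (4.0, 0.0), "D": (10.0, 0.0)}
merges = agglomerate(points)

for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The list of merges and the distances at which they happen is the information a dendrogram displays.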
19
Q

For example, when calculating Euclidean distances we first say

A

We calculate the Euclidean distance between the individuals

20
Q

In Euclidean distance, n represents

A

number of variables

21
Q

What is n here in calculating Euclidean distance?

22
Q

For the first pair of individuals, let’s say A and B, we write before applying the formula:

A

For A and B:

23
Q

At the end of the Euclidean distance calculation,
let’s say A and B = 3.26
A and C = 2.13
B and C = 1.23

Ending sentence:

A

So we would cluster B and C first, then A and C and finally A and B
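
The ordering in this answer can be checked with a short sketch that sorts the three pairs by their distance. The distances are the ones given in the question; the variable names are just for illustration.

```python
# Pairwise Euclidean distances from the card above
distances = {
    ("A", "B"): 3.26,
    ("A", "C"): 2.13,
    ("B", "C"): 1.23,
}

# Smallest distance first: the most similar pair is clustered first
merge_order = sorted(distances, key=distances.get)

for pair in merge_order:
    print(pair, distances[pair])
```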

24
Q

Consider 3 ways of merging clusters at each step of the method (3)

A
  1. Nearest neighbour
  2. Furthest neighbour
  3. Average linkage
25
What is nearest neighbour?
add individuals to clusters one at a time based on the lowest Euclidean distance to any member of the cluster.
26
What is furthest neighbour?
add individuals to clusters one at a time based on the lowest Euclidean distance to the furthest member of the cluster, so that the new individual is close to all members of the cluster.
27
What is average linkage?
add individuals to clusters one at a time based on the lowest average Euclidean distance to the cluster.
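A small sketch comparing the three rules, reusing the A, B, C distances from card 23 and assuming B and C have already been merged. The function name `linkage_distance` and the method labels are hypothetical. Each rule gives a different distance from A to the cluster {B, C}.

```python
# Pairwise Euclidean distances from card 23
distances = {("A", "B"): 3.26, ("A", "C"): 2.13, ("B", "C"): 1.23}


def dist(p, q):
    """Look up the distance between two individuals, in either order."""
    return distances[tuple(sorted((p, q)))]


def linkage_distance(individual, cluster, method):
    """Distance from one individual to an existing cluster under a given rule."""
    ds = [dist(individual, member) for member in cluster]
    if method == "nearest":
        return min(ds)            # closest member of the cluster
    if method == "furthest":
        return max(ds)            # most distant member of the cluster
    if method == "average":
        return sum(ds) / len(ds)  # mean distance to all members
    raise ValueError(f"unknown method: {method}")


cluster = ["B", "C"]
nn = linkage_distance("A", cluster, "nearest")    # 2.13 (distance to C)
fn = linkage_distance("A", cluster, "furthest")   # 3.26 (distance to B)
avg = linkage_distance("A", cluster, "average")   # about 2.695

print(nn, fn, round(avg, 3))
```

With only one candidate the choice of rule changes just the height at which A joins. With several candidates it can change which individual joins first.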
28
What is a dendrogram?
It shows how and when individuals are combined in the clustering algorithm
29
Interpret this dendrogram, which used nearest neighbour clustering (5)
We see three clusters forming: the first contains individuals 1, 4, 11, 7 and 13; the second contains individuals 10, 12, 9, 15 and 2; the third contains individuals 6, 8, 5, 3 and 14. The first and second clusters are more closely linked to each other than to the third cluster.
30
Interpret this table. We measured trait anxiety, depression, intrusive thoughts and impulsivity. Patients with the same disorder should report a similar pattern of scores. We asked 2 psychologists to agree on a diagnosis of GAD, depression or OCD.
We see an exact mapping between the 3 clusters and the diagnosis of the psychologists in spite of clustering algorithm having no knowledge of the diagnoses
31
Interpret the furthest neighbour (FN) dendrogram compared with the nearest neighbour (NN) dendrogram
We end up with same 3 clusters although individuals are combined in different order
32
Which of the methods provides a more useful dendrogram? Justify your answer (2)
The furthest neighbour clustering provides a more useful dendrogram. It forms a more natural set of clusters than the nearest neighbour clustering algorithm, which produces lots of clusters of size 1.
33
Using the dendrogram you selected: (a) At what distance should we cut the dendrogram? (b) What are the memberships of the resulting clusters? (5)
Using the dendrogram from the furthest neighbour clustering, we would cut the dendrogram somewhere between a distance of around 15 and 16. The resulting cluster memberships would be: Cluster a: Ernie, Carla, Christa, Ernest, Christopher, Beulah, Linette, Marie, Bo. Cluster b: Tony, Martina, Randolph, Raul, Catalina, Louis, Sunila, Johnson, Mickey. Cluster c: Rosalyn, Lawrence.
34
We have collected data on the salary, rank (from 1 to 5, where 5 is most senior), FTE (hours worked, where 1 is full time), number of articles published and years of experience of 20 university staff. (c) Discuss the cluster memberships in light of the summary statistics and plots (4)
From the boxplots and classification table we can see that: Cluster 1 is made up of individuals with a high salary, a large number of articles and many years of experience, all of whom are professors. It appears that cluster 1 is the most senior academics. This corresponds to cluster c above. Cluster 2 is made up of individuals with a lower salary, fewer articles and less experience than cluster 1. However, most are still professors. Therefore, cluster 2 is made up of a slightly lower level of senior academics. Cluster 3 is made up of individuals who have relatively low salaries, few articles and less experience, and who are not yet full professors. That is, cluster 3 is made up of early career academics.
35
In SPSS what does this table tell us?
This is the cluster membership table: it tells us which cluster each individual belongs to
36
In SPSS, how many clusters are there in this dendrogram?
3 clusters
37
In this R dendrogram, how many clusters are there?
3 clusters
38
Since variables are on the same scale there is no need to
standardise them
39
Interpret this dendrogram, which shows different aspects of disgust and was produced by furthest neighbour (FN) clustering (2)
Three broad clusters appear, which seem to be distinct types of disgust. We may want to further disaggregate into up to seven clusters.
40
Interpret this dendrogram, which shows different aspects of disgust and was produced by nearest neighbour (NN) clustering (2)
Clustering of bread, deer, crisp and foxes, but other clustering is hard to distinguish.
41
Which of the two clustering methods appears to provide the more interpretable clusters of the different aspects of disgust?
Furthest Neighbour
42
Data on 7 different measurements of 41 cities (c) What do you observe from the NN dendrogram?
The clustering is a little tricky to make out, although Chicago appears different to the other cities.
43
Now perform furthest neighbour clustering on the data. What conclusions can you make this time? Are they different to the nearest neighbour algorithm?
This time, two/three large clusters appear, plus a cluster with just Chicago.
44
Based on the furthest neighbour clustering, find the cluster membership for a 3 cluster solution. Are Wichita and St Louis in the same cluster? Which city is the “odd one out”?
Wichita is in cluster 1 and St Louis in cluster 2. Chicago is the odd one out.
45
Interpret this dendrogram. Is it similar to or different from the previous NN/FN results?
The results are different again. Chicago is clustered with Philadelphia this time.