Multi-dimensional Scaling and Cluster Analysis Flashcards
why are we covering both CA and MD techniques in one class?
because they are complementary to each other
can merge the two techniques and use them together
what are CA and MDS?
Two types of exploratory techniques
- help us to understand and locate structure and relationships in the data
- groups objects together based on their characteristics
- looks for patterns of information
what's the difference between factor analysis and cluster analysis/MDS
FA
- starts with individual variables and reduces these into dimensions or factors
- different ways to run factor analysis - look at the correlation structure and try to reduce it using the factor loadings
- interpret what the dimensions are based on how individual variables load on these factors
cluster analysis/MDS
- start again with individual variables
- then determine which ones go together
difference - we don't extract dimensions; instead we just try to determine which variables in the dataset go together. This is something YOU do: you aren't presented with extracted factors, you're only presented with patterns of how things might go together, and then you decide which go together.
in which discipline would you use cluster analysis
Used in almost every discipline: psychology, neuroscience, biology, etc.
sometimes we need to sort variables together
the criteria we use to do the sorting will affect the outcome of the sorted variables
what is cluster analysis
Humans are good at identifying patterns - e.g., just looking at a residual plot reveals a pattern
it is very difficult to identify patterns mathematically
CA provides you with information that you can use to identify what the patterns are. human-machine work together
what is a dissimilarity matrix
where the larger the number, the more dissimilar our two objects (e.g., the distance between two cities)
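As a concrete sketch (the city names and (x, y) coordinates below are made up purely for illustration), a dissimilarity matrix can be built from pairwise Euclidean distances, so that a bigger entry means two objects are further apart:

```python
import math

# Hypothetical (x, y) map coordinates for three cities -- illustrative only.
cities = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (6.0, 8.0)}

def dissimilarity_matrix(points):
    """Pairwise Euclidean distances: the larger the number, the more dissimilar."""
    names = list(points)
    return {a: {b: math.dist(points[a], points[b]) for b in names} for a in names}

d = dissimilarity_matrix(cities)
print(d["A"]["B"])  # 5.0 -- the distance between A and B
print(d["A"]["A"])  # 0.0 -- every object is identical to itself
```

Note the diagonal is all zeros and the matrix is symmetric; a similarity matrix (e.g., a correlation table) runs the other way, with larger numbers meaning more alike.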
what is a similarity matrix and can you give an example of this
where the larger number indicates two objects are more similar e.g., a correlation table
what do we need to be aware of when running cluster analysis with regards to the matrix
whether it is a dissimilarity or similarity matrix
what does cluster analysis actually do in terms of points of data
it puts points that are most similar together and pushes points most dissimilar apart
clusters things together
what different techniques are used to cluster things together
- k-means clustering - a non-hierarchical method: you decide at the beginning how many clusters you want, run it, then get a suggested membership of data points to clusters
what is k means clustering?
a non-hierarchical clustering method
- we pick a starting number of clusters - e.g., I want 3 clusters
- the algorithm starts by randomly picking 3 cluster centres in your data set
- at each step the clustering algorithm calculates the distance between each data point and each cluster centre, and assigns each data point membership of the nearest cluster
- THEN the cluster centre is moved by a certain algorithm, which calculates whether this improved the distance measure between all the data points and the cluster centre
so the goal is to run an iterative procedure to
- find the cluster centres
- having the goal number (e.g., 3)
- and find the positions of those cluster centres that minimise the distance to all the data points assigned to each cluster
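The iterative procedure above can be sketched in plain Python (the data points and k = 2 below are made-up toy values, and real implementations add extras such as multiple random restarts):

```python
import random

def k_means(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centre,
    then move each centre to the mean of its members, until nothing changes."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(n_iter):
        # assignment step: each data point joins its nearest cluster centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centres[i][0]) ** 2 + (p[1] - centres[i][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centre to the mean of its current members
        new_centres = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:  # no further improvement -> stop
            break
        centres = new_centres
    return centres, clusters

# two obvious blobs of 3 points each; k = 2 should recover them
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = k_means(pts, 2)
```

Note the stopping rule matches the card: the loop ends as soon as moving the centres no longer changes anything.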
Explain what's happening in this k-means clustering slide
Three clusters (k = 3) have been identified; the three cluster centres have been shifted 4 times to find the locations where the data points are closest to their assigned cluster
with k-means clustering, what is shifted around the screen - the data points, to find those fitting best to the clusters, OR the cluster centres, moving until the data points are closest to them?
The data points STAY PUT; it's the cluster centres that shift bit by bit and stop when the data points for the desired number of clusters are closest
with k-means clustering, if the cluster centroid shifts far enough, is it possible for data points to be assigned a different cluster membership?
yes
when does k-means clustering stop
when any further change in the cluster centres doesn't reduce the distances any more.
what does the p-value in cluster analysis tell us
there is no p-value or any statistic of that sort. You are only presented with, e.g., for k-means clustering, a suggestion of cluster membership for the different data points
describe non-hierarchical cluster analysis
non-hierarchical methods
- where clusters are formed by assigning membership to clusters
- you decide how many clusters you want before the analysis, e.g., k-means clustering
- individual data points are assigned to one of the clusters according to some particular criterion
in non-hierarchical cluster analysis, how might you decide on the number of clusters?
- have a certain theory
- use previous literature - look at the number they used
- run it with varying numbers, e.g., 2-5, then see which one gives the most reasonable cluster groups
hierarchical methods for cluster analysis: what are the two groups?
- agglomerative method
- divisive method
any hierarchical method goes from 1 cluster to many, or from many to 1. The result is typically presented as either a dendrogram or an icicle plot; then YOU determine the meaningful number of clusters using a cut-off.
in both cases you get a tree diagram (dendrogram) and an icicle plot - helpful in deciding a feasible cut-off point
hierarchical cluster analysis: agglomerative methods
different types: single link (nearest neighbour), maximum link (furthest neighbour) or average link (centroid clustering) - they differ in the way they compute the distances
- start by treating each data point as a one-member cluster
- then proceed to put things together - agglomerate clusters
- once a pair of objects have been put together - can't split them up again
- means new clusters are formed based on clusters already created at a previous step
hierarchical cluster analysis: divisive methods
- treat all data points as one giant cluster
- then split things up - once a pair has been separated they can never join again
what is the single link aka nearest neighbour technique
one method of hierarchical agglomerative clustering
- start with each city by itself
- then start amalgamating them
- looks at the data, finds the points with the closest relationship to each other (Durham-Sunderland) and groups these together in a cluster
- then the distance matrix is recalculated and it finds the cities that are next in line closest together (treating the Durham-Sunderland cluster as one)
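The steps above can be sketched as a minimal single-link routine. The distance matrix here is illustrative only (not real mileages between these cities); the point is that once two objects are merged they are never split again, and at each step the pair of clusters with the smallest nearest-member distance is fused:

```python
def single_link(dist, labels):
    """Single-link (nearest-neighbour) agglomerative clustering sketch.
    `dist[i][j]` is a symmetric dissimilarity matrix. Returns the merge
    history as (cluster_a, cluster_b, distance) per agglomeration step."""
    clusters = [{i} for i in range(len(labels))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters whose closest members are nearest
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), d))
        clusters[a] |= clusters[b]  # once merged, never split again
        del clusters[b]
    return merges

# Illustrative distances only -- not real distances between these cities.
labels = ["Durham", "Sunderland", "Exeter", "Plymouth", "Birmingham"]
dist = [[0, 12, 330, 360, 180],
        [12, 0, 340, 370, 190],
        [330, 340, 0, 45, 160],
        [360, 370, 45, 0, 170],
        [180, 190, 160, 170, 0]]
merges = single_link(dist, labels)
# first merge pairs the two closest cities: Durham + Sunderland at distance 12
```

With these made-up distances the merge order reproduces the dendrogram on the card: Durham + Sunderland first, then Exeter + Plymouth, then Birmingham joining the (Exeter + Plymouth) cluster.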
single link aka nearest neighbour technique
look at the dendrogram and name the cluster groups in order
- Durham and Sunderland
- Exeter and Plymouth
- Birmingham + (Exeter + Plymouth)
the horizontal axis is a measure of the relative proximity of the variables, e.g., the relationship between Durham and Sunderland is closer than the relationship between Exeter and Plymouth. Knowing the relative distance between cities can help you choose a cut-off point (e.g., a cut-off at about 3 on the x-axis number line would give us only 1 cluster, but if it was at 24 we would have 3)
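To make the cut-off idea concrete on a distance (dissimilarity) axis: every merge that happens below the cut-off fuses two clusters into one, so the cluster count falls as the cut-off rises. A tiny sketch (the merge heights below are made-up values for a hypothetical 5-city run):

```python
def clusters_at_cutoff(n_points, merge_heights, cutoff):
    """Number of clusters left when the dendrogram is cut at `cutoff`:
    each merge below the cut-off turns two clusters into one."""
    return n_points - sum(1 for h in merge_heights if h < cutoff)

# made-up merge heights from a hypothetical 5-city agglomerative run
heights = [12, 45, 160, 180]
print(clusters_at_cutoff(5, heights, 200))  # 1 -- cut above every merge
print(clusters_at_cutoff(5, heights, 100))  # 3 -- only the two small merges happen
```

(If the dendrogram axis is instead scaled as proximity/similarity, the reading flips: cutting at a low proximity value gives few clusters and a high value gives many.)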