Trees - cluster analysis Flashcards
What is the earliest quantitative method of tree construction
cluster analysis
What does cluster analysis look at
overall similarity - how much like each other are things
What is the assumption behind cluster analysis
species that share a most recent common ancestor should be more similar to each other than to any other species
How do you do a cluster analysis from a character matrix
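you have to convert it to a similarity matrix or to a dissimilarity (distance) matrix. A simple distance is the Hamming distance (the number of characters at which two taxa differ); divided by the number of characters compared, it is the p-distance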
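The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the character matrix is made-up data and the function names (hamming, p_distance, distance_matrix) are hypothetical.

```python
from itertools import combinations

# Hypothetical character matrix: taxon -> string of character states (made-up data)
characters = {
    "A": "0011",
    "B": "0010",
    "C": "1100",
}

def hamming(a, b):
    """Count the positions at which two equal-length state strings differ."""
    if len(a) != len(b):
        raise ValueError("state strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def p_distance(a, b):
    """Proportion of differing positions (Hamming distance / number of characters)."""
    return hamming(a, b) / len(a)

def distance_matrix(chars, dist=p_distance):
    """Return {(taxon1, taxon2): distance} for every unordered pair of taxa."""
    return {(s, t): dist(chars[s], chars[t])
            for s, t in combinations(sorted(chars), 2)}

print(distance_matrix(characters))
```

With this toy data, A and B differ at one of four characters (p-distance 0.25), while A and C differ at all four (p-distance 1.0).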
What are some methods of performing a cluster analysis
- least squares method
- NJ (neighbor joining)
- UPGMA
What are the criticisms by cladists when it comes to a distance matrix for cluster analysis
- there is a loss of information: no distinction is made between shared derived characteristics (synapomorphies) and shared primitive characteristics (symplesiomorphies)
The Mean character difference used for cluster analysis is also called what
- Manhattan (city-block) distance, from taxicab geometry; strictly, the mean character difference is the Manhattan distance divided by the number of characters. By contrast, the Euclidean distance is the hypotenuse (the straight-line distance): think sqrt(character 1 difference squared + character 2 difference squared + ...)
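A small sketch of the difference between the two distances, assuming two taxa scored for two numeric characters. The values and the function names are made up for illustration.

```python
import math

def manhattan(x, y):
    """City-block distance: sum of absolute character differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mean_character_difference(x, y):
    """Manhattan distance averaged over the number of characters."""
    return manhattan(x, y) / len(x)

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical measurements for two taxa (character 1, character 2)
taxon1 = [1.0, 5.0]
taxon2 = [4.0, 1.0]

print(manhattan(taxon1, taxon2))                  # 3 + 4
print(mean_character_difference(taxon1, taxon2))  # (3 + 4) / 2
print(euclidean(taxon1, taxon2))                  # sqrt(3^2 + 4^2)
```

Here the character differences are 3 and 4, so the Manhattan distance is 7, the mean character difference is 3.5, and the Euclidean distance (the hypotenuse of the 3-4-5 triangle) is 5.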
The UPGMA method for clustering is usually attributed to what people
Sokal and Michener
What is a major problem with the UPGMA method
- it assumes that all groups evolve at the same rate (a clock), which is often not true, so it does not account for unequal divergence rates
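To see why the equal-rate assumption matters, here is a minimal UPGMA sketch. It is an illustration under stated assumptions, not a reference implementation: the function name upgma, the nested-tuple tree format, and the distance matrix are all hypothetical. The distances were made up to fit a tree ((A,B),(C,D)) in which B and D sit on long branches (A = 1, B = 5, C = 1, D = 5, internal branch = 1).

```python
def upgma(names, D):
    """Cluster taxa by UPGMA.

    names: list of taxon labels; D: symmetric list-of-lists distance matrix.
    Returns (tree as nested tuples, height of the root).
    """
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    D = [row[:] for row in D]                 # work on a copy
    height = 0.0
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of active clusters
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda p: D[p[0]][p[1]])
        height = D[i][j] / 2  # the clock assumption: both sides get equal depth
        (ti, ni), (tj, nj) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average = arithmetic mean over all leaf pairs
        new_row = [(ni * D[i][k] + nj * D[j][k]) / (ni + nj) for k in keep]
        # Rebuild the matrix without i and j, then add the merged cluster
        D = [[D[r][c] for c in keep] for r in keep]
        for row, value in zip(D, new_row):
            row.append(value)
        D.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((ti, tj), ni + nj)]
    return clusters[0][0], height

names = ["A", "B", "C", "D"]
D = [[0, 6, 3, 7],
     [6, 0, 7, 11],
     [3, 7, 0, 6],
     [7, 11, 6, 0]]
print(upgma(names, D))
```

On this made-up matrix UPGMA joins A with C first, because they are the most similar pair, even though the distances were generated from a tree in which A's sister is B. The short-branched taxa are grouped together simply because they have changed less.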
What clustering algorithms try to compensate for unequal divergence rates unlike UPGMA
- least squares methods: the best tree is the one that minimizes the sum of the squared differences between the observed distances Dij and the distances dij predicted by the tree (the path lengths along its branches)
- neighbor joining (NJ) method (Saitou and Nei): this works by clustering but does not assume a clock. It generally performs better than UPGMA when rates are unequal
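As a rough sketch of how neighbor joining avoids the clock assumption, the code below shows only its first step: choosing the pair to join by correcting each distance for how far both taxa are from everything else, then assigning unequal branch lengths to the two joined taxa. The function names are hypothetical, and the matrix is made-up data fitted to a tree ((A,B),(C,D)) with branch lengths A = 1, B = 5, C = 1, D = 5 and an internal branch of 1.

```python
def nj_pick_pair(D):
    """Return ((i, j), Q) for the pair minimizing the NJ criterion
    Q(i, j) = (n - 2) * D[i][j] - sum(D[i]) - sum(D[j])."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best, best_q

def nj_branch_lengths(D, i, j):
    """Branch lengths from taxa i and j to their new parent node.
    The two lengths may differ, so no clock is assumed."""
    n = len(D)
    ri, rj = sum(D[i]), sum(D[j])
    li = D[i][j] / 2 + (ri - rj) / (2 * (n - 2))
    return li, D[i][j] - li

# Rows and columns in the order A, B, C, D (made-up distances)
D = [[0, 6, 3, 7],
     [6, 0, 7, 11],
     [3, 7, 0, 6],
     [7, 11, 6, 0]]

pair, q = nj_pick_pair(D)
print(pair, q)
print(nj_branch_lengths(D, *pair))
```

On this matrix the raw closest pair is A and C (distance 3), but the corrected criterion selects A and B (indices 0 and 1), and the branch-length step recovers the unequal lengths 1 and 5. A full NJ implementation would then replace the pair with its parent node and repeat.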
Describe the least square method
this is a method where the best tree is the one that minimizes the sum of the squared differences between the observed Dij values (for example, Euclidean distances) and the values dij predicted by the tree
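The criterion can be written as SS = sum over all pairs i < j of (Dij - dij)^2. A minimal sketch of scoring candidate trees this way follows. The function name and all three matrices are made up for illustration; a real least squares method would also optimize the branch lengths of each candidate tree before comparing scores.

```python
def least_squares_score(observed, predicted):
    """Sum of squared differences between observed distances (Dij)
    and tree path-length distances (dij), over all pairs i < j."""
    n = len(observed)
    return sum((observed[i][j] - predicted[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Observed distances among four taxa A, B, C, D (made-up data)
observed = [[0, 6, 3, 7],
            [6, 0, 7, 11],
            [3, 7, 0, 6],
            [7, 11, 6, 0]]

# Path-length distances implied by two hypothetical candidate trees
tree_unequal_rates = [[0, 6, 3, 7],
                      [6, 0, 7, 11],
                      [3, 7, 0, 6],
                      [7, 11, 6, 0]]
tree_clocklike = [[0, 6.5, 3, 8],
                  [6.5, 0, 6.5, 8],
                  [3, 6.5, 0, 8],
                  [8, 8, 8, 0]]

print(least_squares_score(observed, tree_unequal_rates))
print(least_squares_score(observed, tree_clocklike))
```

The first candidate reproduces the observed distances exactly and scores 0; the clock-like candidate scores 14.5, so the first tree is preferred under this criterion.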
What are the main criticisms of distance based approaches
some information about the data may be lost in the conversions, such as going from a character matrix to a distance matrix and then to a tree
and the assumption of equal rates (made by UPGMA) is questionable
What is the main advantage of distance based approaches
they are fast, and some methods, like UPGMA and NJ, give a single tree as the answer