VL 12 Flashcards

1
Q

What is clustering, and why do you use it?

A

Clustering is a technique that groups similar data points together without relying on predefined categories. It helps find patterns and structures within datasets for various applications.

  • need for generalisation, grouping, classification
  • 1 or 2 variables –> humans can cluster visually (plots)
  • 3, 4, … variables (coplot, image, pairs, PCA, MDS) –> computers are better suited to do the clustering (see the sketch below)
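A minimal R sketch of this idea, assuming the built-in iris data as a stand-in (my example, not from the lecture):

    # two variables: humans can spot groups in a simple scatter plot
    plot(iris$Petal.Length, iris$Petal.Width)

    # four variables: pairs() shows all pairwise scatter plots at once
    pairs(iris[, 1:4])

    # or reduce the dimensions first, e.g. with PCA, and plot the projection
    pca = prcomp(iris[, 1:4], scale. = TRUE)
    plot(pca$x[, 1:2])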
2
Q

Types of clustering

A
  • Hierarchical clustering:
    a set of nested clusters organized as a hierarchical tree
    –> we get a dendrogram and obtain cluster ids by cutting the dendrogram
  • Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
    –> we only get cluster ids (see the sketch below)
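A hedged sketch of the contrast in R, with the built-in iris data standing in for a real dataset:

    d = dist(iris[, 1:4])                  # dissimilarity matrix

    # hierarchical: build the full tree, then get cluster ids by cutting it
    hc = hclust(d)
    plot(hc)                               # dendrogram
    id.h = cutree(hc, k = 3)               # cluster ids from dendrogram cutting

    # partitional: k is fixed in advance, only cluster ids are returned
    id.p = kmeans(iris[, 1:4], centers = 3)$cluster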
3
Q

What is hierarchical clustering?

A
  • no need to specify the number of clusters k before clustering starts
  • the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
  • on one end of the tree there are n clusters, each with one object; on the other end there is one cluster containing all n objects (see the sketch below)
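A minimal sketch (assuming iris as example data) of how one tree implicitly contains every value of k:

    hc = hclust(dist(iris[, 1:4]))

    # the same tree can be cut at any level
    cutree(hc, k = 2)            # 2 clusters
    cutree(hc, k = 5)            # 5 clusters
    cutree(hc, k = nrow(iris))   # n clusters, one object each (one end of the tree)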
4
Q

What do we need to cluster?

A
  • needed for the clustering is a complete set, i.e. a matrix, of the (dis)similarities between all objects
  • dissimilarity coefficients may be obtained from the computation of distances (see last section)
  • Euclidean distance, Manhattan distance, and correlation-coefficient-based distances are usually used (see the sketch below)
  • high-quality clusters have:
    – high intra-class similarity
    – low inter-class similarity
  • good: small circles (compact clusters), long lines (well-separated clusters)
  • bad: large circles, short lines
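A hedged sketch of the usual distance choices in R (iris again as stand-in data; 1 - Pearson correlation is one common correlation-based construction):

    x = as.matrix(iris[, 1:4])

    d.euc = dist(x, method = "euclidean")   # Euclidean distance
    d.man = dist(x, method = "manhattan")   # Manhattan distance

    # correlation-based distance between row items: 1 - Pearson correlation
    d.cor = as.dist(1 - cor(t(x)))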
5
Q

Distance Matrix

A
  • you can generate the distance matrix from your data with the dist function:
    dist.mt = dist(data, method = "manhattan")
  • be careful to create the distance matrix for the row or the column items; transpose your matrix if you want to switch between row and column item distances
  • if the distance matrix is not generated by the dist function: use dist.mt = as.dist(matrix) to generate a distance matrix object from a "manually" made distance matrix (see the sketch below)
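A minimal runnable sketch of these three points; data here is a hypothetical placeholder matrix:

    data = matrix(rnorm(40), nrow = 8)               # hypothetical data: 8 row items, 5 variables

    dist.rows = dist(data, method = "manhattan")     # distances between the 8 row items
    dist.cols = dist(t(data), method = "manhattan")  # transposed: distances between the 5 column items

    # a "manually" made symmetric matrix must be converted with as.dist
    m = as.matrix(dist.rows)
    dist.mt = as.dist(m)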
6
Q

Agglomerative Clustering

A
  • a distance matrix is required
  • agglomerative methods start with n clusters and proceed by successive fusions until a single cluster is obtained
  • 1st: the two "most similar" objects are joined into a cluster
  • 2nd: the distance matrix is reduced as the rows of the joined objects are merged
  • hclust: hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it (see the sketch below)
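A hedged minimal example of this workflow with hclust, using the built-in USArrests data as a stand-in:

    d = dist(USArrests)   # n x n dissimilarity matrix (n = 50 states)
    hc = hclust(d)        # successive fusions: n clusters down to 1
    plot(hc)              # dendrogram showing the fusion order and heights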
7
Q

hclust: hierarchical agglomerative clustering

A
  • performed on a distance matrix, which is stepwise reduced
  • repeated recalculation of the distances from the merged rows to all remaining rows
  • the linkage method decides which distance value is used when merging the closest rows:
    – single linkage: use the smallest distance value
    – complete linkage: use the largest distance value
    – average linkage: use the average distance value
  • the default for hclust is complete linkage
  • computationally intensive
  • don't do this for 20,000 genes, but do this for 100 samples
  • transpose the matrix correctly so that you cluster samples, not genes (see the sketch below)
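A hedged sketch of the linkage options; a small random matrix stands in for real expression data (genes in rows, samples in columns):

    expr = matrix(rnorm(2000), nrow = 200,
                  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))

    d = dist(t(expr))     # transposed: cluster the 10 samples, not the 200 genes

    hc.complete = hclust(d)                      # default: complete linkage
    hc.single   = hclust(d, method = "single")   # smallest distance value
    hc.average  = hclust(d, method = "average")  # average distance value

    plot(hc.complete)     # dendrogram of the samples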