# Clustering Flashcards

Learning curve

Shows how accuracy changes with varying sample sizes.

Requires a sampling schedule for creating the learning curve.

Small sample size -> bias in the estimate + high variance of the estimate
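A sampling schedule can be sketched as a geometric progression of sample sizes; a minimal sketch (function name and defaults are illustrative):

```python
def sampling_schedule(n, start=50, factor=2):
    """Geometric schedule of sample sizes: start, start*factor, ..., capped at n."""
    sizes = []
    size = start
    while size < n:
        sizes.append(size)
        size *= factor
    sizes.append(n)  # always evaluate on the full labeled set too
    return sizes
```

The model is then trained and evaluated once per size in the schedule, and accuracy is plotted against sample size.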

Methods of estimation

Partitioning labeled data into:

- training set for model building
- test set for model evaluation

Partitioning techniques:

- holdout
- cross-validation

Holdout

Fixed partitioning: reserve 2/3 for training, 1/3 for testing

Appropriate for large datasets; may be repeated several times (repeated holdout)
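A minimal holdout split, assuming the data fits in a list:

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Fixed partitioning: shuffle, then reserve train_frac for training, the rest for testing."""
    rng = random.Random(seed)   # seeded so each holdout run is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Repeated holdout simply calls this with different seeds and averages the resulting accuracies.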

Cross-validation

Cross-validation:

- partition data into k disjoint subsets

- k-fold: train on k-1 partitions, test on the remaining one, repeat for all folds

- reliable accuracy estimation, but not appropriate for very large datasets

- this is used for estimating accuracy; at the end, the model is retrained on the whole labeled dataset
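A sketch of k-fold cross-validation; `train_and_score` is a hypothetical callback that builds a model on the training part and returns its accuracy on the test part:

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(data, k, train_and_score):
    """For each fold: train on the other k-1 folds, score on the held-out one,
    then average the k scores."""
    folds = kfold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        train = [data[j] for f in folds if f is not test_idx for j in f]
        test = [data[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```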

Leave-one-out

- cross-validation with k = n

- only appropriate for very small datasets

Limitations of accuracy

Cardinality of class 0: 9900
Cardinality of class 1: 100
Model: () -> 0 (always predicts class 0) -> accuracy = 99%

Accuracy is not appropriate for an unbalanced class label distribution or when the classes have different relevance.

Recall, Precision

Recall = (number of objects correctly assigned to c) / (number of objects belonging to c)

Precision = (number of objects correctly assigned to c) / (number of objects assigned to c)

We need to balance the two:

maximize ( F-measure = 2rp/(r+p) )

The F-measure is not always what matters: sometimes we want to maximize just one of the two, or another combined measure.
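The two measures and the F-measure can be computed from parallel 0/1 label lists; a sketch treating class c as label 1:

```python
def recall_precision_f(predicted, actual):
    """Recall, precision and F-measure for class c, encoded as label 1
    in two parallel 0/1 label lists."""
    tp = sum(1 for p, t in zip(predicted, actual) if p == 1 and t == 1)
    r = tp / sum(actual)     # correctly assigned to c / belonging to c
    p = tp / sum(predicted)  # correctly assigned to c / assigned to c
    f = 2 * r * p / (r + p)
    return r, p, f
```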

ROC

y: TPR (True positive rate): TP/(TP+FN)

x: FPR (False positive rate): FP/(FP+TN)

TP: true positive, FN: false negative, FP: false positive, TN: true negative
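A single ROC point from confusion-matrix counts:

```python
def roc_point(tp, fn, fp, tn):
    """One point in ROC space from confusion-matrix counts: (FPR, TPR)."""
    tpr = tp / (tp + fn)  # y axis: fraction of positives caught
    fpr = fp / (fp + tn)  # x axis: fraction of negatives mislabelled
    return fpr, tpr
```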

Hierarchical clustering, intro

Produces a set of nested clusters organized as a hierarchical tree.

We iterate, merging clusters with unclustered points or with other clusters. The algorithm ends when only one cluster remains; we can then cut the dendrogram at a lower level to obtain a number of clusters > 1.

The dendrogram shows on the y axis the distance between the merged clusters, according to the chosen proximity measure.

Memory complexity: O(n^2)

Time complexity: O(n^3), sometimes reduced to O(n^2 log n)

Hierarchical clustering, algo

Basic algorithm is straightforward:

1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat:
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
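The steps above can be sketched naively (MIN/single linkage, recomputing proximities on the fly instead of keeping a matrix):

```python
def agglomerative(points, dist, target_k=1):
    """Naive agglomeration with MIN (single) linkage: every point starts as its
    own cluster, then the two closest clusters are merged until target_k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        def linkage(i, j):  # MIN linkage: distance of the closest member pair
            return min(dist(a, b) for a in clusters[i] for b in clusters[j])
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(*ij),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters
```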

MIN, MAX, Group Average

Proximity measures between clusters. Note that the first merge in MIN is the same as in MAX: both start by finding the minimum distance between single points; the different proximity measures matter only once clusters contain more than one point.

MIN:

- Pros: can handle non-elliptical shapes -> k-means can't

- Cons: sensitive to noise and outliers

MAX: the distance between two clusters is the maximum distance between their points; merge the pair of clusters with the smallest such distance.

- Pros: less susceptible to noise and outliers

- Cons: tends to break large clusters, and is biased towards globular clusters

Group average: Compromise between MIN and MAX

- Pros: less susceptible to noise and outliers

- Cons: biased toward globular clusters

Ward’s Method:

- Similarity of two clusters is based on the increase in squared error when two clusters are merged

- Hierarchical analogue of K-means; can be used to initialize K-means

DBSCAN

The parameters are Eps and MinPts: it uses density to define clusters, and the number of clusters is decided by the algorithm itself.

MinPts is the minimum number of points that must lie within Eps of a point for it to be a core point.

Density: number of points within a specified radius (Eps)

- A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster

- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

- A noise point is any point that is not a core point or a border point.
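The three point types can be computed directly from the definitions; the sketch below counts a point in its own neighbourhood and uses >= MinPts for core points (a common convention):

```python
def classify_points(points, eps, min_pts, dist):
    """Label every point core / border / noise per the DBSCAN definitions."""
    nbrs = {
        i: [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        for i in range(len(points))
    }
    core = {i for i in nbrs if len(nbrs[i]) >= min_pts}
    labels = {}
    for i in nbrs:
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in nbrs[i]):
            labels[i] = "border"  # not dense enough itself, but near a core point
        else:
            labels[i] = "noise"
    return labels
```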

Pros:

- Resistant to Noise

- Can handle clusters of different shapes and sizes

Cons:

- Doesn't recognize clusters with non-uniform densities, and struggles with high-dimensional data.

- We can counter the non-uniform-density problem by applying the algorithm twice: a first run with a high density threshold, then a second run with a lower density threshold on the points the first run labeled as noise.

Determining Eps: compute the distance of every point to its k-th nearest neighbour; sorted in decreasing order, these distances typically show an elbow that suggests a good Eps.
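A sketch of the k-th-nearest-neighbour distances, sorted descending so an elbow in the curve can suggest Eps:

```python
def k_distances(points, k, dist):
    """Distance of every point to its k-th nearest neighbour, sorted in
    decreasing order; an elbow in this curve suggests a value for Eps."""
    kth = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        kth.append(ds[k - 1])
    return sorted(kth, reverse=True)
```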

Measures of cluster validity

Cluster cohesion: measures how closely related the objects in a cluster are (WSS, within-cluster sum of squares)

Cluster separation: measures how distinct or well-separated a cluster is from other clusters (BSS, between-cluster sum of squares)

Silhouette: a succinct measure to evaluate how well each object lies within its cluster

- for each object i

- a(i): the average dissimilarity of i to all other data within the same cluster (the smaller the value, the better the assignment)

- b(i): the lowest average dissimilarity of i to any other cluster of which i is not a member

- the average s(i) over all the data of a cluster measures how tightly grouped the data in the cluster are

- s(i) = (b(i)-a(i))/max{a(i),b(i)}
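The definition translates directly to code; the sketch below assumes at least two clusters, each with at least two objects:

```python
def silhouette_scores(clusters, dist):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every object,
    following the definitions of a(i) and b(i) above."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for xi, x in enumerate(cluster):
            same = [p for pi, p in enumerate(cluster) if pi != xi]
            a = sum(dist(x, p) for p in same) / len(same)
            b = min(sum(dist(x, p) for p in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return scores
```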

The validation of clustering structures is the most difficult and frustrating part of cluster analysis.

Application of cluster analysis

- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations
- Summarization: reduce size of data sets

Types of clustering

Partitional: division into non-overlapping subsets, such that each data object is in exactly one subset

Hierarchical: a set of nested clusters

Exclusive vs non-exclusive

Fuzzy: a point belongs to every cluster with some weight

Partial versus complete: cluster only some data

Heterogeneous: widely different clusters

K-means

Partitional clustering approach

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest centroid

Number of clusters, K, must be specified

The basic algorithm is very simple:

1. Select k points as initial centroids

2. Repeat:

3. Form k clusters by assigning all points to the closest centroid

4. Recompute the centroid of each cluster

5. Until the centroids don't change

Choosing the initial centroids is very important, together with k, the number of centroids.
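The five steps above on 1-D data, with random initial centroids (a minimal sketch):

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on 1-D data following the five steps above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # 1. select k points as initial centroids
    clusters = []
    for _ in range(max_iter):          # 2. repeat
        clusters = [[] for _ in range(k)]
        for p in points:               # 3. assign each point to the closest centroid
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centroids[i]  # 4. recompute centroids
               for i, c in enumerate(clusters)]
        if new == centroids:           # 5. until the centroids don't change
            break
        centroids = new
    return centroids, clusters
```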