# Clustering Flashcards

1
Q

Learning curve

A

Shows how accuracy changes with varying sample sizes.
Requires a sampling schedule for creating the learning curve.
Small sample size -> bias in the estimate + high variance of the estimate

2
Q

Methods of estimation

A
```Partitioning labeled data into:
- training set for model building
- test set for model evaluation
Partitioning techniques:
- holdout
- cross-validation```
3
Q

Holdout

A

Fixed partitioning: reserve 2/3 for training, 1/3 for testing

Appropriate for large datasets: may be repeated several times (repeated holdout)

4
Q

Cross-validation

A

Cross-validation:
- partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one, repeat for all folds
- reliable accuracy estimation, but not appropriate for very large datasets
- this is used for estimating accuracy, but at the end the model is retrained on the whole labeled dataset
Leave-one-out:
- cross-validation with k = n
- only appropriate for very small datasets
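The k-fold scheme can be sketched in a few lines of plain Python; `k_fold_splits` is an illustrative name, and the folds here are simple index stripes rather than stratified samples:

```python
# Minimal k-fold partitioning sketch (illustrative, not a library API).

def k_fold_splits(n, k):
    """Partition indices 0..n-1 into k disjoint folds; yield (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        # training set = all folds except the i-th one
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, test

# With n=10 and k=5, every index lands in exactly one test fold.
splits = list(k_fold_splits(10, 5))
```

Leave-one-out is then just `k_fold_splits(n, n)`: each test fold holds a single index.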

5
Q

Limitations of accuracy

A
```Cardinality class 0: 9900
Cardinality class 1: 100
Model: () -> 0 (predicts 0 for every record, yet scores 99% accuracy)
Accuracy is not appropriate for unbalanced class label distributions and differing class relevance```
6
Q

Recall, Precision

A
```Recall = Number of objects assigned to c / Number of objects belonging to c
Precision = Number of objects assigned correctly to c / Number of objects assigned to c```

We need to balance the two:
maximize ( F-measure = 2rp/(r+p) )

The F-measure is not always what matters: sometimes we want to maximize one of the two, or another combined measure.
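As a hedged sketch, the three measures can be computed from raw confusion counts; the function name and count values are made up for this example:

```python
# Precision, recall and F-measure from confusion counts (illustrative names).

def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)   # correctly assigned to c / all assigned to c
    recall = tp / (tp + fn)      # correctly assigned to c / all belonging to c
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
# p = 0.8, r = 80/120, and f sits between the two
```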

7
Q

ROC

A

y: TPR (True positive rate): TP/(TP+FN)
x: FPR (False positive rate): FP/(FP+TN)
TP: true positive, FN: false negative, FP: False positive
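One ROC point can be computed from the confusion counts, as a hedged illustration; a full ROC curve would sweep the classifier's decision threshold. The function name and counts are made up:

```python
# One point of a ROC curve from confusion counts (illustrative names).

def roc_point(tp, fn, fp, tn):
    tpr = tp / (tp + fn)   # y axis: true positive rate
    fpr = fp / (fp + tn)   # x axis: false positive rate
    return fpr, tpr

fpr, tpr = roc_point(tp=90, fn=10, fp=30, tn=70)
# fpr = 0.3, tpr = 0.9: above the diagonal, so better than random guessing
```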

8
Q

Hierarchical clustering, intro

A

Produces a set of nested clusters organized as a hierarchical tree.
We iterate, merging nested clusters with unclustered points or with other nested clusters. The algorithm ends when there is only one cluster; we can then cut the dendrogram at a lower level to get a number > 1 of clusters.
The dendrogram shows on the y axis the logical distance between the merged clusters, by means of the chosen metric.
Memory complexity: O(n^2)
Time complexity: O(n^3), sometimes reduced to O(n^2 log n)

9
Q

Hierarchical clustering, algo

A
```Basic algorithm is straightforward:
Compute the proximity matrix
Let each data point be a cluster
Repeat
  Merge the two closest clusters
  Update the proximity matrix
Until only a single cluster remains```
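The pseudocode above can be sketched concretely with MIN (single-link) proximity; the name `single_link_clustering` and the `num_clusters` stop criterion (instead of cutting a dendrogram afterwards) are illustrative choices. Pure Python and O(n^3), matching the complexity noted on the previous card:

```python
import math

# Agglomerative clustering sketch with MIN (single-link) proximity.

def single_link_clustering(points, num_clusters):
    clusters = [[p] for p in points]           # each point starts as its own cluster
    def cluster_dist(c1, c2):                  # MIN: distance of the closest pair
        return min(math.dist(p, q) for p in c1 for q in c2)
    while len(clusters) > num_clusters:
        # find and merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```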
10
Q

MIN, MAX, Group Average

A

Proximity measures between clusters. Note that the first merge step in MIN is the same as in MAX: both merge the pair of single points at minimum distance; the measures only differ once clusters contain more than one point.
MIN (single link): cluster proximity is the minimum distance between points of the two clusters
- Pros: can handle non-elliptical shapes -> k-means can’t
- Cons: sensitive to noise and outliers
MAX (complete link): cluster proximity is the maximum distance between points of the two clusters; merge the pair for which this value is smallest
- Pros: less susceptible to noise and outliers
- Cons: tends to break large clusters, and is biased towards globular clusters
Group average: compromise between MIN and MAX
- Pros: less susceptible to noise and outliers
- Cons: biased toward globular clusters
Ward’s Method:
- Similarity of two clusters is based on the increase in squared error when two clusters are merged
- Analogue of K-means, can be used to initialize K-means

11
Q

DBSCAN

A

The parameters are Eps and MinPts: the algorithm uses density to define clusters, so the number of clusters is decided by the algorithm, not by the user.
MinPts is the minimum number of points required within a point’s Eps-neighborhood for that point to be a core point.
Density: number of points within a specified radius (Eps)
- A point is a core point if it has more than a specified number of points (MinPts) within Eps: these are points in the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point.

Pros:
- Resistant to Noise
- Can handle clusters of different shapes and sizes
Cons:
- Doesn’t recognize clusters with non-uniform densities, and struggles with high-dimensional data.
- We can counter the former by applying the algorithm twice: once with one density threshold, then again with a higher density on the noise points of the first run.

Determining Eps: plot the distance of every point to its k-th nearest neighbour.
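The core/border/noise logic above can be sketched compactly; `dbscan`, the quadratic neighbour search, and the use of -1 as the noise label are illustrative choices for this example, not an optimized implementation:

```python
import math

# DBSCAN sketch: labels[i] is the cluster of point i, -1 marks noise.

def dbscan(points, eps, min_pts):
    n = len(points)
    labels = [None] * n                        # None = unvisited, -1 = noise
    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:                # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                           # i is a core point: start a cluster
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:                # noise reachable from a core point
                labels[j] = cluster            # ...becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:         # j is also a core point: expand
                seeds.extend(j_nbrs)
    return labels
```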

12
Q

Measures of cluster validity

A

Cluster cohesion: measures how closely related the objects in a cluster are (WSS)
Cluster separation: measures how distinct or well-separated a cluster is from the other clusters (BSS)
Silhouette: a succinct measure to evaluate how well each object lies within its cluster
- for each object i:
- a(i): the average dissimilarity of i with all other data within the same cluster (the smaller the value, the better the assignment)
- b(i): the lowest average dissimilarity of i to any other cluster, of which i is not a member
- s(i) = (b(i)-a(i))/max{a(i),b(i)}
- the average s(i) over all the data of a cluster measures how tightly grouped the data in the cluster is

The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
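The a(i)/b(i)/s(i) definitions translate directly into code; this sketch assumes every cluster has at least two members (so a(i) is defined), and `silhouette` is an illustrative name:

```python
import math

# Silhouette sketch: labels[i] is the cluster of point i.

def silhouette(points, labels):
    n = len(points)
    def avg_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)
    scores = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        a = avg_dist(i, same)                      # a(i): within-cluster dissimilarity
        b = min(avg_dist(i, [j for j in range(n) if labels[j] == c])
                for c in set(labels) if c != labels[i])  # b(i): nearest other cluster
        scores.append((b - a) / max(a, b))         # s(i) in [-1, 1]
    return scores
```

Values near 1 mean the object is much closer to its own cluster than to any other; values near 0 or below suggest a poor assignment.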

13
Q

Application of cluster analysis

A
• Understanding: group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations
• Summarization: reduce size of data sets
14
Q

Types of clustering

A

Partitional: division into non-overlapping subsets, such that each data object is in exactly one subset
Hierarchical: a set of nested clusters
Exclusive vs non-exclusive
Fuzzy: a point belongs to every cluster with some weight
Partial versus complete: cluster only some of the data
Heterogeneous: widely different clusters

15
Q

K-means

A

Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple:
1. Select k points as the initial centroids
2. repeat
3.   form k clusters by assigning all points to the closest centroid
4.   recompute the centroid of each cluster
5. until the centroids don’t change

Choosing the initial centroids is very important, together with k, the number of centroids.
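The five steps above can be sketched as follows; the optional `init` argument (an assumption of this example, not part of the standard algorithm statement) makes the initial centroids explicit, since their choice matters:

```python
import math
import random

# Minimal k-means sketch; loops until the assignments (and hence centroids) stop changing.

def kmeans(points, k, init=None):
    centroids = list(init) if init else random.sample(points, k)  # step 1
    assignment = None
    while True:
        # step 3: assign every point to its closest centroid
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:          # step 5: nothing changed, stop
            return centroids, assignment
        assignment = new_assignment
        # step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
```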

16
Q

Evaluating K-means clusters

A

Most common way: Sum of Squared Errors (SSE) -> for each point, the error is the distance to the nearest centroid -> SSE = Σ_{i=1..K} Σ_{x∈Ci} dist²(mi, x)
- x is a datapoint in cluster Ci and mi is the representative point for Ci.
- Given two clusters, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters.
- A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
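The SSE formula above translates directly into code; the function name and the tiny worked example are illustrative:

```python
import math

# SSE: squared distance from each point x to the representative point m_i
# of its cluster C_i, summed over all points.

def sse(points, assignment, centroids):
    return sum(math.dist(p, centroids[c]) ** 2
               for p, c in zip(points, assignment))

# Two points at distance 1 from their shared centroid contribute 1 + 1 = 2.
error = sse([(0, 0), (2, 0)], [0, 0], [(1, 0)])
```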

17
Q

K-means parameter setting

A
• We can plot the quality measure (usually SSE) against k

- We choose the k at which adding another centroid yields a negligible gain, and the remaining loss of quality is acceptable

18
Q

K-means solutions to inital centroids problem

A
• Multiple runs
• sample the data and use hierarchical clustering to pick initial centroids
• select more than k initial centroids, then keep the most widely separated among them
• Bisecting K-means: not as susceptible to initialization issues
• Postprocessing: eliminate small clusters that may represent outliers, split loose clusters, merge clusters that are close; these steps can also be used during the clustering process.
19
Q

Bisecting K-means

A

At each iteration it generates only 2 clusters, with randomly positioned centroids, then selects the cluster with the worst SSE, splits it in 2, and repeats.
Stop when the total SSE is good enough.

20
Q

Limitations of k-means

A

K-means has problems when clusters are of differing

• Sizes
• Densities
• Non-globular shapes

Since every datapoint has to be part of a cluster, outliers are not handled well.

21
Q

Bayesian classification, theorem

A

Let C and X be random variables
P(C,X) = P(C|X) P(X)
P(C,X) = P(X|C) P(C)
Hence P(C|X) P(X) = P(X|C) P(C) and also P(C|X) = P(X|C) P(C) / P(X)

22
Q

Bayesian classification, construction

A
```C = any class label
X =  record to be classified```

What to do:

• compute P(C|X) for all classes
• probability that record X belongs to C
• assign X to the class with maximal P(C|X)

Applying Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
- P(X) constant for all C, disregarded for maximum computation
- P(C) a priori probability of C
P(C) = Nc/N

Estimating P(X|C) -> naïve hypothesis: P(x1,…,xk|C) = P(x1|C) P(x2|C) … P(xk|C) (statistical independence of the attributes x1,…,xk)
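The recipe above (priors from counts, the naive independence hypothesis, argmax over classes) can be sketched for discrete attributes; `train_nb` is an illustrative name and no smoothing is applied, so an unseen attribute value zeroes out a class score:

```python
from collections import Counter, defaultdict

# Naive Bayes sketch for discrete attributes.

def train_nb(records, labels):
    n = len(labels)
    class_counts = Counter(labels)                             # Nc per class
    prior = {c: class_counts[c] / n for c in class_counts}     # P(C) = Nc/N
    cond = defaultdict(Counter)                                # (class, attr k) -> value counts
    for x, c in zip(records, labels):
        for k, v in enumerate(x):
            cond[(c, k)][v] += 1
    def classify(x):
        def score(c):                                          # P(C) * prod_k P(xk|C)
            p = prior[c]
            for k, v in enumerate(x):
                p *= cond[(c, k)][v] / class_counts[c]         # P(xk|C) = |xkC| / Nc
            return p
        return max(prior, key=score)                           # argmax P(C|X); P(X) dropped
    return classify
```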

23
Q

Bayesian classification, calculating P(xk|C)

A
```For discrete attributes P(xk|C) = |xkC| / Nc
where |xkC| is the number of instances having value xk for attribute k and belonging to class C```

for continuous attributes, use probability distribution

24
Q

Support Vector Machines

A

Defines a hyperplane that maximizes the margin between two sets.
There are also nonlinear support vector machines, which use kernels to map the sets into a space where the problem becomes linear.

There are lots of parameters to tune.

25
Q

K-nearest neighbors

A

Uses the K nearest neighbors of the data point to be classified, and assigns it to the majority-voted class under a given distance measure.

The parameter K is not easy to select.
K too small -> sensitive to noise
K too big -> the neighbourhood may include points from other classes.

Pro:
- incremental (I can add training data without recalculating the model)
- model does not exist (no training time)
Cons:
- Model is not interpretable
- Classification is not fast
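The voting scheme above is short enough to sketch in full; Euclidean distance and the name `knn_classify` are assumptions of this example:

```python
import math
from collections import Counter

# k-nearest-neighbour classification sketch with majority vote.

def knn_classify(train_points, train_labels, query, k):
    # sort training data by distance to the query and keep the k closest
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note there is no training step at all, which is exactly the "model does not exist" property: all the work happens at classification time, which is why classification is not fast.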

26
Q

Accuracy

A

(TP+TN)/(TP+TN+FP+FN)