Clustering final Flashcards

1
Q

What is clustering all about?

A

Clustering is about dividing data into groups that are meaningful or useful or both.

If “meaningful” groups of data are the goal, we want the clustering to capture the natural structure of the data.

However, if we look at clustering as “useful”, one can imagine creating clusters with prototypes that are essentially used for summarization and aggregation.

2
Q

What can we say about a cluster?

A

A cluster contains data instances that all share certain similarities; in many ways, the instances are similar to each other.

3
Q

Elaborate on clustering for prototypes

A

Clustering for prototypes refers to clustering with the goal of creating k prototypes that summarize the entire data set. Instead of keeping literally millions of data points, we can create k clusters and obtain k prototype points that represent the whole collection.

4
Q

What is the goal of clustering?

A

The goal can be described as identifying groups of data so that the data in a group is very similar to the other data in the same group, and at the same time very different from the data in other groups/clusters.

5
Q

Are clusters easily definable?

A

No. It is not always easy to understand what is and what is not a cluster. There are many different patterns that can be difficult to interpret.

6
Q

another (informal) word for clustering

A

unsupervised classification

7
Q

What is a clustering?

A

A clustering is the entire collection of clusters (all of them, if k > 1). It is simply the plural notion.

8
Q

Elaborate on hierarchical vs partitional clusterings

A

Nested vs unnested.

A partitional clustering is a clustering where we divide the data into non-overlapping subsets.

A hierarchical clustering is a set of nested clusters organized at multiple levels. In a hierarchical clustering, each node (which is a cluster), except for the leaf nodes, is the union of its children clusters.

Hierarchical clusterings are typically pictured as trees. It is also common for the leaf nodes to be singleton clusters.

9
Q

Elaborate on exclusive vs overlapping vs fuzzy clusterings

A

Exclusive: assigns each data point to one cluster, and only one cluster.

Overlapping: allows data points to be assigned to several clusters. Ex: a person can be a student AND a worker.

Fuzzy: every data object belongs to every cluster, but with a membership weight (“strength”) between 0 and 1. 1 is complete inclusion, while 0 means the object does not belong at all.

10
Q

Elaborate on complete vs partial clustering

A

Complete clustering assigns every object to some cluster, while partial clustering does not. The motivation behind partial clusterings is that some objects simply do not fit into any of the clusters, and if we force them in, we distort the cluster identity.

11
Q

there are different types of clusters. Name them

A

Well-separated. The distance between any two points in different groups is larger than the distance between any two points in the same group.
Well-separated clusters do not need to have a globular shape.

Prototype-based. Each data point in a cluster is in that cluster because it is closer to the prototype of this specific cluster than to any other prototype.
The prototype is often a centroid, especially for continuous attributes.
In other cases, e.g. with categorical data, the prototype can be a medoid.
Prototype-based clusters are also called center-based.

Graph-based. This type is defined by how cluster membership is determined. Graph-based methods are very flexible in that they allow every member of a cluster to take part directly in determining whether a new point should be accepted or not. This contrasts with the prototype-based view, where a single point (the centroid) makes the decision. Because of this, graph-based methods can capture non-globular shapes very well; with single link (MIN), they can create stringy and elongated shapes.
Graph-based methods create clusters as connected components.

Density-based. We consider a cluster to be a region that is dense in data points compared to the region outside the cluster.

12
Q

There are 3 algorithms of interest. Which ones?

A

K-means

AHC: Agglomerative Hierarchical Clustering

DBSCAN

13
Q

Elaborate on k-means procedure

A

We first choose a number of clusters, k.

We spread out k centroids.

Each point in the entire data set is assigned to its closest centroid.

We update each centroid based on its entire cluster.

We keep the centroids, and re-assign all data points, essentially giving them a chance to change cluster membership to a more suitable one.

Repeat centroid update.

Repeat until convergence.
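A minimal NumPy sketch of this loop (the function name, the toy random-point initialization, and the convergence test are my own illustrative choices, not a prescribed implementation):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # "Spread out" k centroids: here simply k random data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assign every point to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update each centroid to the mean of its current cluster.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            # Stop when the centroids no longer move (convergence).
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels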

14
Q

What is the centroid in k-means

A

The mean of a cluster

15
Q

in k-means, when does “most happen”?

A

In the early stages.

Because of this, we typically only repeat the k-means steps until the change between iterations falls below a certain threshold, rather than running all the way to full convergence.

16
Q

what do we “need” to run k-means?

A

A proximity measure, for instance the Euclidean distance.

Cosine similarity can be more useful for documents.

Others include the Jaccard measure and the Manhattan distance.

We also need an OBJECTIVE FUNCTION, for instance “minimize the sum of squared distances”.

17
Q

Elaborate on SSE

A

Sum of squared error.

We use it like this:
We consider the “error” to be the distance between a data point and its closest prototype. The goal is therefore to minimize the total sum of squared errors: we sum the squared error over all data points and use this value as the quantity to minimize. The smaller the SSE, the better the clustering.
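In symbols (a standard formulation written in LaTeX; c_i is the prototype of cluster C_i and dist is the chosen proximity measure):

    SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2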

18
Q

What is the problem with k-means

A

It is difficult to find good starting locations for the centroids.

When we initialize k-means with random centroid locations, we get different results each time we run it.

One simple way of mitigating this is to run k-means multiple times and keep the best clustering we find according to SSE. This is essentially a hope-and-pray method, and it has no guarantees.

19
Q

What do we typically do to combat the problems of k-means

A

k-means++. This is a method for initializing k-means.

The procedure is guaranteed (in expectation) to find a clustering that is within a factor of O(log k) of optimal with respect to SSE. This is really good.

k-means++ picks centroids incrementally until k centroids have been picked. At every step, each point has a probability of being picked as the next centroid that is proportional to the square of its distance to its closest already-chosen centroid.
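A short sketch of that selection step (my own minimal version, assuming squared Euclidean distance):

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # First centroid: a uniformly random data point.
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from every point to its closest chosen centroid.
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
            # Pick the next centroid with probability proportional to d^2.
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centroids)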

20
Q

What is the space complexity of k-means

A

O((m+k)n), where m is the number of data points, k is the number of clusters, and n is the number of attributes.

21
Q

What is the time complexity of k-means?

A

Basically linear in terms of the size of the data set:

O(I × K × m × n), where I is the number of iterations needed for convergence, K the number of clusters, m the number of data points, and n the number of attributes.

22
Q

Name some design issues with k-means

A

Potential of empty clusters. In such cases, all we really need to do is assign a replacement centroid.

Outliers. We typically want to eliminate them in preprocessing, since they distort the centroids.
However, certain tasks require keeping the outliers, for instance data compression.

Reducing SSE without increasing the number of clusters. We know that we can reduce SSE by increasing k, but increasing k is not always desirable. Therefore, we need methods that decrease SSE while keeping k low. One common approach is to alternate between splitting and merging clusters (post-processing), guided by SSE.

23
Q

Elaborate on bisecting k means

A

Bisecting k-means can be very useful if we do not know how many clusters we want. We can run it, producing more and more clusters, until the reduction in SSE becomes insignificant (the elbow).

The procedure: start with a single cluster (all points belong to the same cluster) and split it, giving 2 clusters. Then we pick ONE of them to split, based on a splitting criterion, for instance SSE or simply cluster size. We continue splitting according to the splitting criterion until we reach k clusters.

The reason it is called bisecting k-means is that each split is itself performed with k-means (with k = 2), which makes sure the split is good. For instance, the first cluster consists of all points; we run k-means on it, find 2 centroids, and get a good split.

NB: bisecting k-means takes locally optimal steps, so the final global result does not necessarily have the best SSE. Therefore, it is common to use the final centroids from bisecting k-means as the initial centroids for regular k-means.

Note how bisecting k-means also produces a type of hierarchical clustering.
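A rough sketch of the procedure (my own illustration, reusing the kmeans() sketch from the earlier card and using largest SSE as the splitting criterion):

    import numpy as np

    def bisecting_kmeans(X, k):
        clusters = [X]                        # start with one cluster containing all points
        while len(clusters) < k:
            # Splitting criterion: pick the cluster with the largest SSE.
            sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
            target = clusters.pop(int(np.argmax(sses)))
            # The split itself is an ordinary 2-means run, so the split is good.
            _, labels = kmeans(target, 2)
            clusters.extend([target[labels == 0], target[labels == 1]])
        return clusters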

24
Q

major drawback of k-means?

A

It tends not to work well on non-spherical shapes, or on clusters with different densities or different sizes.

For instance, if one natural cluster is very large compared to the others, k-means will struggle.

Why does it struggle with such cases?
It is because of the objective function: minimizing SSE will always favor globular clusters of roughly equal size.

25
Q

Agglomerative vs divisive

A

Agglomerative starts with singleton clusters and merges them iteratively to form larger and larger clusters.

Divisive begins at the top (one all-inclusive cluster) and performs splitting.

26
Q

how do we visualize HAC?

A

Dendrograms

27
Q

Elaborate on the HAC algorithm

A

Starts with singleton clusters.

Initialize the proximity matrix.

MERGE the two clusters that are closest together.

Update the proximity matrix.

Repeat until only one cluster remains.

HAC requires a notion of proximity between clusters, since this is the basis of the proximity matrix that drives the algorithm.
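An example sketch using SciPy's hierarchical clustering (the toy data and the choice of single linkage are my own):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    X = np.random.rand(20, 2)                         # toy data
    Z = linkage(X, method="single")                   # "single" = MIN; "complete", "average", "ward" also exist
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the hierarchy into 3 flat clusters
    # dendrogram(Z) would draw the tree (requires matplotlib).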

28
Q

Elaborate on proximity and the proximity matrix in regards to HAC

A

The proximity measure is usually what differs between the various HAC methods.

For instance, MIN and MAX come from the graph-based view of clustering.

MIN (single link) defines cluster proximity as the shortest distance between any point in cluster A and any point in cluster B.

MAX (complete link) defines cluster proximity as the distance between the two farthest points in different clusters.

We also have GROUP AVERAGE, which measures proximity as the average of all pairwise proximities between the two clusters.
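In symbols (standard definitions written in LaTeX; dist is the point-to-point proximity):

    \operatorname{prox}_{\mathrm{MIN}}(A, B) = \min_{x \in A,\ y \in B} \operatorname{dist}(x, y)

    \operatorname{prox}_{\mathrm{MAX}}(A, B) = \max_{x \in A,\ y \in B} \operatorname{dist}(x, y)

    \operatorname{prox}_{\mathrm{GA}}(A, B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} \operatorname{dist}(x, y)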

29
Q

What is Wards method?

A

Assumes clusters are represented by centroids. Considers proximity between two clusters as the increase in SSE that would result from merging the two clusters.

30
Q

Elaborate on single linkage vs complete linkage in terms of abilities and results

A

Single linkage usually yields stringy and elongated clusters; it excels at capturing weird, complex shapes.

Complete linkage usually yields compact clusters of similar size, which tend to be more spherical. It places a greater emphasis on keeping clusters compact, and it avoids chaining, which is something we typically see with single linkage.

31
Q

Why does k-means suck compared to HAC?

A

k-means struggles because it tries to minimize the SSE objective function. Because of SSE, it struggles with varying cluster sizes, densities, etc. Just imagine what happens if we have one extremely large natural cluster and several small ones: k-means will split the large one into multiple clusters, because that will most likely reduce SSE. Major drawback.

Complete linkage, on the other hand, handles those aspects well.

Unfortunately, HAC is very slow and requires a lot of space, so it is not good for large data sets; it does not scale well.

32
Q

What is DBSCAN based on?

A

A center-based approach to defining density.

33
Q

Define density, as defined by center based methods

A

We look at some point and draw a circle (more generally, a hypersphere) around it with some radius. The number of points that lie inside the circle, INCLUDING the point itself, is regarded as the density of that point.

34
Q

What is the radius in center based density methods called?

A

EPS

35
Q

What happens if the EPS is very large?

A

We risk a case where every point has the entire data set as its dense region.

36
Q

What different points do we have in density and center based methods?

A

Core points.

Border points.

Noise points.

37
Q

elaborate on core points

A

The point about core points is that they lie well within the dense interior of a cluster. To capture this, we say that a point is a core point if its number of “points within EPS” is greater than or equal to some specified level, called MinPts. So |points within EPS| >= MinPts is the requirement for core points.

38
Q

Elaborate on border points

A

Border points are not core points, but they are located so that at least one core point has this point inside its dense region (within EPS).
In a sense, the core point connects to the border point.

39
Q

Elaborate on noise points

A

all points that are neither core nor border

40
Q

give the DBSCAN procedure

A

We first label all points into categories of core, border and noise.
Then we remove all noise.
Then we put an edge between all core points that are within EPS of each other.
Each group of connected core points constitutes a cluster.
Assign each border point to one of the clusters of its associated core points.
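An example sketch using scikit-learn's DBSCAN (toy data; the parameter values are arbitrary placeholders, not recommendations):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(200, 2)
    db = DBSCAN(eps=0.1, min_samples=5).fit(X)     # eps = EPS, min_samples = MinPts
    labels = db.labels_                            # cluster labels; -1 marks noise points
    core_mask = np.zeros(len(X), dtype=bool)
    core_mask[db.core_sample_indices_] = True      # which points ended up as core points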

41
Q

How do we find suitable values for EPS and MinPts?

A

We compute the k-dist, which is the distance from a point to its k'th nearest neighbor. We compute this for all points and sort the values. In the sorted plot we will see a sudden sharp increase in the k-dist; this distance is what we want to use for EPS (and the chosen k serves as MinPts).
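A sketch of the k-dist plot using scikit-learn's NearestNeighbors (k = 4 is just an example value; matplotlib is used for the plot):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    X = np.random.rand(200, 2)
    k = 4
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    dists, _ = nn.kneighbors(X)
    k_dist = np.sort(dists[:, -1])                    # distance to the k'th nearest neighbor, sorted
    plt.plot(k_dist)                                  # the sharp "knee" suggests a value for EPS
    plt.show()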

42
Q

what is good about DBSCAN?

A

Due to being density based, it is very good at finding weird shapes.

43
Q

What is bad about DBSCAN?

A

It can struggle with clusters of different densities. This is because we have to choose single values for EPS and MinPts; if these values would need to be very different for the different clusters, we have a problem.

44
Q

Name the important issues regarding cluster evaluation

A

1) Determining whether there is a clustering tendency in the data. This is basically figuring out whether the data is just random, or whether non-random structure actually exists in it.

2) Determining the correct number of clusters.

3) Evaluating how well the clusters fit the data without referencing external information.

4) Comparing the cluster results to externally known results, such as externally provided class labels.

5) Comparing two sets of clusters to see which one is better.

1,2,3 are UNSUPERVISED.
4 is supervised.
5 can be both.

45
Q

name an issue regarding cluster evaluation that is not among the 5?

A

“Do we want to evaluate the entire clustering, or just one specific cluster?”

46
Q

There are 3 types of evaluation measures used in clustering analysis. What are they?

A

1) Supervised. Measures how the clustering structure compares to external structure. For instance, entropy can be used to check how well the clusters match externally provided labels.

2) Unsupervised. For instance SSE. Unsupervised measures usually divide into two groups: cluster cohesion and cluster separation. Cluster cohesion is about how tight a cluster is; cluster separation is about how well a cluster distances itself from the other clusters (isolation).
The benefit of unsupervised measures is that they do NOT require any information that is not already present in the data.

3) Relative. This is more a comparison of either supervised or unsupervised measures; it is not really a measure of its own.

47
Q

how can we, generally/abstractly, describe the overall cluster validity?

What is the benefit of this representation?

A

We use a weighted sum of the individual cluster validities:

Overall validity = ∑ wi × validity(Ci)  [i = 1, ..., k]

The benefit of this lies in the flexibility of the validity(Ci) function: validity can be cohesion or it can be separation, so the same framework fits both kinds of measures.

48
Q

In graph based views, what is cohesion and what is separation?

A

Cohesion is proximity between all points inside the cluster.

Separation is proximity between all points in one cluster, and all points in another cluster.

49
Q

In prototype based views, what is cohesion and what is separation?

A

Cohesion is proximity from all points to the centroid.

Separation is the single proximity between centroids.

note that proximity can be defined as SSE among others.

50
Q

Elaborate on SSB

A

SSB is defined as “between group sum of squares”.

SSB is the measure we use when we want to quantify how separated the different clusters are from each other. SSB is calculated by taking, for each cluster, the squared distance between its centroid and the overall mean of the entire data set, weighting each term (the weight is typically related to cluster size), and summing all of these together.
Based on this, a large value is ideal, as it indicates a higher degree of separation. So if we have, say, two clusterings, we could compare them on the basis of SSB to understand which one is more separated.

51
Q

What is the important TSS result?

A

Minimizing SSE is the same as maximizing SSB.

This follows from the fact that TSS = SSE + SSB, where TSS is a constant for a given data set: if we reduce SSE, we increase SSB, and vice versa.

TSS is the sum of squared distances from each point to the overall mean of the data.
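Written out (a standard identity in LaTeX; m is the overall mean, m_i the centroid of cluster C_i, and |C_i| its size):

    \mathrm{TSS} = \sum_{x} \lVert x - m \rVert^2
    = \underbrace{\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2}_{\mathrm{SSE}}
    + \underbrace{\sum_{i=1}^{k} |C_i|\, \lVert m_i - m \rVert^2}_{\mathrm{SSB}}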

52
Q

Elaborate on the silhouette coefficient

A

Method that combines cohesion and separation.

Procedure:
1) For the i'th object, calculate its average distance to all other points in its cluster. Call this average distance “ai”.
2) For the i'th object and any cluster not containing it, calculate the object's average distance to all the objects in that cluster. Do this for every cluster the object is not in, and take the minimum such value. Call this value “bi”.
3) For the i'th object, the silhouette coefficient is given as si = (bi - ai) / max(ai, bi)

The silhouette coefficient can vary between -1 and 1. A negative value is undesirable because it corresponds to the case where ai is greater than bi: the average distance inside the object's own cluster is actually greater than the minimum average distance to the points of another cluster.

We want ai to be as close to 0 as possible (relative to bi), which gives a silhouette score near 1.
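An example with scikit-learn on toy data (silhouette_samples gives the per-point coefficients described above; silhouette_score averages them over all points):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.random.rand(200, 2)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_samples(X, labels)[:5])   # s_i for the first five points
    print(silhouette_score(X, labels))         # average silhouette over all points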

53
Q

Silhouette coefficient is a single-point metric. how do we make it cluster representative?

A

Just use the average silhouette coefficient in a cluster.

We can also extend this to a silhouette coefficient for the entire clustering by taking the average over ALL points' silhouette coefficients.

54
Q

What is cophenetic distance?

A

proximity at which HAC puts two objects in the same cluster for the first time.

55
Q

TODO: Ward

A

Ward's method defines cluster proximity as the increase in SSE that results from merging the two clusters.

Because of this, Ward's method uses the same objective function as k-means.

56
Q

TODO: Group average

A
57
Q

What is the time complexity of hierarchical agglomerative clustering?

A

There are n^2 values in the proximity matrix, and we need to perform n steps.

Therefore, O(n^3).

With some clever manipulation, we can reduce this to O(n^2log(n))

58
Q

How do we find the similarity matrix?

A

For each entry in the proximity matrix, we perform the computation 1 - ([this entry's value in the proximity matrix] - [minimum distance in the entire proximity matrix]) / ([maximum distance in the proximity matrix] - [minimum distance in the proximity matrix]).

In other words:

entry_in_sim = 1 - (d - dmin)/(dmax - dmin)

The benefit of this is that we can map the values to, for instance, a color spectrum and visually inspect how similar the clusters are.

We use the full point-by-point matrix (with the points ordered by cluster) to get the full color picture.
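A minimal sketch of this visualization (toy data; SciPy's pdist/squareform build the proximity matrix and matplotlib shows it as an image):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist, squareform

    X = np.random.rand(100, 2)                      # assume points are ordered by cluster label
    D = squareform(pdist(X))                        # proximity (distance) matrix
    S = 1 - (D - D.min()) / (D.max() - D.min())     # entry_in_sim = 1 - (d - dmin)/(dmax - dmin)
    plt.imshow(S)                                   # well-separated clusters appear as bright blocks
    plt.show()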

59
Q

What is the cophenetic distance matrix?

A

A matrix where all entries are the cophenetic distances between each other.

Recall cophenetic distance is the proximity at which HAC puts the elements in the same cluster for the first time.

60
Q

Elaborate on the Hopkins statistic

A

An approach to seeing whether a data set has clusters, without actually clustering it. We have to somehow measure spatial randomness.

Hopkins statistic goes like this:

We generate p points that are randomly distributed across the data space, and we also sample p actual points from the data set.
For both sets of points, we find the distance to the nearest neighbor in the original data set.

Let ui be the nearest-neighbor distances of the artificially generated points.
Let wi be the nearest-neighbor distances of the actual sample points.

The Hopkins statistic is then given by:
H = ∑wi / (∑ui + ∑wi)

If the artificial points and the actual sample points have roughly the same nearest-neighbor distances, the Hopkins statistic will be approximately 0.5. Values close to 0 indicate a high level of clustering.

If the data is d-dimensional, each term is raised to the power of d.
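A sketch of the statistic as defined above (the helper name and the use of scikit-learn's NearestNeighbors are my own choices):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins(X, p=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # p artificial points, uniformly random inside the bounding box of the data.
        random_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, d))
        # p actual points sampled from the data set.
        sample_pts = X[rng.choice(n, size=p, replace=False)]

        nn = NearestNeighbors().fit(X)
        u = nn.kneighbors(random_pts, n_neighbors=1)[0].ravel()   # distance to nearest data point
        w = nn.kneighbors(sample_pts, n_neighbors=2)[0][:, 1]     # skip the zero distance to itself

        return w.sum() / (u.sum() + w.sum())    # ~0.5 = random, near 0 = highly clustered (card's convention)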

61
Q

The best centroid position when using SSE as objective function is…?

A

The mean of the points in a cluster

62
Q

regarding cluster validation, we abstractly consider the overall validity of a set of K clusters as

overall validity = ∑ wi × validity(Ci)  [i = 1, ..., K]

What is the validity function?

A

It can be many different things.

Common ones include “cohesion”, “separation” etc.

The weights depend typically on cluster size.

63
Q

In a graph based view of clustering, how are cohesion and separation defined?

A

Cohesion: the sum of the edge weights within a cluster. The weights are typically based on proximity/distance.

Separation: the sum of the weights on edges going from the points in one cluster to the points in another cluster.

64
Q

In a prototype based view of clustering, how are cohesion and separation defined?

A

Cohesion: Sum of proximities from each point to the centroid.

Separation: centroid-to-centroid distance; alternatively, the distance from each centroid to the overall mean of the data set. This is related to SSB.

65
Q

what is the danger of using SSE too much in cluster evaluation?

A

SSE always wants more clusters.
SSE only measures cluster cohesion, not separation.
If we use the silhouette coefficient, we take cluster separation into account as well.

Recall that the peak in the silhouette coefficient (across different numbers of clusters) is what we want.

66
Q

what is the daunting fact about clustering algorithms?

A

All algorithms will find clusters, regardless of whether there actually is any clustering tendency in the data set.

67
Q

How can we evaluate clustering tendencies in the data set without performing clustering?

A

Hopkins

68
Q

why do we do clustering?

A

primarily 2 reasons:
1) Understanding
2) Summarization

69
Q

TODO: How to select a number of clusters

A

If we can afford to run, for instance, k-means many times, we can use the error curve. Meaning, we run k-means first with 1 cluster, then with 2 clusters, then 3, etc., and for each step we save the SSE and add it to a plot. We will usually see a shape that looks like a bent elbow, and we want to select the number of clusters that gives a real decrease in SSE, but no more. This is the elbow.

If we cannot run k-means multiple times, we can use the k-nearest-neighbor distance (k-dist) approach instead.
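A sketch of the elbow curve using scikit-learn's KMeans (inertia_ is sklearn's name for the SSE; the data is a toy example):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.rand(300, 2)
    ks = range(1, 11)
    sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), sses, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("SSE")
    plt.show()                                  # pick k at the "elbow" of the curve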

70
Q

TODO: Distinction between internal and external cluster evaluation schemes

A
71
Q

TODO: Contiguity based cluster

A
72
Q

TODO: Well separated cluster

A
73
Q

How do we do pre processing in clustering?

A

Pre processing in clustering is basically normalization and removal of outliers.

74
Q

How do the algorithms perform with outliers?

A

k-means handles outliers poorly (they pull the centroids away).

75
Q

What are the strengths of HAC?

A

We do not have to assume anything about the number of clusters.

We can get any wanted number of clusters by cutting the dendrogram at the correct level.

HAC is great at making taxonomies, like the animal kingdom.

76
Q

Limitations of MAX?

A

Tends to break large clusters, and is biased towards globular shapes.

77
Q

How is cluster separation measured?

A

As SSB, “between cluster sum of squares”.

Formula: SSB = ∑ |Ci| (m - mi)^2  [i = 1, ..., k]

|Ci| is the size of cluster i.
m is the mean of the entire data set.
mi is the mean of cluster i.

78
Q

is correlation a good measure of cluster evaluation?

A

It can be. It is not well suited for density-based clusters.

79
Q

Elaborate on the general hopkins statistic

A

The general Hopkins statistic raises each distance to the power of d. This allows us to tweak it and give more or less weight to smaller vs. larger inter-cluster distances. If d is larger than 1, we are effectively saying that we allow the distances between points in a cluster to be larger; the opposite holds for d < 1.

80
Q

What is SSB

A

Between group sum of squares. It is calculated as a sum over all the clusters, where each term is the cluster size multiplied by the squared distance between the cluster centroid and the overall data set mean.
