Clustering final Flashcards
What is clustering all about?
Clustering is about dividing data into groups that are meaningful or useful or both.
If “meaningful” groups are the goal, we want the clustering to capture the natural structure of the data.
however, if we look at clustering as “useful”, one can imagine creating clusters whose prototypes are essentially used for summarization and aggregation purposes.
What can we say about a cluster?
A cluster will contain data instances that all share certain similarities; its members are in many ways alike.
Elaborate on clustering for prototypes
Clustering for prototypes refers to clustering with the goal of creating k prototypes that aggregate the entire information. Instead of working with literally millions of data points, we can create k clusters and obtain k prototype points that represent the entire collection of data points.
What is the goal of clustering?
the goal can be described as identifying groups of data so that data in a group is very similar to the other data in the same group, and at the same time is very different from data in other groups/clusters.
Are clusters easily definable?
No. It is not always easy to understand what is and what is not a cluster. There are many different patterns that can be difficult to interpret.
another (informal) word for clustering
unsupervised classification
What is a clustering?
A clustering is the entire collection of clusters, i.e. the whole scheme of multiple (if k>1) clusters rather than a single cluster.
Elaborate on hierarchical vs partitional clusterings
Nested vs unnested.
A partitional clustering is a clustering where we divide data into non-overlapping subsets.
A hierarchical clustering allows nested (overlapping) clusters and works at multiple levels. In a hierarchical clustering, each node (which is a cluster), except for the leaf nodes, is the union of its children clusters.
Hierarchical clusterings are typically pictured as trees. It is also not uncommon for the leaf nodes to be singleton clusters.
Elaborate on exclusive vs overlapping vs fuzzy clusterings
Exclusive: Assign one cluster to each data point, and only one cluster.
Overlapping: Allows data points to be assigned to several clusters. Ex: a person can be a student AND a worker.
Fuzzy: Every data object belongs to every cluster, but with a membership weight that expresses the “strength” of the association. The weight is between 0 and 1: 1 means complete inclusion, 0 means no membership at all.
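To make the fuzzy idea concrete, here is a tiny numpy sketch (toy points and centroids; inverse-squared-distance weights are just one possible choice, similar in spirit to fuzzy c-means): each point gets a weight per cluster, and the weights for a single point sum to 1.

```python
import numpy as np

def fuzzy_weights(points, centroids):
    # every point belongs to every cluster with a weight in [0, 1];
    # here the weights are inverse-squared-distance, normalized so each row sums to 1
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    inv = 1.0 / (d ** 2 + 1e-12)   # small constant avoids division by zero
    return inv / inv.sum(axis=1, keepdims=True)

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(fuzzy_weights(points, centroids))   # each row sums to 1; 1 = full membership, 0 = none
```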
Elaborate on complete vs partial clustering
A complete clustering assigns every object to some cluster, while a partial clustering does not. The motivation behind partial clusterings is that some objects just don't really fit into any of the possible clusters, and if we force them in, we ruin the cluster's identity.
there are different types of clusters. Name them
Well-separated. The distance between two points in different groups is larger than the distance between any two points both in the same group.
Well-separated clusters do not need to have a globular shape.
Prototype based. For prototype based clusterings, each data point in a cluster is in the cluster because it is closer to the prototype of this specific cluster than it is to any other prototype.
the prototype is often a centroid, especially for continuous attributes.
In other cases, e.g. with categorical data, the prototype can be a medoid (see the small centroid-vs-medoid sketch after this list of cluster types).
Prototype based clusterings are also called center-based.
Graph-based. This one is all about how cluster membership is identified. Graph-based methods are very flexible in that they allow every member of a cluster to take part directly in determining whether a new point should be accepted or not. This contrasts with the prototype-based type, where a single point (the prototype/centroid) makes the decision. Because of this, graph-based methods can create non-globular shapes very well. With min link, they can create stringy and elongated shapes.
Graph-based methods create clusters as connected components.
Density based. We consider a cluster to be a region that is dense in data points compared to the region outside of it.
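To illustrate the prototype-based type above, here is a minimal numpy sketch (toy cluster; helper names are just illustrative) of the difference between a centroid (the mean, usually not an actual data point) and a medoid (an actual data point minimizing total distance to the others):

```python
import numpy as np

def centroid(points):
    # the mean of the cluster; usually not one of the actual data points
    return points.mean(axis=0)

def medoid(points):
    # the actual data point with the smallest total distance to all other points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [10.0, 10.0]])
print(centroid(cluster))   # pulled toward the outlier
print(medoid(cluster))     # stays on an actual data point
```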
there are 3 algorithms of interest, which ones?
K-means
AHC: Agglomerative Hierarchical Clustering
DBSCAN
Elaborate on k-means procedure
We first choose a number of clusters, k.
We spread out k centroids.
Each point in the entire data set is assigned to its closest centroid.
We update each centroid based on its entire cluster.
We keep the centroids, and re-assign all data points, essentially giving them a chance to change cluster membership to a more suitable one.
Repeat centroid update.
Repeat until convergence.
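A minimal numpy sketch of the steps above (assumptions for illustration: Euclidean distance, initial centroids sampled from the data, toy two-blob data):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # spread out k centroids
    for _ in range(n_iter):
        # assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```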
What is the centroid in k-means
The mean of a cluster
in k-means, when does “most happen”?
In the early stages.
because of this, we typically only repeat the k-means steps until the change between iterations falls below a certain threshold.
what do we “need” to run k-means?
A proximity measure. For instance the euclidean distance.
Cosine similarity can be more useful for documents.
Others include jaccard distance, manhattan distance.
WE ALSO NEED AN OBJECTIVE FUNCTION. For instance, “minimize the sum of squared distances”.
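Quick sketches of the proximity measures mentioned above (toy vectors; note that cosine is a similarity, not a distance):

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 0.0])

euclidean = np.linalg.norm(x - y)                               # straight-line distance
manhattan = np.abs(x - y).sum()                                 # sum of absolute differences
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))    # angle-based similarity (useful for documents)

# Jaccard distance on binary (set-like) data: 1 - |intersection| / |union|
a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
jaccard_dist = 1 - (a & b).sum() / (a | b).sum()
```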
Elaborate on SSE
Sum of squared error.
We use it like this:
We consider the “error” to be the distance between a data point and its closest prototype. Therefore, the goal is naturally to minimize the total sum of squared errors. Meaning, we sum together the squared error for all of our data points, and use this value as a measure that we want to minimize. The smaller this SSE is, the better our clustering is.
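In symbols, SSE is the sum over all clusters i, and over all points x in cluster C_i, of dist(c_i, x)^2, where c_i is the prototype of C_i. A minimal numpy sketch (assuming Euclidean distance):

```python
import numpy as np

def sse(X, centroids):
    # for each point: squared distance to its closest centroid; then sum over all points
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (dists.min(axis=1) ** 2).sum()
```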
What is the problem with k-means
Difficult to find optimal starting locations for the centroids.
The fact is that when we initialize k-means with random locations for the centroids, we will get different results each time we run it.
one simple way of mitigating some of this problem is to run k-means multiple times and use the best clustering we find based on SSE. This is sort of a prayer method, and it has no guarantees.
What do we typically do to combat the problems of k-means
k-means++. this is a method for initializing k-means.
The procedure is guaranteed (in expectation) to find a clustering that is within a factor of O(log k) of optimal. This is really good.
k-means++ picks centroids incrementally until k centroids have been picked. At every step/pick, each point has a probability of being picked as a centroid that is proportional to the square of its distance to its closest already-chosen centroid.
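A minimal sketch of that initialization (the first centroid is picked uniformly at random; names and details are just illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: a uniformly random data point
    while len(centroids) < k:
        # squared distance from every point to its closest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                      # pick proportional to squared distance
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```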
What is the space complexity of k-means
O((m+k)n), where m is the number of data points, k is the number of clusters, and n is the number of attributes.
What is the time complexity of k-means?
Basically linear in terms of the size of the data set:
O(I × K × m × n).
I is the number of iterations needed for convergence.
Name some design issues with k-means
Potential for empty clusters. In such cases, all we really need to do is assign a replacement centroid.
Outliers. We really don't want outliers; typically we want to eliminate them in preprocessing. The point is that they distort the centroid.
However, certain tasks require keeping the outliers, for instance data compression.
Reducing SSE without increasing the number of clusters. We know that we can reduce SSE by increasing k. However, increasing k is not always desirable. Therefore, we need methods to decrease SSE while keeping k low. One common approach is to alternate between splitting and merging clusters (POST PROCESSING), guided by SSE.
Elaborate on bisecting k means
Bisecting k-means can be very good if we don't know how many clusters we want. We can run it, receiving more and more clusters, until the reduction in SSE becomes insignificant (elbow). Then we use these final cluster centroids as initial centroids in regular k-means.
NB: when we use bisecting k-means, we are taking locally optimal steps. The final global result does not necessarily have the best SSE. Therefore, it is common to use the final result of bisecting k-means as starting points for regular k-means.
Basically, we start out with a single cluster (all points belong to the same cluster) and then split it. Then we have 2 clusters. Now we want to split ONE OF THEM, chosen by a splitting criterion, for instance SSE or simply cluster size.
we continue splitting according to the splitting criterion until we reach k clusters.
it is typical to use bisecting k-means to find centroids that we can use as initial centroids in regular k-means.
The reason it is called bisecting K-MEANS is that each time we split, we use k-means (with k=2). This makes sure that the split is good.
for instance, the first cluster consists of all points. We use k-means to split it, find 2 centroids, and get a good split.
Note how bisecting k-means also produces a type of hierarchical clustering.
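A rough sketch of the procedure (here the splitting criterion is "largest SSE", and scikit-learn's KMeans is used for the 2-way split; both choices are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]                                   # start with one cluster holding all points
    while len(clusters) < k:
        # splitting criterion: pick the cluster with the largest SSE
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        km = KMeans(n_clusters=2, n_init=10).fit(target)   # use k-means for the split itself
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    # these centroids can then seed a run of regular k-means
    return [c.mean(axis=0) for c in clusters]

X = np.vstack([np.random.randn(100, 2) + off for off in ([0, 0], [6, 0], [0, 6])])
initial_centroids = bisecting_kmeans(X, k=3)
```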
major drawback of k-means?
It tends not to work well on non-spherical shapes, or on clusters of different densities or different sizes.
For instance, if one natural cluster is very large compared to the others, k-means will struggle.
why is it struggling with such cases?
It is because of the objective function: minimizing SSE will always favor globular clusters of roughly equal size.
Agglomerative vs divisive
Agglomerative starts with each point as a singleton cluster and merges clusters iteratively to form larger and larger clusters.
Divisive begins at the top (one all-inclusive cluster) and performs splitting.
how do we visualize HAC?
Dendrograms
Elaborate on the HAC algorithm
Starts with singleton clusters.
Initialize the proximity matrix.
MERGE the two clusters that are closest together.
Update the proximity matrix.
Repeat until only one cluster remains.
HAC requires a notion of proximity between clusters. Obviously, because this is the basis of the proximity matrix that the algorithm uses.
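A naive sketch of that loop, using MIN (single link) as the cluster proximity and stopping at k clusters instead of 1 (names and details are just illustrative; real implementations update the proximity matrix incrementally instead of rescanning all pairs):

```python
import numpy as np

def ahc_single_link(X, k):
    clusters = [[i] for i in range(len(X))]                        # start with singleton clusters
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # point-to-point proximity matrix
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # MIN: cluster proximity = shortest distance across the two clusters
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] = clusters[i] + clusters[j]                    # merge the two closest clusters
        del clusters[j]
    return clusters

X = np.random.randn(30, 2)
print(ahc_single_link(X, k=3))
```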
Elaborate on proximity and the proximity matrix in regards to HAC
The cluster proximity measure is usually the main thing that differs between the various HAC methods.
For instance, we typically use MIN and MAX from the graph based clustering view.
MIN defines cluster proximity as the shortest distance between any node in cluster A and any node in cluster B.
MAX defines cluster proximity as the distance between the two farthest points in different clusters.
We also have GROUP AVERAGE, which measures proximity as the average of all pairwise proximities between points in the two clusters.
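Written out as functions over two clusters' point sets (illustrative names, Euclidean distance assumed):

```python
import numpy as np

def pairwise(a, b):
    # all pairwise distances between points in cluster a and points in cluster b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def proximity_min(a, b):            # MIN / single link: closest pair across the clusters
    return pairwise(a, b).min()

def proximity_max(a, b):            # MAX / complete link: farthest pair across the clusters
    return pairwise(a, b).max()

def proximity_group_average(a, b):  # GROUP AVERAGE: mean of all pairwise distances
    return pairwise(a, b).mean()
```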
What is Ward's method?
Assumes clusters are represented by centroids. Considers proximity between two clusters as the increase in SSE that would result from merging the two clusters.
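As a sketch, that merge cost can be computed as the SSE of the merged cluster minus the SSE of the two clusters on their own (illustrative helper names):

```python
import numpy as np

def cluster_sse(points):
    # SSE of one cluster around its centroid
    return ((points - points.mean(axis=0)) ** 2).sum()

def ward_proximity(a, b):
    # increase in total SSE if clusters a and b were merged into one
    return cluster_sse(np.vstack([a, b])) - cluster_sse(a) - cluster_sse(b)
```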
Elaborate on single linkage vs complete linkage in terms of abilities and results
Single linkage usually yields stringy and elongated patterns. It excels at capturing complex, non-globular shapes.
Complete linkage usually yields compact and fairly similar clusters, which tend to be more spherical. It places a greater emphasis on keeping clusters compact and similar. Complete linkage avoids chaining, which is something we typically see with single linkage.
Why does k-means suck compared to HAC?
k-means sucks because it tries to minimize the SSE objective function. Because of SSE, it will struggle with varying cluster sizes, densities, etc. Just imagine what happens if we have one extremely large natural cluster and other small clusters: k-means will split the large one into multiple clusters, because this will most likely reduce SSE. Major drawback.
Complete linkage, on the other hand, handles those aspects much better.
Unfortunately, HAC is very slow and requires lots of space. It is not good for large data sets and does not scale very well.
What is DBSCAN based on?
A center-based approach to density: the density of a point is estimated by counting the number of points within a specified radius (Eps) of that point. DBSCAN itself is a density-based clustering method built on this estimate.
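A tiny sketch of that center-based density estimate and the resulting core points (Eps and MinPts are DBSCAN's parameters; the rest of the algorithm is not shown here):

```python
import numpy as np

def center_based_density(X, eps):
    # density at a point = number of points within radius eps of it (itself included)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (D <= eps).sum(axis=1)

def core_points(X, eps, min_pts):
    # DBSCAN core points: points whose eps-neighborhood contains at least min_pts points
    return center_based_density(X, eps) >= min_pts
```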