Clustering final Flashcards
What is clustering all about?
Clustering is about dividing data into groups that are meaningful or useful or both.
If “meaningful” groups are the goal, we want the clustering to capture the natural structure of the data.
however, if we look at clustering as “useful”, one can imagine creating clusters whose prototypes are essentially used for summarization and aggregation purposes.
What can we say about a cluster?
A cluster will contain data instances that all share certain similarities; its members are in many ways alike.
Elaborate on clustering for prototypes
Clustering for prototypes refers to clustering with the goal of creating k prototypes that aggregate the entire information. Instead of working with literally millions of data points, we can create k clusters and obtain k prototype points that represent the entire collection of data points.
What is the goal of clustering?
the goal can be described as identifying groups of data so that data in a group is very similar to the other data in the same group, and at the same time is very different from data in other groups/clusters.
Are clusters easily definable?
No. It is not always easy to understand what is and what is not a cluster. There are many different patterns that can be difficult to interpret.
another (informal) word for clustering
unsupervised classification
What is a clustering?
A clustering is the entire collection of clusters, i.e. the whole scheme of multiple (if k>1) clusters rather than a single cluster.
Elaborate on hierarchical vs partitional clusterings
Nested vs unnested.
A partitional clustering is a clustering where we divide data into non-overlapping subsets.
A hierarchical clustering allows nested (overlapping) clusters and works at multiple levels. In a hierarchical clustering, each node (which is a cluster), except for the leaf nodes, is the union of its children clusters.
Hierarchical clusterings are typically pictured as trees. It is also not uncommon for the leaf nodes to be singleton clusters.
Elaborate on exclusive vs overlapping vs fuzzy clusterings
Exclusive: Assign one cluster to each data point, and only one cluster.
Overlapping: Allows data points to be assigned to several clusters. Ex: a person can be a student AND a worker.
Fuzzy: Every data object belongs to every cluster, but with a membership weight that expresses the “strength” of the association. The weight is between 0 and 1: 1 means complete inclusion, 0 means no membership at all.
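To make the fuzzy idea concrete, here is a tiny numpy sketch (toy points and centroids; inverse-squared-distance weights are just one possible choice, similar in spirit to fuzzy c-means): each point gets a weight per cluster, and the weights for a single point sum to 1.

```python
import numpy as np

def fuzzy_weights(points, centroids):
    # every point belongs to every cluster with a weight in [0, 1];
    # here the weights are inverse-squared-distance, normalized so each row sums to 1
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    inv = 1.0 / (d ** 2 + 1e-12)   # small constant avoids division by zero
    return inv / inv.sum(axis=1, keepdims=True)

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(fuzzy_weights(points, centroids))   # each row sums to 1; 1 = full membership, 0 = none
```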
Elaborate on complete vs partial clustering
A complete clustering assigns every object to some cluster, while a partial clustering does not. The motivation behind partial clusterings is that some objects just don't really fit into any of the possible clusters, and if we force them in, we ruin the cluster's identity.
there are different types of clusters. Name them
Well-separated. The distance between two points in different groups is larger than the distance between any two points both in the same group.
Well-separated clusters do not need to have a globular shape.
Prototype based. For prototype based clusterings, each data point in a cluster is in the cluster because it is closer to the prototype of this specific cluster than it is to any other prototype.
the prototype is often a centroid, especially for continuous attributes.
In other cases, e.g. with categorical data, the prototype can be a medoid (see the small centroid-vs-medoid sketch after this list of cluster types).
Prototype based clusterings are also called center-based.
Graph-based. This one is all about how cluster membership is identified. Graph-based methods are very flexible in that they allow every member of a cluster to take part directly in determining whether a new point should be accepted or not. This contrasts with the prototype-based type, where a single point (the prototype/centroid) makes the decision. Because of this, graph-based methods can create non-globular shapes very well. With min link, they can create stringy and elongated shapes.
Graph-based methods create clusters as connected components.
Density based. We consider a cluster to be a region that is dense in data points compared to the region outside of it.
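To illustrate the prototype-based type above, here is a minimal numpy sketch (toy cluster; helper names are just illustrative) of the difference between a centroid (the mean, usually not an actual data point) and a medoid (an actual data point minimizing total distance to the others):

```python
import numpy as np

def centroid(points):
    # the mean of the cluster; usually not one of the actual data points
    return points.mean(axis=0)

def medoid(points):
    # the actual data point with the smallest total distance to all other points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [10.0, 10.0]])
print(centroid(cluster))   # pulled toward the outlier
print(medoid(cluster))     # stays on an actual data point
```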
there are 3 algorithms of interest, which ones?
K-means
AHC: Agglomerative Hierarchical Clustering
DBSCAN
Elaborate on k-means procedure
We first choose a number of clusters, k.
We spread out k centroids.
Each point in the entire data set is assigned to its closest centroid.
We update each centroid based on its entire cluster.
We keep the centroids, and re-assign all data points, essentially giving them a chance to change cluster membership to a more suitable one.
Repeat centroid update.
Repeat until convergence.
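A minimal numpy sketch of the steps above (assumptions for illustration: Euclidean distance, initial centroids sampled from the data, toy two-blob data):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # spread out k centroids
    for _ in range(n_iter):
        # assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```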
What is the centroid in k-means
The mean of a cluster
in k-means, when does “most happen”?
In the early stages.
because of this, we typically only repeat the k-means steps until the change between iterations falls below a certain threshold.
what do we “need” to run k-means?
A proximity measure. For instance the euclidean distance.
Cosine similarity can be more useful for documents.
Others include jaccard distance, manhattan distance.
WE ALSO NEED AN OBJECTIVE FUNCTION. For instance, “minimize the sum of squared distances”.
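Quick sketches of the proximity measures mentioned above (toy vectors; note that cosine is a similarity, not a distance):

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 1.0, 0.0])

euclidean = np.linalg.norm(x - y)                               # straight-line distance
manhattan = np.abs(x - y).sum()                                 # sum of absolute differences
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))    # angle-based similarity (useful for documents)

# Jaccard distance on binary (set-like) data: 1 - |intersection| / |union|
a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
jaccard_dist = 1 - (a & b).sum() / (a | b).sum()
```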
Elaborate on SSE
Sum of squared error.
We use it like this:
We consider the “error” to be the distance between a data point and its closest prototype. Therefore, the goal is naturally to minimize the total sum of squared errors. Meaning, we sum together the squared error for all of our data points, and use this value as a measure that we want to minimize. The smaller this SSE is, the better our clustering is.
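In symbols, SSE is the sum over all clusters i, and over all points x in cluster C_i, of dist(c_i, x)^2, where c_i is the prototype of C_i. A minimal numpy sketch (assuming Euclidean distance):

```python
import numpy as np

def sse(X, centroids):
    # for each point: squared distance to its closest centroid; then sum over all points
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (dists.min(axis=1) ** 2).sum()
```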
What is the problem with k-means
Difficult to find optimal starting locations for the centroids.
The fact is that when we initialize k-means with random locations for the centroids, we will get different results each time we run it.
one simple way of mitigating some of this problem is to run k-means multiple times and use the best clustering we find based on SSE. This is sort of a prayer method, and it has no guarantees.
What do we typically do to combat the problems of k-means
k-means++. this is a method for initializing k-means.
The procedure is guaranteed (in expectation) to find a clustering that is within a factor of O(log k) of optimal. This is really good.
k-means++ picks centroids incrementally until k centroids have been picked. At every step/pick, each point has a probability of being picked as a centroid that is proportional to the square of its distance to its closest already-chosen centroid.
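A minimal sketch of that initialization (the first centroid is picked uniformly at random; names and details are just illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: a uniformly random data point
    while len(centroids) < k:
        # squared distance from every point to its closest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                      # pick proportional to squared distance
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```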
What is the space complexity of k-means
O((m+k)n), where m is the number of data points, k is the number of clusters, and n is the number of attributes.
What is the time complexity of k-means?
Basically linear in terms of the size of the data set:
O(I × K × m × n).
I is the number of iterations needed for convergence.
Name some design issues with k-means
Potential for empty clusters. In such cases, all we really need to do is assign a replacement centroid.
Outliers. We really don't want outliers; typically we want to eliminate them in preprocessing. The point is that they distort the centroid.
However, certain tasks require keeping the outliers, for instance data compression.
Reducing SSE without increasing the number of clusters. We know that we can reduce SSE by increasing k. However, increasing k is not always desirable. Therefore, we need methods to decrease SSE while keeping k low. One common approach is to alternate between splitting and merging clusters (POST PROCESSING), guided by SSE.
Elaborate on bisecting k means
Bisecting k-means can be very good if we don't know how many clusters we want. We can run it, receiving more and more clusters, until the reduction in SSE becomes insignificant (elbow). Then we use these final cluster centroids as initial centroids in regular k-means.
NB: when we use bisecting k-means, we are taking locally optimal steps. The final global result does not necessarily have the best SSE. Therefore, it is common to use the final result of bisecting k-means as starting points for regular k-means.
Basically, we start out with a single cluster (all points belong to the same cluster) and then split it. Then we have 2 clusters. Now we want to split ONE OF THEM, chosen by a splitting criterion, for instance SSE or simply cluster size.
we continue splitting according to the splitting criterion until we reach k clusters.
it is typical to use bisecting k-means to find centroids that we can use as initial centroids in regular k-means.
The reason it is called bisecting K-MEANS is that each time we split, we use k-means (with k=2). This makes sure that the split is good.
for instance, the first cluster consists of all points. We use k-means to split it, find 2 centroids, and get a good split.
Note how bisecting k-means also produces a type of hierarchical clustering.
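A rough sketch of the procedure (here the splitting criterion is "largest SSE", and scikit-learn's KMeans is used for the 2-way split; both choices are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    clusters = [X]                                   # start with one cluster holding all points
    while len(clusters) < k:
        # splitting criterion: pick the cluster with the largest SSE
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        km = KMeans(n_clusters=2, n_init=10).fit(target)   # use k-means for the split itself
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    # these centroids can then seed a run of regular k-means
    return [c.mean(axis=0) for c in clusters]

X = np.vstack([np.random.randn(100, 2) + off for off in ([0, 0], [6, 0], [0, 6])])
initial_centroids = bisecting_kmeans(X, k=3)
```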
major drawback of k-means?
It tends not to work well on non-spherical shapes, or on clusters of different densities or different sizes.
For instance, if one natural cluster is very large compared to the others, k-means will struggle.
why is it struggling with such cases?
It is because of the objective function: minimizing SSE will always favor globular clusters of roughly equal size.
Agglomerative vs divisive
Agglomerative starts with each point as a singleton cluster and merges clusters iteratively to form larger and larger clusters.
Divisive begins at the top (one all-inclusive cluster) and performs splitting.
how do we visualize HAC?
Dendrograms
Elaborate on the HAC algorithm
Starts with singleton clusters.
Initialize the proximity matrix.
MERGE the two clusters that are closest together.
Update the proximity matrix.
Repeat until only one cluster remains.
HAC requires a notion of proximity between clusters. Obviously, because this is the basis of the proximity matrix that the algorithm uses.
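A naive sketch of that loop, using MIN (single link) as the cluster proximity and stopping at k clusters instead of 1 (names and details are just illustrative; real implementations update the proximity matrix incrementally instead of rescanning all pairs):

```python
import numpy as np

def ahc_single_link(X, k):
    clusters = [[i] for i in range(len(X))]                        # start with singleton clusters
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # point-to-point proximity matrix
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # MIN: cluster proximity = shortest distance across the two clusters
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] = clusters[i] + clusters[j]                    # merge the two closest clusters
        del clusters[j]
    return clusters

X = np.random.randn(30, 2)
print(ahc_single_link(X, k=3))
```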
Elaborate on proximity and the proximity matrix in regards to HAC
The cluster proximity measure is usually the main thing that differs between the various HAC methods.
For instance, we typically use MIN and MAX from the graph based clustering view.
MIN defines cluster proximity as the shortest distance between any node in cluster A and any node in cluster B.
MAX defines cluster proximity as the distance between the two farthest points in different clusters.
We also have GROUP AVERAGE, which measures proximity as the average of all pairwise proximities between points in the two clusters.
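Written out as functions over two clusters' point sets (illustrative names, Euclidean distance assumed):

```python
import numpy as np

def pairwise(a, b):
    # all pairwise distances between points in cluster a and points in cluster b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def proximity_min(a, b):            # MIN / single link: closest pair across the clusters
    return pairwise(a, b).min()

def proximity_max(a, b):            # MAX / complete link: farthest pair across the clusters
    return pairwise(a, b).max()

def proximity_group_average(a, b):  # GROUP AVERAGE: mean of all pairwise distances
    return pairwise(a, b).mean()
```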
What is Ward's method?
Assumes clusters are represented by centroids. Considers proximity between two clusters as the increase in SSE that would result from merging the two clusters.
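As a sketch, that merge cost can be computed as the SSE of the merged cluster minus the SSE of the two clusters on their own (illustrative helper names):

```python
import numpy as np

def cluster_sse(points):
    # SSE of one cluster around its centroid
    return ((points - points.mean(axis=0)) ** 2).sum()

def ward_proximity(a, b):
    # increase in total SSE if clusters a and b were merged into one
    return cluster_sse(np.vstack([a, b])) - cluster_sse(a) - cluster_sse(b)
```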
Elaborate on single linkage vs complete linkage in terms of abilities and results
Single linkage usually yields stringy and elongated patterns. It excels at capturing complex, non-globular shapes.
Complete linkage usually yields compact and fairly similar clusters, which tend to be more spherical. It places a greater emphasis on keeping clusters compact and similar. Complete linkage avoids chaining, which is something we typically see with single linkage.
Why does k-means suck compared to HAC?
k-means sucks because it tries to minimize the SSE objective function. Because of SSE, it will struggle with varying cluster sizes, densities, etc. Just imagine what happens if we have one extremely large natural cluster and other small clusters: k-means will split the large one into multiple clusters, because this will most likely reduce SSE. Major drawback.
Complete linkage, on the other hand, handles those aspects much better.
Unfortunately, HAC is very slow and requires lots of space. It is not good for large data sets and does not scale very well.
What is DBSCAN based on?
A center-based approach to density: the density of a point is estimated by counting the number of points within a specified radius (Eps) of that point. DBSCAN itself is a density-based clustering method built on this estimate.
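A tiny sketch of that center-based density estimate and the resulting core points (Eps and MinPts are DBSCAN's parameters; the rest of the algorithm is not shown here):

```python
import numpy as np

def center_based_density(X, eps):
    # density at a point = number of points within radius eps of it (itself included)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (D <= eps).sum(axis=1)

def core_points(X, eps, min_pts):
    # DBSCAN core points: points whose eps-neighborhood contains at least min_pts points
    return center_based_density(X, eps) >= min_pts
```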