Lecture 11 - K-Means and DBSCAN Flashcards
(14 cards)
What is Clustering?
Clustering has no single definition; in most cases it is defined by context, as different algorithms produce different kinds of clusters. For example:
REFER TO SLIDES
What is K-means
Divide the points into k groups (a chosen number of clusters), where each group is made up of points that are close to that group's centroid.
How Does K-means Work?
1️⃣ Define (manually) a suitable number (k) of clusters. (you need to know your data)
2️⃣ Start by placing these k centroids randomly (e.g., by randomly picking k instances from the dataset and using them as the centroids).
3️⃣ Assign each data instance to the closest cluster centroid (using a distance measure).
4️⃣ Update each cluster centroid based on the data instances that have been assigned to it. (the centre may have shifted, so the centroid needs updating)
5️⃣ Go to Step 3 and repeat until the cluster centroids stop moving.
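Below is a minimal scikit-learn sketch of these steps on made-up blob data; the value of k, the toy data and random_state are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

k = 4  # Step 1: choose k manually
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # Steps 2-5 run internally until the centroids stop moving

print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.inertia_)           # sum of squared distances to the closest centroid
```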
What is the Voronoi Diagram and how is it used?
Used to plot the model's decision boundaries.
On the diagram, each boundary between two neighbouring clusters is the perpendicular bisector of the line joining their centroids, so every point on a boundary is equidistant from the two centroids it separates.
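A rough sketch of how you might plot this, continuing from the toy data X, labels and fitted kmeans model in the sketch above (matplotlib assumed): predict the closest centroid over a grid and shade each Voronoi cell.

```python
import numpy as np
import matplotlib.pyplot as plt

# Build a grid covering the data, then colour each grid point by its predicted cluster.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                             # shaded Voronoi cells
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)                  # the data points
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x")   # the centroids
plt.show()
```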
What is Hard clustering and Soft clustering?
Hard Clustering: Each instance (points) is assigned to one cluster (and one only)
Soft Clustering: Each instance (point) is given a score for each cluster (e.g., a distance to each centroid or a probability of membership); although it gets a score for every cluster, it is still most associated with the cluster that has the best score.
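A small sketch of the difference using the fitted kmeans model from above (the new points are hypothetical): predict() gives hard assignments, while transform() gives one score per cluster, here the distance to each centroid.

```python
import numpy as np

X_new = np.array([[0.0, 2.0], [3.0, 2.0]])   # two hypothetical new points

hard = kmeans.predict(X_new)      # hard clustering: one cluster index per point
soft = kmeans.transform(X_new)    # soft scores: shape (n_points, k), distance to every centroid

print(hard)
print(soft)
```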
What are the 3 drawbacks of K-means?
Sensitive to the initialisation of cluster centroids
An optimal number of clusters needs to be known
K-Means does not behave well when the clusters have varying sizes, densities or nonspherical shapes
K-means Drawback: Sensitive to the initialisation of cluster centroids
While the algorithm always converges, it may not converge to the optimal solution.
The ways to improve it:
- If you know approximately where the cluster centroids should be, supply them via the init parameter (e.g., pass an array of centroid guesses) and run the algorithm once
- Run the algorithm multiple times with different initialisations and keep the best solution. This is controlled with n_init (scikit-learn does this by default, set to 10)
- Using K-Means++
K-Means++ tends to pick initial centroids that are far away from each other, which improves the initialisation step:
- Pick the first centroid at random from the data points.
- For each remaining data point:
  - Measure how far it is from the nearest centroid already chosen.
  - The farther a point is, the more chance it has to be picked as the next centroid.
- Keep repeating the previous step until all k centroids are chosen.
- Then start the normal k-means algorithm.
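A sketch of the three remedies with scikit-learn's KMeans; the centroid guesses are made up, and X is the toy data from the first sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# 1. Supply approximate centroids via the init parameter and run once.
guessed_centroids = np.array([[-3.0, 3.0], [0.0, 0.0], [3.0, -3.0], [3.0, 3.0]])
km_manual = KMeans(n_clusters=4, init=guessed_centroids, n_init=1).fit(X)

# 2. Run several random initialisations and keep the best (lowest inertia).
km_multi = KMeans(n_clusters=4, init="random", n_init=10).fit(X)

# 3. Use K-Means++ seeding (scikit-learn's default init).
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10).fit(X)
```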
K-means Drawback: An optimal number of clusters need to be known
An optimal number of clusters needs to be known. In general we don't know this number, and we don't want to use inertia on its own to determine k, since it is not a reliable measure for choosing k (look at the clusterings for k = 3 and k = 8).
The ways to improve:
- As k increases, inertia decreases, so inertia alone cannot pick k; but we can plot inertia against k and look at the elbow of the curve, where the elbow suggests a good value of k (see the sketch below)
- The silhouette coefficient, which tells you how well each point fits into its cluster compared to other clusters
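A sketch of the elbow approach on the toy data X from the first sketch (matplotlib assumed): fit K-Means for a range of k values and plot the inertia.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()   # look for the "elbow" where the curve stops dropping sharply
```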
How does the Silhouette Coefficient work?
Tells you how well each point fits into its cluster compared to other clusters.
Uses the formula (b − a) / max(a, b)
where:
a = the mean distance from the point to the other points in its own cluster (mean intra-cluster distance).
b = the mean distance from the point to the points in the nearest other cluster (mean nearest-cluster distance).
===
Intuition:
If a point is much closer to its own cluster than to any other → silhouette score is close to 1 (good clustering).
If a point is equally close to its own cluster and another → score is close to 0 (uncertain).
If a point is closer to another cluster → score is negative (bad clustering).
===
How its used:
Compute the silhouette score for each point.
Take the average over all points → this is your silhouette score for that k.
Try different values of k and choose the one with the highest average silhouette score.
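A sketch of that procedure with scikit-learn's silhouette_score, again on the assumed toy data X from the first sketch.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 9):   # the silhouette score needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels_k))   # pick the k with the highest average score
```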
K-means Drawback: K-Means does not behave well when the clusters have varying sizes, densities or nonspherical shapes
It does not behave well with varying sizes, densities and non-spherical shapes
The ways to improve it:
Use a weighted sum of Gaussians (a Gaussian mixture model)
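A sketch of the remedy using scikit-learn's GaussianMixture on the assumed toy data X; a mixture of Gaussians can model elliptical clusters with different sizes and densities.

```python
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=4, n_init=10, random_state=42).fit(X)
hard_labels = gm.predict(X)        # hard assignments
soft_probs = gm.predict_proba(X)   # soft assignments: probability per cluster
```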
How is Clustering used for Semi-Supervised Learning?
Used to compensate for having too few labelled instances: for example, cluster the unlabelled data, manually label one representative instance per cluster, then propagate that label to the rest of the cluster.
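A minimal sketch of that recipe under assumptions (not necessarily the slides' exact method): cluster the data with K-Means, label the instance closest to each centroid, and propagate the labels within each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# X is the (unlabelled) toy data from the first sketch
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
dist_to_centroids = kmeans.fit_transform(X)                  # distance of every point to every centroid
representative_idx = np.argmin(dist_to_centroids, axis=0)    # the point closest to each centroid
X_representative = X[representative_idx]                     # the k instances a human would label

# Hypothetical labels provided by a human for just those k representative instances.
representative_labels = np.array([0, 1, 2, 3])

# Propagate each representative's label to every point in the same cluster.
y_propagated = representative_labels[kmeans.labels_]
```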
What is DBSCAN?
The density-based spatial clustering of applications with noise (DBSCAN) algorithm defines clusters as continuous regions of high density. The algorithm involves two parameters: epsilon and min_samples.
How does DBSCAN work?
- Look around each point:
  - Check how many other points are within a small radius (ϵ) of it. This is its “ϵ-neighborhood.”
- Core point check:
  - If a point has at least min_samples neighbors in its ϵ-neighborhood → it’s a core point (meaning it’s in a dense area).
- Forming clusters:
  - Group all core points that are close to each other (directly or through a chain of other core points).
  - Also include points around core points (even if they aren’t core themselves) — these are called border points.
- Find outliers:
  - If a point isn’t a core point and also doesn’t fall near any core point — it’s considered noise or an anomaly.
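A minimal DBSCAN sketch with scikit-learn on toy two-moons data; the eps and min_samples values are illustrative, not prescribed by the slides.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
print(dbscan.labels_[:10])               # cluster index per point; -1 marks noise/anomalies
print(len(dbscan.core_sample_indices_))  # how many core points were found
```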
What are the drawbacks of DBSCAN?
If the density varies significantly across the clusters, or if there’s no sufficiently low-density region around some clusters, DBSCAN can struggle to capture all the clusters properly.
The computational complexity of DBSCAN is roughly O(m^2n). The quadratic term m^2 means that the algorithm does not scale well to large datasets.