K means clustering Flashcards
(27 cards)
What type of learning is K-Means?
Unsupervised learning.
What is the goal of clustering?
To group similar data points based on structure or distance.
What does K in K-Means refer to?
The number of clusters you want to divide your data into.
What is the main assumption of K-Means about clusters?
That clusters are roughly spherical and separable by distance.
What does each cluster in K-Means have?
A centroid, representing the mean of points in that cluster.
How does K-Means assign points to clusters?
By assigning each point to the nearest centroid.
What does K-Means aim to minimize?
The total within-cluster sum of squares (WCSS).
What is the formula for the K-Means objective?
min Σₖ Σᵢ∈Cₖ ‖xᵢ - μₖ‖²
What is a centroid in K-Means?
The mean of all points assigned to a cluster.
How is Euclidean distance computed between two 2D points?
√((x₁ - y₁)² + (x₂ - y₂)²)
What are the steps of the K-Means algorithm?
Initialize centroids, assign points, update centroids, repeat.
When does K-Means stop iterating?
When point assignments no longer change or a max iteration limit is reached.
What shape do K-Means decision boundaries form?
Voronoi cells based on centroid proximity.
What is the complexity of K-Means per iteration?
O(NKD), where N = data points, K = clusters, D = dimensions.
Why can K-Means produce different results on different runs?
Because it uses random initialization of centroids.
What is K-Means++?
A smarter initialization method that spreads out initial centroids.
What problem does K-Means++ solve?
Poor convergence due to bad initial centroid placement.
What is the elbow method used for in K-Means?
To choose the optimal number of clusters based on cost drop-off.
What is the silhouette score used for?
To evaluate how well a point fits in its cluster versus the next closest one.
What does a high silhouette score indicate?
That the point is well-clustered and far from other clusters.
What does a silhouette score near zero mean?
The point lies on the boundary between two clusters.
What does a negative silhouette score indicate?
The point is likely in the wrong cluster.
Why is exhaustive clustering infeasible?
Because the number of possible clusterings grows combinatorially with data size.
What are Stirling numbers used for in this context?
To count the number of ways to partition data into non-empty clusters.