22. Clustering Flashcards
(18 cards)
What is clustering?
Clustering is the process of grouping a set of examples into classes of similar objects. It is the most common form of unsupervised learning.
What are clusters?
Clusters are groups of examples with similar features. A good cluster has high intra-cluster similarity and low inter-cluster similarity.
What are the key steps in clustering?
- Determine a good representation of clusters.
- Establish a notion of similarity/distance between samples.
- Decide how many clusters to form.
What is a Gaussian Mixture Model (GMM)?
A GMM is a probabilistic model representing data as a mixture of several Gaussian distributions with unknown parameters.
Why are mixture models useful in clustering?
Mixture models such as GMMs can represent complex, multimodal data more faithfully than a single distribution by combining several simpler component distributions.
What is the Expectation-Maximization (EM) algorithm?
The EM algorithm is an iterative method to find maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
What are the steps of the EM algorithm?
- Expectation step (E-step): Compute the expected log-likelihood over the latent variables, given the current parameter estimates.
- Maximization step (M-step): Maximize this expected log-likelihood to update parameters.
What is the difference between Hard EM and Soft EM?
Hard EM assigns each data point to one cluster deterministically, while Soft EM assigns probabilities of belonging to each cluster.
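As a tiny illustrative sketch (the responsibility matrix below is invented), soft EM keeps per-point cluster probabilities, while hard EM commits each point to its most probable cluster:

```python
import numpy as np

# Invented soft-EM responsibilities for two points and two clusters.
z = np.array([[0.9, 0.1],
              [0.4, 0.6]])

hard_labels = z.argmax(axis=1)  # Hard EM: commit each point to one cluster -> [0, 1]
```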
What are the properties of a good distance measure?
A good distance measure must be:
1. Positive: d(x, y) ≥ 0
2. Symmetrical: d(x, y) = d(y, x)
3. Identity: d(x, y) = 0 iff x = y
4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
What are some examples of distance measures?
Examples include:
1. Manhattan distance: d(X, Y) = Σ|xᵢ - yᵢ|
2. Euclidean distance: d(X, Y) = √Σ(xᵢ - yᵢ)²
3. Chebyshev distance: d(X, Y) = max|xᵢ - yᵢ|
4. Cosine distance: d(X, Y) = 1 - (X · Y) / (‖X‖‖Y‖)
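A minimal NumPy sketch of these four distances on two arbitrary example vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # Σ|xᵢ - yᵢ|
euclidean = np.sqrt(np.sum((x - y) ** 2))  # √Σ(xᵢ - yᵢ)²
chebyshev = np.max(np.abs(x - y))          # max|xᵢ - yᵢ|
cosine = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```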
What is the difference between distance and similarity measures?
A distance measure quantifies how far apart two objects are, while a similarity measure quantifies how alike they are. Similarity typically decreases as distance grows; for example, cosine similarity = 1 - cosine distance.
What is the formula for Gaussian Mixture Models (GMMs)?
P(x | θ) = Σₖ wₖ N(x | μₖ, Σₖ), where 0 ≤ wₖ ≤ 1, Σₖ wₖ = 1, and θ = {μₖ, Σₖ, wₖ for k = 1…C}.
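A short sketch evaluating this mixture density with SciPy; the two components and their weights below are purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in 2-D; the weights sum to 1.
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def gmm_density(x):
    """P(x | θ) = Σₖ wₖ N(x | μₖ, Σₖ)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([1.0, 1.0])))
```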
What is the Kullback-Leibler (KL) divergence?
The KL divergence measures how one probability distribution differs from another:
KL(X || Y) = Σᵢ X(i) log(X(i) / Y(i)). It appears in the EM algorithm when deriving and tightening the lower bound on the log-likelihood.
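A minimal sketch of the discrete KL divergence between two probability vectors (the example distributions are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = Σᵢ pᵢ log(pᵢ / qᵢ); terms with pᵢ = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # >= 0; equals 0 only when p == q
```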
How does K-means relate to the EM algorithm?
K-means is a special case of the Hard EM algorithm, where each data point is assigned to the closest cluster centroid (E-step), and centroids are updated to the mean of their assigned points (M-step).
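A compact NumPy sketch of this alternation, assuming no cluster ever becomes empty (the data and k below are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialise from data points
    for _ in range(n_iter):
        # E-step: hard-assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans(X, k=2)
```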
What are the termination conditions for the EM algorithm?
Termination conditions include:
1. A fixed number of iterations.
2. Assigned labels do not change.
3. Centroid positions do not change.
What are the limitations of K-means clustering?
Limitations include:
1. The number of clusters must be known.
2. It may converge to a local optimum rather than the global one.
3. Convergence depends on initialization.
4. It may overfit the observed data.
What is the role of the E-step in the EM algorithm for GMMs?
In the E-step, the algorithm computes the responsibility of each cluster for each data point:
zᵢₖ = (wₖ N(oᵢ | μₖ, Σₖ)) / Σⱼ wⱼ N(oᵢ | μⱼ, Σⱼ)
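A short NumPy/SciPy sketch of this computation (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Return the n × C responsibility matrix zᵢₖ for observations X."""
    # Unnormalised terms wₖ N(oᵢ | μₖ, Σₖ) for every point and component.
    dens = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    return dens / dens.sum(axis=1, keepdims=True)  # normalise over components
```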
What is the role of the M-step in the EM algorithm for GMMs?
In the M-step, the algorithm updates the model parameters (weights, means, and covariances) to maximize the likelihood given the current responsibilities.
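Under the standard GMM updates (Nₖ = Σᵢ zᵢₖ, wₖ = Nₖ / n, μₖ = Σᵢ zᵢₖ oᵢ / Nₖ, Σₖ = Σᵢ zᵢₖ (oᵢ − μₖ)(oᵢ − μₖ)ᵀ / Nₖ), a matching sketch of the M-step is:

```python
import numpy as np

def m_step(X, z):
    """Update (weights, means, covariances) from the n × C responsibilities z."""
    n = X.shape[0]
    Nk = z.sum(axis=0)                                  # effective count per component
    weights = Nk / n                                    # wₖ = Nₖ / n
    means = (z.T @ X) / Nk[:, None]                     # μₖ = Σᵢ zᵢₖ oᵢ / Nₖ
    covs = []
    for k in range(z.shape[1]):
        d = X - means[k]
        covs.append((z[:, k, None] * d).T @ d / Nk[k])  # Σₖ = Σᵢ zᵢₖ d dᵀ / Nₖ
    return weights, means, covs
```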