22. Clustering Flashcards
(18 cards)
What is clustering?
Clustering is the process of grouping a set of examples into classes of similar objects. It is the most common form of unsupervised learning.
What are clusters?
Clusters are groups of examples with similar features. A good cluster has high intra-cluster similarity and low inter-cluster similarity.
What are the key steps in clustering?
- Determine a good representation of clusters.
- Establish a notion of similarity/distance between samples.
- Decide how many clusters to form.
What is a Gaussian Mixture Model (GMM)?
A GMM is a probabilistic model representing data as a mixture of several Gaussian distributions with unknown parameters.
Why are mixture models useful in clustering?
Mixture models such as GMMs can represent complex, multimodal data more faithfully than a single distribution by combining several simpler component distributions.
What is the Expectation-Maximization (EM) algorithm?
The EM algorithm is an iterative method to find maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
What are the steps of the EM algorithm?
- Expectation step (E-step): Compute the expected log-likelihood over the latent variables, given the current parameter estimates.
- Maximization step (M-step): Maximize this expected log-likelihood to update parameters.
What is the difference between Hard EM and Soft EM?
Hard EM assigns each data point to one cluster deterministically, while Soft EM assigns probabilities of belonging to each cluster.
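As a tiny illustrative sketch (the responsibility matrix below is invented), soft EM keeps per-point cluster probabilities, while hard EM commits each point to its most probable cluster:

```python
import numpy as np

# Invented soft-EM responsibilities for two points and two clusters.
z = np.array([[0.9, 0.1],
              [0.4, 0.6]])

hard_labels = z.argmax(axis=1)  # Hard EM: commit each point to one cluster -> [0, 1]
```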
What are the properties of a good distance measure?
A good distance measure must be:
1. Positive: d(x, y) ≥ 0
2. Symmetrical: d(x, y) = d(y, x)
3. Identity: d(x, y) = 0 iff x = y
4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
What are some examples of distance measures?
Examples include:
1. Manhattan distance: d(X, Y) = Σ|xᵢ - yᵢ|
2. Euclidean distance: d(X, Y) = √Σ(xᵢ - yᵢ)²
3. Chebyshev distance: d(X, Y) = max|xᵢ - yᵢ|
4. Cosine distance: d(X, Y) = 1 - (X · Y) / (‖X‖‖Y‖)
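A minimal NumPy sketch of these four distances on two arbitrary example vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # Σ|xᵢ - yᵢ|
euclidean = np.sqrt(np.sum((x - y) ** 2))  # √Σ(xᵢ - yᵢ)²
chebyshev = np.max(np.abs(x - y))          # max|xᵢ - yᵢ|
cosine = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```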
What is the difference between distance and similarity measures?
A distance measure quantifies how far apart two objects are, while a similarity measure quantifies how alike they are. Similarity typically decreases as distance grows; for example, cosine similarity = 1 - cosine distance.
What is the formula for Gaussian Mixture Models (GMMs)?
P(x | θ) = Σₖ wₖ N(x | μₖ, Σₖ), where 0 ≤ wₖ ≤ 1, Σₖ wₖ = 1, and θ = {μₖ, Σₖ, wₖ for k = 1…C}.
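A short sketch evaluating this mixture density with SciPy; the two components and their weights below are purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in 2-D; the weights sum to 1.
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def gmm_density(x):
    """P(x | θ) = Σₖ wₖ N(x | μₖ, Σₖ)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([1.0, 1.0])))
```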
What is the Kullback-Leibler (KL) divergence?
The KL divergence measures how one probability distribution differs from another:
KL(X || Y) = Σᵢ X(i) log(X(i) / Y(i)). It appears in the EM algorithm when deriving and tightening the lower bound on the log-likelihood.
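A minimal sketch of the discrete KL divergence between two probability vectors (the example distributions are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = Σᵢ pᵢ log(pᵢ / qᵢ); terms with pᵢ = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # >= 0; equals 0 only when p == q
```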
How does K-means relate to the EM algorithm?
K-means is a special case of the Hard EM algorithm, where each data point is assigned to the closest cluster centroid (E-step), and centroids are updated to the mean of their assigned points (M-step).
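A compact NumPy sketch of this alternation, assuming no cluster ever becomes empty (the data and k below are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initialise from data points
    for _ in range(n_iter):
        # E-step: hard-assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans(X, k=2)
```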
What are the termination conditions for the EM algorithm?
Termination conditions include:
1. A fixed number of iterations.
2. Assigned labels do not change.
3. Centroid positions do not change.
What are the limitations of K-means clustering?
Limitations include:
1. The number of clusters must be known.
2. It may converge to a local optimum rather than the global one.
3. Convergence depends on initialization.
4. It may overfit the observed data.
What is the role of the E-step in the EM algorithm for GMMs?
In the E-step, the algorithm computes the responsibility of each cluster for each data point:
zᵢₖ = (wₖ N(oᵢ | μₖ, Σₖ)) / Σⱼ wⱼ N(oᵢ | μⱼ, Σⱼ)
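A short NumPy/SciPy sketch of this computation (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Return the n × C responsibility matrix zᵢₖ for observations X."""
    # Unnormalised terms wₖ N(oᵢ | μₖ, Σₖ) for every point and component.
    dens = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    return dens / dens.sum(axis=1, keepdims=True)  # normalise over components
```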
What is the role of the M-step in the EM algorithm for GMMs?
In the M-step, the algorithm updates the model parameters (weights, means, and covariances) to maximize the likelihood given the current responsibilities.
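Under the standard GMM updates (Nₖ = Σᵢ zᵢₖ, wₖ = Nₖ / n, μₖ = Σᵢ zᵢₖ oᵢ / Nₖ, Σₖ = Σᵢ zᵢₖ (oᵢ − μₖ)(oᵢ − μₖ)ᵀ / Nₖ), a matching sketch of the M-step is:

```python
import numpy as np

def m_step(X, z):
    """Update (weights, means, covariances) from the n × C responsibilities z."""
    n = X.shape[0]
    Nk = z.sum(axis=0)                                  # effective count per component
    weights = Nk / n                                    # wₖ = Nₖ / n
    means = (z.T @ X) / Nk[:, None]                     # μₖ = Σᵢ zᵢₖ oᵢ / Nₖ
    covs = []
    for k in range(z.shape[1]):
        d = X - means[k]
        covs.append((z[:, k, None] * d).T @ d / Nk[k])  # Σₖ = Σᵢ zᵢₖ d dᵀ / Nₖ
    return weights, means, covs
```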