L12 Flashcards

(27 cards)

1
Q

What is Clustering?

A

A form of unsupervised learning aimed at grouping similar data points together and discovering structures in unlabeled data.

Often used for data exploration, preprocessing, and feature extraction.

2
Q

What are the goals of Clustering?

A

Identify coherent groups in data, determine the number of groups, partition data for further analysis, enable unsupervised feature extraction.

Popular methods include K-means, Hierarchical Clustering, Density-based methods (DBSCAN), and Mixture Models (e.g., Gaussian Mixtures).

3
Q

What is the objective of K-Means Clustering?

A

Partition data into k clusters by minimizing within-cluster variance.

4
Q

What are the steps of the K-Means algorithm?

A
  1. Choose the number of clusters k
  2. Randomly initialize k centroids
  3. Assign each point to its nearest centroid
  4. Recompute centroids as the mean of assigned points
  5. Repeat until convergence.
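The steps above can be sketched with scikit-learn's KMeans; the data and parameter values here are illustrative, not taken from the cards:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of points (made-up example data)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

# Steps 1-2: choose k and initialize centroids; steps 3-5 run inside fit()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.labels_               # step 3: nearest-centroid assignments
centroids = km.cluster_centers_   # step 4: means of the assigned points
```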
5
Q

What is the objective function of K-Means?

A

Minimize the sum of squared distances between each point and its assigned centroid.

6
Q

What are the limitations of K-Means Clustering?

A

Assumes convex clusters, can’t capture complex shapes, requires k to be known, not suitable for clusters with different densities or sizes.

7
Q

What does Scikit-learn’s KMeans do by default?

A

Restarts K-means with different random centroid initializations (n_init; historically 10 by default, 'auto' in recent scikit-learn versions) and keeps the run with the lowest inertia (SSE).

8
Q

What are the steps of MiniBatchKMeans?

A
  1. Randomly draw a mini-batch
  2. Assign points to nearest centroid
  3. Update centroids incrementally
  4. Repeat until convergence.
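A minimal sketch of the steps above using scikit-learn's MiniBatchKMeans; the data, batch_size, and other parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])

# batch_size controls the mini-batch drawn at each step (step 1);
# assignment and incremental centroid updates (steps 2-4) run inside fit()
mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=3,
                      random_state=0).fit(X)
labels = mbk.labels_
```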
9
Q

How can K-means be used for feature extraction?

A

Use cluster membership as a new categorical feature and distance to centroids as a continuous feature.
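A sketch of this idea with scikit-learn (the dataset is made up for illustration): predict() yields the categorical membership feature, and transform() yields the distances to each centroid as continuous features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

cluster_id = km.predict(X)   # categorical feature: cluster membership
dists = km.transform(X)      # continuous features: distance to each centroid

# Append both as new columns to the original features
X_aug = np.hstack([X, dists, cluster_id[:, None]])
```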

10
Q

What is Agglomerative Clustering?

A

A bottom-up clustering method: each point starts as its own cluster, and the two closest clusters are merged iteratively until a stopping criterion (e.g., a target number of clusters) is met.

11
Q

What is a dendrogram in Hierarchical Clustering?

A

A tree-like diagram that represents the hierarchy of clusters.
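The merge hierarchy behind a dendrogram can be built with SciPy; this sketch uses illustrative data, Ward linkage, and no_plot=True to get the tree structure without drawing it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Z encodes the bottom-up merge tree: one row per merge, n-1 rows total
Z = linkage(X, method="ward")

# dendrogram() normally draws the tree; no_plot=True returns its structure
tree = dendrogram(Z, no_plot=True)
```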

12
Q

What are the linkage methods in Hierarchical Clustering?

A
  1. Single linkage: minimum distance between clusters
  2. Complete linkage: maximum distance between clusters
  3. Average linkage: mean pairwise distance between clusters
  4. Ward’s method: merge that minimizes the increase in within-cluster variance.
13
Q

What are the pros and cons of Hierarchical Clustering?

A

Pros: Reveals hierarchical structure, doesn’t require predefined number of clusters. Cons: Can be slow for large datasets, some linkage methods may cause imbalanced clusters.

14
Q

What are the key concepts of DBSCAN?

A
  1. Core point: has ≥ min_samples points within radius ε
  2. Border point: has fewer than min_samples neighbors, but lies within ε of a core point
  3. Noise point: neither core nor border.
15
Q

What is the DBSCAN algorithm?

A
  1. Pick a core point
  2. Expand cluster by visiting neighbors within ε
  3. Recursively include new core points
  4. Stop when no more neighbors qualify.
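The algorithm above as run by scikit-learn's DBSCAN, where eps plays the role of ε; data and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away point that should be labeled noise (-1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[100.0, 100.0]]])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_   # cluster ids; -1 marks noise points
```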
16
Q

What are the advantages of DBSCAN?

A

Finds arbitrarily shaped clusters and can detect outliers.

17
Q

What are the limitations of DBSCAN?

A

Requires tuning of ε, sensitive to density changes, not ideal for high-dimensional data.

18
Q

What does a Gaussian Mixture Model (GMM) assume?

A

Data is generated from a mixture of Gaussian distributions; each point is drawn from one latent (hidden) component.

19
Q

What is the Expectation-Maximization (EM) algorithm in GMM?

A
  1. E-step: compute each point’s membership probabilities (responsibilities) under the current parameters
  2. M-step: update mixture weights, means, and covariances using those responsibilities
  3. Repeat until the log-likelihood converges.
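A minimal sketch with scikit-learn's GaussianMixture, whose fit() runs EM internally; the data here is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# fit() alternates E-steps and M-steps until the log-likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

resp = gmm.predict_proba(X)   # soft memberships from the final E-step
hard = gmm.predict(X)         # most probable component per point
```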
20
Q

What are the advantages of GMM?

A

Models overlapping clusters, captures cluster shape, scale, and orientation.

21
Q

What is the Elbow Plot used for in clustering evaluation?

A

Plot sum of squared distances (SSE) vs. number of clusters to identify the optimal k.
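A sketch of computing the SSE curve for an elbow plot with scikit-learn, where inertia_ is the SSE that K-means minimizes; the data is illustrative and the actual plotting step is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal(12, 1, (50, 2))])

# SSE for each candidate k; plotting sse vs. k would show the
# "elbow" near the true number of clusters (3 here)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)]
```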

22
Q

What is the Silhouette Coefficient (S)?

A

A measure of how similar a point is to its own cluster compared to the nearest other cluster, calculated as S = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to points in the nearest neighboring cluster.
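A sketch computing the mean silhouette coefficient over a dataset with scikit-learn (the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(6, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean of S = (b - a) / max(a, b) over all points
score = silhouette_score(X, labels)
```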

23
Q

What do Silhouette Coefficient values near 1 indicate?

A

Well-clustered points.

24
Q

What are the strengths and limitations of K-means?

A

Strengths: Simple, efficient. Limitations: Requires k, assumes spherical clusters.

25
Q

What are the strengths and limitations of Hierarchical Clustering?

A

Strengths: Shows full structure, no need for k. Limitations: Slow, can form imbalanced clusters.

26
Q

What are the strengths and limitations of DBSCAN?

A

Strengths: Finds arbitrary shapes, detects outliers. Limitations: Sensitive to density, poor in high-D.

27
Q

What are the strengths and limitations of GMM?

A

Strengths: Soft clustering, handles overlap. Limitations: Complex, needs good initialization.