L12 Flashcards by jolyn Unknown

What is Clustering?

A form of unsupervised learning aimed at grouping similar data points together and discovering structures in unlabeled data.

Often used for data exploration, preprocessing, and feature extraction.

What are the goals of Clustering?

Identify coherent groups in data, determine the number of groups, partition data for further analysis, and enable unsupervised feature extraction.

Popular methods include K-means, Hierarchical Clustering, Density-based methods (DBSCAN), and Mixture Models (e.g., Gaussian Mixtures).

What is the objective of K-Means Clustering?

Partition data into k clusters by minimizing within-cluster variance.

What are the steps of the K-Means algorithm?

Choose the number of clusters k
Randomly initialize k centroids
Assign each point to its nearest centroid
Recompute centroids as the mean of assigned points
Repeat until convergence.
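
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (squared_distance, mean_point, kmeans), the choice to initialize centroids by sampling data points, and the rule that an empty cluster keeps its old centroid are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random initialization (assumption: pick k distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up with different labels; on harder data the result depends on the initialization.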

What is the objective function of K-Means?

Minimize the sum of squared distances between each point and its assigned centroid.
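
Written as a formula, this objective is commonly expressed as follows. The symbols are notation assumed for this note, not taken from the cards: C_j is the set of points assigned to cluster j, and \mu_j is its centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The assignment step lowers J for fixed centroids, and the mean update lowers J for fixed assignments, which is why the iteration converges (possibly to a local minimum).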

What are the limitations of K-Means Clustering?

Assumes convex clusters, so it can’t capture complex shapes; requires k to be chosen in advance; not suitable for clusters with different densities or sizes.

What does Scikit-learn’s KMeans do by default?

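Initializes centroids with k-means++ and keeps the run with the lowest within-cluster sum of squares. Before version 1.4 the default was 10 restarts (n_init=10); from 1.4 on, the default n_init='auto' runs once with k-means++ initialization and 10 times with init='random'.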

What are the steps of MiniBatchKMeans?

Randomly draw a mini-batch
Assign points to nearest centroid
Update centroids incrementally
Repeat until convergence.
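
A rough sketch of the mini-batch loop is shown below. It is one common variant, not necessarily what any particular library does: the per-centroid step size 1/count, the fixed iteration count used in place of a convergence test, and the function name minibatch_kmeans are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def minibatch_kmeans(points, k, batch_size=4, n_iter=200, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far
    for _ in range(n_iter):  # assumption: fixed budget instead of a convergence test
        # Randomly draw a mini-batch.
        batch = rng.sample(points, batch_size)
        # Assign each batch point to its nearest centroid.
        assigned = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in batch
        ]
        # Update centroids incrementally with a shrinking step size.
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = [(1 - eta) * c + eta * x
                            for c, x in zip(centroids[j], p)]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = minibatch_kmeans(points, k=2)
```

Because each update is a weighted average of the old centroid and a data point, every centroid stays inside the bounding box of the data.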

How can K-means be used for feature extraction?

Use cluster membership as a new categorical feature and distance to centroids as a continuous feature.
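
As a small illustration, given centroids that were already fitted, both features can be derived per point. The function name kmeans_features and the example centroids are hypothetical.

```python
import math

def kmeans_features(point, centroids):
    """Return (cluster label, list of distances to every centroid)."""
    distances = [math.dist(point, c) for c in centroids]
    # Categorical feature: index of the nearest centroid.
    label = min(range(len(centroids)), key=distances.__getitem__)
    # Continuous features: one distance per centroid.
    return label, distances

# Hypothetical centroids, e.g. taken from an earlier K-means fit.
centroids = [(0.0, 0.0), (10.0, 10.0)]
label, distances = kmeans_features((3.0, 4.0), centroids)
```

Here the point (3, 4) is 5 units from the first centroid, so it gets label 0, and the two distances can be appended to its feature vector.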

What is Agglomerative Clustering?

A bottom-up clustering method where each point starts as its own cluster; at each step the two closest clusters are merged, until the desired number of clusters (or a single cluster) remains.

What is a dendrogram in Hierarchical Clustering?

A tree-like diagram that represents the hierarchy of clusters.

What are the linkage methods in Hierarchical Clustering?

Single linkage
Complete linkage
Average linkage
Ward’s method.
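
The first three linkage criteria differ only in how the distance between two clusters is computed from the pairwise point distances. The sketch below shows this inside a naive bottom-up merge loop; the function names are made up for the example, Ward's method (which merges the pair that least increases within-cluster variance) is left out for brevity, and the brute-force search is far too slow for large datasets.

```python
import math
from itertools import combinations

def linkage_distance(a, b, method):
    """Distance between clusters a and b under the given linkage."""
    d = [math.dist(p, q) for p in a for q in b]
    if method == "single":    # closest pair of points
        return min(d)
    if method == "complete":  # farthest pair of points
        return max(d)
    if method == "average":   # mean over all pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {method}")

def agglomerative(points, n_clusters, method="single"):
    # Every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        # Merge cluster j into cluster i (i < j, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
single = agglomerative(points, 2, method="single")
complete = agglomerative(points, 2, method="complete")
```

Recording the order and distance of each merge is what a dendrogram visualizes.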

What are the pros and cons of Hierarchical Clustering?

Pros: Reveals hierarchical structure, doesn’t require predefined number of clusters. Cons: Can be slow for large datasets, some linkage methods may cause imbalanced clusters.

What are the key concepts of DBSCAN?

Core point: has ≥ min_samples points within radius ε
Border point: has fewer than min_samples points within ε, but lies within ε of a core point
Noise point: neither a core point nor a border point.

What is the DBSCAN algorithm?

Pick a core point
Expand cluster by visiting neighbors within ε
Recursively include new core points
Stop when no more neighbors qualify.
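
The procedure above can be sketched as follows. This is a brute-force illustration under stated assumptions: a point counts itself among its own ε-neighbors, noise is labeled -1, and the names region_query and dbscan are chosen for the example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares become clusters 0 and 1, and the isolated point at (50, 50) is marked as noise.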

What are the advantages of DBSCAN?

Finds arbitrarily shaped clusters and can detect outliers.

What are the limitations of DBSCAN?

Requires tuning of ε and min_samples, struggles when clusters have very different densities, not ideal for high-dimensional data.

What does a Gaussian Mixture Model (GMM) assume?

Data is generated from a mixture of Gaussian distributions, with each point drawn from one latent (unobserved) component.

What is the Expectation-Maximization (EM) algorithm in GMM?

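E-step: compute each point's membership probabilities (responsibilities) under the current parameters
M-step: update the mixing weights, means, and covariances using those probabilities.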
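
To make the two steps concrete, here is a minimal EM loop for a one-dimensional mixture, where each covariance reduces to a scalar variance. The function names, the fixed number of iterations, the initial parameters, and the small variance floor of 1e-6 are assumptions for this sketch.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    # Work on copies so the caller's initial guesses are not modified.
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points in component j
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj
            variances[j] += 1e-6  # assumption: floor to avoid a collapsed component
    return means, variances, weights

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 10.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this toy data the two means settle near 0.5 and 9.5 with roughly equal weights. Unlike K-means, each point keeps a probability for every component rather than a single hard label.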

What are the advantages of GMM?

Models overlapping clusters, captures cluster shape, scale, and orientation.

What is the Elbow Plot used for in clustering evaluation?

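Plot the sum of squared distances (SSE) against the number of clusters k and look for the "elbow", the point after which adding more clusters yields only small reductions in SSE; that k is a reasonable choice.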
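
The values behind such a plot can be computed by fitting K-means for a range of k and recording the SSE each time. The sketch below uses a compact K-means written for this example (the name sse_for_k, the random initialization from data points, and the toy data are assumptions) and leaves the actual plotting out.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse_for_k(points, k, seed=0, max_iter=100):
    """Run a basic K-means and return the final sum of squared distances."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
sse_by_k = {k: sse_for_k(points, k) for k in range(1, 5)}
```

For this data the SSE drops sharply from k=1 to k=2 and changes little afterwards, so the elbow sits at k=2.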

What is the Silhouette Coefficient (S)?

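A measure of how similar a point is to its own cluster compared to other clusters, calculated as S = (b - a) / max(a, b), where a is the mean distance from the point to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. S ranges from -1 to 1.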
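
A direct, unoptimized computation of S for a single point might look like this. The function name is hypothetical, and returning 0 for a point that is alone in its cluster is a common convention adopted here as an assumption.

```python
import math

def silhouette_of_point(i, points, labels):
    own = labels[i]
    # a: mean distance to the other points in the same cluster.
    same = [math.dist(points[i], q)
            for j, q in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumption: singleton clusters score 0
    a = sum(same) / len(same)
    # b: smallest mean distance to the points of any other cluster.
    b = min(
        sum(math.dist(points[i], q) for q, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s0 = silhouette_of_point(0, points, labels)
```

For the first point, a = 1 and b = 10.5, so S = 9.5 / 10.5, which is about 0.905.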

What do Silhouette Coefficient values near 1 indicate?

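Well-clustered points: close to their own cluster and far from the nearest other cluster. Values near 0 indicate points on a cluster boundary, and negative values suggest a point may be in the wrong cluster.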

What are the strengths and limitations of K-means?

Strengths: Simple, efficient. Limitations: Requires k, assumes spherical clusters.

What are the strengths and limitations of Hierarchical Clustering?

Strengths: Shows full structure, no need for k. Limitations: Slow, can form imbalanced clusters.

What are the strengths and limitations of DBSCAN?

Strengths: Finds arbitrary shapes, detects outliers. Limitations: Sensitive to density, poor in high-dimensional data.

What are the strengths and limitations of GMM?

Strengths: Soft clustering, handles overlap. Limitations: Complex, needs good initialization.

L12 Flashcards

(27 cards)