L12 Flashcards
(27 cards)
What is Clustering?
A form of unsupervised learning aimed at grouping similar data points together and discovering structures in unlabeled data.
Often used for data exploration, preprocessing, and feature extraction.
What are the goals of Clustering?
Identify coherent groups in data, determine the number of groups, partition data for further analysis, enable unsupervised feature extraction.
Popular methods include K-means, Hierarchical Clustering, Density-based methods (DBSCAN), and Mixture Models (e.g., Gaussian Mixtures).
What is the objective of K-Means Clustering?
Partition data into k clusters by minimizing within-cluster variance.
What are the steps of the K-Means algorithm?
- Choose the number of clusters k
- Randomly initialize k centroids
- Assign each point to its nearest centroid
- Recompute centroids as the mean of assigned points
- Repeat the assignment and update steps until the assignments stop changing.
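The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `n_iter` parameters, and the toy data are assumptions made for the example.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Randomly initialize k centroids by sampling k distinct data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data (assumed for illustration): two well-separated blobs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two blobs end up in separate clusters; on harder data the result depends on the random initialization.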
What is the objective function of K-Means?
Minimize the sum of squared distances between each point and its assigned centroid.
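Written out, the objective might look like the following. The symbols are assumptions introduced here: C_j is the set of points assigned to cluster j and \mu_j is its centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```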
What are the limitations of K-Means Clustering?
Assumes convex clusters, so it can’t capture complex shapes; requires k to be specified in advance; and performs poorly when clusters differ in density or size.
What does Scikit-learn’s KMeans do by default?
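In versions before 1.4, performs 10 restarts (n_init=10) and keeps the best run. Since 1.4 the default is n_init='auto': one run with the default k-means++ initialization, or 10 runs when init='random'.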
What are the steps of MiniBatchKMeans?
- Randomly draw a mini-batch
- Assign points to nearest centroid
- Update centroids incrementally
- Repeat for a fixed number of iterations or until the centroids stabilize.
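A rough sketch of the mini-batch steps above, in plain Python. It is not scikit-learn's MiniBatchKMeans: the function name, the per-centroid step size 1/count, and the fixed iteration count (used here instead of a convergence test) are assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def minibatch_kmeans(points, k, batch_size=4, n_iter=50, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far
    for _ in range(n_iter):
        # Randomly draw a mini-batch.
        batch = rng.sample(points, batch_size)
        # Assign every batch point to its nearest centroid before updating.
        nearest = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                   for p in batch]
        # Incremental update: move the centroid toward the point with a
        # step size of 1/count, so it tracks the running mean of its points.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = [(1 - eta) * c + eta * x
                            for c, x in zip(centroids[j], p)]
    return centroids

# Toy data (assumed for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = minibatch_kmeans(points, k=2)
```

Each update touches only the points in the current batch, which is what makes this variant cheaper per iteration than full K-means on large datasets.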
How can K-means be used for feature extraction?
Use cluster membership as a new categorical feature and distance to centroids as a continuous feature.
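As a small illustration, both features can be derived from a set of fitted centroids. The helper name `kmeans_features` and the centroid values are hypothetical; in scikit-learn, `KMeans.predict` and `KMeans.transform` provide the corresponding outputs.

```python
import math

def kmeans_features(point, centroids):
    """Return (cluster label, list of distances to every centroid)."""
    dists = [math.dist(point, c) for c in centroids]
    label = dists.index(min(dists))  # categorical feature: nearest centroid
    return label, dists              # continuous features: all distances

# Hypothetical centroids, standing in for the result of a fitted K-means model.
centroids = [(0.0, 0.0), (3.0, 4.0)]
label, dists = kmeans_features((3.0, 0.0), centroids)
# label is 0 and dists is [3.0, 4.0]
```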
What is Agglomerative Clustering?
A bottom-up clustering method where each point starts as its own cluster and, at each step, the two closest clusters are merged until one cluster (or the desired number of clusters) remains.
What is a dendrogram in Hierarchical Clustering?
A tree-like diagram that represents the hierarchy of clusters.
What are the linkage methods in Hierarchical Clustering?
- Single linkage
- Complete linkage
- Average linkage
- Ward’s method.
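The four linkage criteria differ only in how they score a candidate merge of two clusters. The sketch below assumes Euclidean distance and uses made-up function names and toy clusters; Ward's method is shown as the increase in within-cluster sum of squares caused by the merge.

```python
import math
from itertools import product

def single_linkage(A, B):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def sse(C):
    """Sum of squared distances from each point in C to the mean of C."""
    mean = [sum(col) / len(C) for col in zip(*C)]
    return sum(math.dist(p, mean) ** 2 for p in C)

def ward_increase(A, B):
    """Ward's criterion: how much merging A and B increases the total SSE."""
    return sse(A + B) - sse(A) - sse(B)

# Toy clusters on a line (assumed for illustration).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
scores = (single_linkage(A, B), complete_linkage(A, B),
          average_linkage(A, B), ward_increase(A, B))
# scores is approximately (2.0, 5.0, 3.5, 12.25)
```

At each step, agglomerative clustering merges the pair of clusters with the smallest score under the chosen criterion.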
What are the pros and cons of Hierarchical Clustering?
Pros: Reveals hierarchical structure, doesn’t require a predefined number of clusters. Cons: Can be slow for large datasets, and some linkage methods (notably single linkage) can produce imbalanced clusters.
What are the key concepts of DBSCAN?
- Core point: has ≥ min_samples points within radius ε
- Border point: has fewer than min_samples points within ε, but lies within ε of a core point
- Noise point: neither core nor border.
What is the DBSCAN algorithm?
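- Pick an unvisited core point and start a new cluster
- Expand the cluster by visiting its neighbors within ε
- Recursively expand from any neighbor that is itself a core point
- Stop when no more neighbors qualify, then repeat from another unvisited core point.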
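The steps above can be sketched as a brute-force implementation. The function name and toy data are assumptions for the example, the neighbor search is O(n²) and only suitable for small inputs, and each point is counted in its own neighborhood (as scikit-learn's `min_samples` does).

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:  # j is also core: keep expanding
                queue.extend(j_nbrs)
    return labels

# Toy data (assumed for illustration): two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
labels = dbscan(points, eps=1.5, min_samples=3)
# labels is [0, 0, 0, 0, 1, 1, 1, -1]
```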
What are the advantages of DBSCAN?
Finds arbitrarily shaped clusters and can detect outliers.
What are the limitations of DBSCAN?
Requires tuning of ε and min_samples, struggles when clusters have different densities, and is not ideal for high-dimensional data.
What does a Gaussian Mixture Model (GMM) assume?
Data is generated from a mixture of Gaussian distributions, with each point drawn from one latent (unobserved) component.
What is the Expectation-Maximization (EM) algorithm in GMM?
- E-step: compute each point's membership probability (responsibility) for every component
- M-step: update the mixing weights, means, and covariances using those probabilities.
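To make the two steps concrete, here is a minimal one-dimensional sketch, where each covariance reduces to a single variance. The function names, the initial parameters, the variance floor of 1e-6, and the toy data are all assumptions made for the example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility resp[i][j] = P(component j | x_i).
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from resp.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points in j
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return means, variances, weights

# Toy data (assumed for illustration): two groups centered near 1 and 5.
xs = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights = em_gmm_1d(
    xs, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

Unlike K-means, every point contributes to every component's update, weighted by its responsibility, which is what allows clusters to overlap.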
What are the advantages of GMM?
Models overlapping clusters, captures cluster shape, scale, and orientation.
What is the Elbow Plot used for in clustering evaluation?
Plot the sum of squared distances (SSE) against the number of clusters k and pick the k at the "elbow", where adding more clusters stops reducing SSE substantially.
What is the Silhouette Coefficient (S)?
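A measure of how similar a point is to its own cluster compared to other clusters, calculated as S = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.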
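A small worked example of the formula for a single point, assuming Euclidean distance and at least two clusters. The function name and toy data are made up for illustration, and returning 0 for a point alone in its cluster is a common convention rather than part of the formula.

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of point i; requires at least two clusters."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # point is alone in its cluster (common convention)
    a = sum(same) / len(same)  # mean distance within its own cluster
    other_means = []
    for c in set(labels) - {own}:
        d = [math.dist(points[i], p)
             for p, lab in zip(points, labels) if lab == c]
        other_means.append(sum(d) / len(d))
    b = min(other_means)  # mean distance to the nearest other cluster
    return (b - a) / max(a, b)

# Toy 1-D data (assumed for illustration).
points = [(0.0,), (1.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 1]
s0 = silhouette(0, points, labels)  # a = 1.0, b = 4.5, so S = 3.5 / 4.5
```

Here S is about 0.78 for the first point, reflecting that it sits much closer to its own cluster than to the other one.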
What do Silhouette Coefficient values near 1 indicate?
Well-clustered points.
What are the strengths and limitations of K-means?
Strengths: Simple, efficient. Limitations: Requires k, assumes spherical clusters.