Chapter 9 Flashcards
(26 cards)
What is the main limitation of supervised learning mentioned in the presentation?
It requires labeled data, which is often unavailable or expensive to obtain.
What is the goal of unsupervised learning?
To analyze data without labels and discover hidden structures like clusters or anomalies.
What are common applications of clustering?
Customer segmentation, data analysis, anomaly detection, semi-supervised learning, and image segmentation.
What is the K-Means clustering algorithm?
An algorithm that partitions data into k clusters by minimizing the sum of squared distances between each instance and its nearest cluster centroid.
What are the main steps in the K-Means algorithm?
Randomly place centroids, assign points to nearest centroid, compute new centroids, repeat until convergence.
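The four steps above can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance and initial centroids drawn at random from the data points; the name `k_means` and the toy points are hypothetical, not from the presentation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (illustrative only): two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
```

The result depends on the random starting centroids, which is why K-Means is usually run several times with different initializations.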
What does ‘inertia’ measure in K-Means?
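The sum of the squared distances between each instance and its nearest cluster centroid (this is what scikit-learn's inertia_ attribute reports); dividing by the number of instances gives the mean squared distance.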
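A small sketch of the computation, assuming Euclidean distance; the helper name `inertia` and the toy values are illustrative. It returns the sum of squared distances, and dividing by the number of points gives the mean.

```python
def inertia(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, centroid)) for centroid in centroids)
        for p in points
    )

# Toy example: two points sit 1 unit from the first centroid, one sits on the second.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centroids = [(1.0, 0.0), (10.0, 0.0)]
total = inertia(points, centroids)   # 1 + 1 + 0 = 2.0
mean = total / len(points)           # mean squared distance
```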
What is K-Means++ initialization?
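An initialization method that picks each new initial centroid at random with probability proportional to its squared distance from the closest centroid already chosen, so the initial centroids tend to be far apart and a poor solution becomes less likely.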
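A minimal sketch of the seeding step, assuming the standard K-Means++ rule: the first centroid is chosen uniformly at random, and each later one is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The function name and toy points are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    # First centroid: uniformly at random among the data points.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point = squared distance to its closest chosen centroid.
        # Points that are already centroids get weight 0 and cannot be re-picked.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
init = k_means_pp_init(points, k=3)
```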
What is the difference between hard and soft clustering?
Hard clustering assigns each instance to one cluster; soft clustering assigns a score or probability for each cluster.
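To make the distinction concrete, here is a short sketch for a single instance. It assumes the soft score is either the distance to each centroid or a Gaussian RBF affinity; the centroids, the instance, and the `gamma` value are made up for illustration.

```python
import math

centroids = [(0.0, 0.0), (4.0, 0.0)]
x = (1.0, 0.0)

# Soft clustering: one score per cluster (here, the distance to each centroid).
distances = [math.dist(x, c) for c in centroids]

# Hard clustering: keep only the index of the closest centroid.
hard_label = distances.index(min(distances))

# Alternative soft score: a similarity that decays with distance (RBF).
gamma = 0.5  # hypothetical bandwidth
affinities = [math.exp(-gamma * d ** 2) for d in distances]
```

The same distance vector can also serve as a feature vector for a downstream supervised model.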
How can you speed up K-Means on large datasets?
Use Mini-Batch K-Means, which updates centroids using small random subsets of the data.
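A rough sketch of the mini-batch idea, assuming the commonly described update in which each centroid moves toward a newly assigned point with a step size of 1 / (number of points that centroid has absorbed so far). The function name, batch size, and toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mini_batch_k_means(points, k, batch_size=3, n_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far
    for _ in range(n_iter):
        # Work on a small random subset instead of the full dataset.
        batch = rng.sample(points, batch_size)
        nearest = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                   for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            step = 1.0 / counts[j]  # per-centroid step size shrinks over time
            centroids[j] = [c + step * (x - c) for c, x in zip(centroids[j], p)]
    return centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids = mini_batch_k_means(points, k=2)
```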
What is the elbow method in K-Means?
A technique for choosing the number of clusters by plotting inertia against k and picking the k at the "elbow" of the curve, where adding more clusters stops reducing inertia significantly.
What is the silhouette score?
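The mean silhouette coefficient over all instances. Each coefficient is (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster; it ranges from -1 (probably in the wrong cluster) to +1 (well inside its own cluster).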
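A minimal sketch of the computation, assuming Euclidean distance, at least two clusters, and the convention that an instance alone in its cluster gets a coefficient of 0. For each instance, a is the mean distance to the other members of its cluster and b is the mean distance to the members of the nearest other cluster. The toy points are illustrative.

```python
import math

def silhouette_score(points, labels):
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    coefficients = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            coefficients.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members (the point itself adds 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        coefficients.append((b - a) / max(a, b))
    return sum(coefficients) / len(coefficients)

# Two tight, well-separated pairs: the score should be close to +1.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
```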
What are the limitations of K-Means?
It performs poorly when clusters have varying sizes, different densities, or non-spherical shapes.
How is clustering used for image segmentation?
By grouping pixels with similar colors into clusters to separate regions in an image.
How can clustering improve supervised learning?
By reducing dimensionality or generating features like distances to cluster centroids.
What is semi-supervised learning with clustering?
Using a few labeled instances to label entire clusters or to propagate labels to nearby points.
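One simple way to picture label propagation: every instance in a cluster inherits the majority label of the few labeled instances in that cluster. This is a sketch under that assumption; the function name, cluster ids, and labels are made up.

```python
from collections import Counter

def propagate_labels(cluster_ids, known_labels):
    """cluster_ids: cluster index per instance.
    known_labels: {instance index: label} for the few labeled instances.
    Returns one label per instance, or None if its cluster has no labeled member."""
    votes = {}
    for idx, label in known_labels.items():
        votes.setdefault(cluster_ids[idx], Counter())[label] += 1
    cluster_label = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [cluster_label.get(c) for c in cluster_ids]

cluster_ids = [0, 0, 0, 1, 1, 2]
known_labels = {0: "cat", 4: "dog"}  # only two instances are labeled
propagated = propagate_labels(cluster_ids, known_labels)
# Cluster 2 has no labeled member, so its instance stays unlabeled (None).
```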
What is DBSCAN?
A density-based clustering algorithm that groups together points in contiguous high-density regions and marks points in low-density regions as outliers.
What are the key parameters in DBSCAN?
eps (neighborhood radius) and min_samples (minimum number of instances inside a point's eps-neighborhood, counting the point itself, for it to be a core point).
What is a core point in DBSCAN?
A point that has at least min_samples instances (itself included) within its eps-radius neighborhood.
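A short sketch of the core-point test, assuming Euclidean distance and the convention that a point counts as a member of its own neighborhood. The function name and toy points are illustrative.

```python
import math

def core_points(points, eps, min_samples):
    """Return a True/False flag per point: is it a core point?"""
    flags = []
    for p in points:
        # Count every point within eps of p, including p itself.
        n_neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(n_neighbors >= min_samples)
    return flags

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
flags = core_points(points, eps=1.5, min_samples=3)
# The first three points are within 1.5 of each other; the last one is isolated.
```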
What is the main advantage of DBSCAN over K-Means?
DBSCAN can find clusters of arbitrary shapes and automatically detect outliers.
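A compact sketch of the whole procedure, assuming Euclidean distance and a brute-force neighbor search (fine for toy data, too slow for large datasets). Clusters grow outward from core points, so they can follow any shape, and whatever no cluster reaches keeps the label -1. The function name and points are hypothetical.

```python
import math

def dbscan(points, eps, min_samples):
    n = len(points)
    # Neighborhood of each point (the point itself is included).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n  # -1 means "not in any cluster" (outlier)
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Start a new cluster at an unvisited core point and expand it.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:  # only core points keep the expansion going
                        stack.append(q)
        cluster += 1
    return labels

# A chain of points, a separate pair, and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
labels = dbscan(points, eps=1.1, min_samples=2)
```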
What is a Gaussian Mixture Model (GMM)?
A probabilistic model assuming data is generated from a mixture of several Gaussian distributions.
How does GMM differ from K-Means?
GMM uses soft clustering with probability distributions, while K-Means uses hard assignments based on distance.
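The soft assignment can be sketched for a one-dimensional mixture whose parameters are assumed to be already fitted (the weights, means, and standard deviations below are made up). Each component's responsibility for x is its weighted density divided by the total mixture density.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Probability that x was generated by each component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical fitted parameters: two equally weighted unit-variance components.
weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]

soft = responsibilities(2.0, weights, means, stds)  # one probability per cluster
hard = soft.index(max(soft))                        # what a hard assignment keeps
```

An instance halfway between the two means (x = 2.5) would get a probability of 0.5 for each component, information that a hard assignment discards.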
What are the shapes that GMM can model?
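By default (full covariance), ellipsoidal clusters of any shape, size, and orientation. The covariance can also be constrained to spherical, diagonal (ellipsoid axes parallel to the coordinate axes), or tied (all clusters share the same shape, size, and orientation).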
How is anomaly detection performed using GMM?
By identifying points that fall in low-density regions of the fitted mixture, typically those whose estimated density is below a chosen threshold.
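A minimal sketch of density thresholding for a one-dimensional mixture. The mixture parameters are assumed to be already fitted, and both they and the threshold value are invented for illustration; in practice the threshold is often set from a chosen percentile of the densities.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def mixture_density(x, weights, means, stds):
    """Density of the whole mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical fitted parameters and samples.
weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
samples = [0.1, -0.3, 5.2, 4.8, 2.5, 12.0]

density_threshold = 0.01  # assumed cutoff, not from the presentation
anomalies = [x for x in samples
             if mixture_density(x, weights, means, stds) < density_threshold]
```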
What criteria can be used to select the number of GMM components?
The Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC); choose the number of components that minimizes it.
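Both criteria trade goodness of fit against model size. The sketch below uses the standard definitions, BIC = p * log(m) - 2 * log(L) and AIC = 2 * p - 2 * log(L), where m is the number of instances, p the number of learned parameters, and L the maximized likelihood. The function names and input numbers are illustrative.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood

# Made-up values: total log-likelihood -100 with 5 parameters on 100 instances.
example_bic = bic(log_likelihood=-100.0, n_params=5, n_samples=100)
example_aic = aic(log_likelihood=-100.0, n_params=5)
```

BIC penalizes extra parameters more heavily than AIC once log(m) exceeds 2, so it tends to select simpler models.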