Chapter 9 Flashcards by Gourgey Hats

What is the main limitation of supervised learning mentioned in the presentation?

It requires labeled data, which is often unavailable or expensive to obtain.

What is the goal of unsupervised learning?

To analyze data without labels and discover hidden structures like clusters or anomalies.

What are common applications of clustering?

Customer segmentation, data analysis, anomaly detection, semi-supervised learning, and image segmentation.

What is the K-Means clustering algorithm?

An algorithm that partitions data into k clusters by minimizing the sum of squared distances between instances and their nearest cluster centroids.

What are the main steps in the K-Means algorithm?

Place k centroids at random, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing.
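
The loop can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `kmeans`, the sample points, and the choice to start from k randomly picked instances are assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-Means on a list of tuples (illustrative sketch only)."""
    rng = random.Random(seed)
    # Step 1: pick k instances at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical sample data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can converge to different solutions, which is why the algorithm is usually run several times.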

What does ‘inertia’ measure in K-Means?

The sum of squared distances between each instance and its nearest cluster centroid (dividing by the number of instances gives the mean squared distance).
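
A small worked example may help. The sketch below computes the total of the squared distances and, from it, the mean per instance; scikit-learn's `inertia_` attribute reports the total. The function name `inertia` and the sample points are assumptions made for illustration.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )

# Hypothetical data: every point lies exactly 1 unit from its nearest centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

total = inertia(points, centroids)  # sum of squared distances: 4.0
mean = total / len(points)          # mean squared distance: 1.0
```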

What is K-Means++ initialization?

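An initialization method that picks initial centroids that tend to be far apart: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which makes convergence to a poor solution less likely.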
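
The seeding step can be sketched as follows, assuming the usual K-Means++ rule: the first centroid is drawn uniformly, and each later one is drawn with probability proportional to its squared distance from the nearest centroid picked so far. The function name `kmeans_pp_init` and the sample points are hypothetical.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with K-Means++-style weighted sampling (sketch)."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points
        ]
        # Far-away points get a larger weight; already-chosen points get weight 0.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical sample data: three loose groups of distinct 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.3, 4.8), (10.0, 0.0), (9.7, 0.4)]
seeds = kmeans_pp_init(points, k=3)
```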

What is the difference between hard and soft clustering?

Hard clustering assigns each instance to one cluster; soft clustering assigns a score or probability for each cluster.

How can you speed up K-Means on large datasets?

Use Mini-Batch K-Means, which updates centroids using small random subsets of the data.

What is the elbow method in K-Means?

A heuristic for choosing the number of clusters: plot inertia against k and pick the ‘elbow’, the point beyond which adding clusters reduces inertia only slightly.

What is the silhouette score?

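A metric that measures how similar an instance is to its own cluster compared with the nearest other cluster. Each instance's silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other instances in its cluster and b is the mean distance to the instances of the nearest other cluster; it ranges from -1 to +1, and the silhouette score is the mean over all instances.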
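
As a rough sketch, the score can be computed by hand: for each instance, take a as the mean distance to the other members of its cluster and b as the mean distance to the members of the nearest other cluster, then average (b - a) / max(a, b) over all instances. The function name, the sample data, and the convention of scoring a single-member cluster as 0 are assumptions made for this example, which also requires at least two clusters.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all instances (illustrative sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 1-D data: two tight, well-separated clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)  # close to +1 for well-separated clusters
```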

What are the limitations of K-Means?

It performs poorly when clusters have varying sizes, different densities, or non-spherical shapes.

How is clustering used for image segmentation?

By grouping pixels with similar colors into clusters to separate regions in an image.

How can clustering improve supervised learning?

By reducing dimensionality or generating features like distances to cluster centroids.

What is semi-supervised learning with clustering?

Using a few labeled instances to label entire clusters or to propagate labels to nearby points.

What is DBSCAN?

A density-based clustering algorithm that groups together points in high-density regions and marks outliers.

What are the key parameters in DBSCAN?

eps (the neighborhood radius) and min_samples (the minimum number of instances within a point's eps-neighborhood for it to count as a core point).

What is a core point in DBSCAN?

A point with at least min_samples neighbors within its eps-radius.
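
The core-point test can be sketched directly. This example assumes that a point counts itself among its neighbors, which is how scikit-learn counts `min_samples`; the function name `core_point_indices` and the sample points are hypothetical.

```python
import math

def core_point_indices(points, eps, min_samples):
    """Return the indices of points with at least min_samples points within eps."""
    core = []
    for i, p in enumerate(points):
        # Count every point within distance eps, including p itself (assumption).
        n_neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if n_neighbors >= min_samples:
            core.append(i)
    return core

# Hypothetical data: three points close together and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
core = core_point_indices(points, eps=1.5, min_samples=3)
```

Here the three nearby points are core points, while the isolated point has only itself in its neighborhood and would be treated as an outlier.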

What is the main advantage of DBSCAN over K-Means?

DBSCAN can find clusters of arbitrary shapes and automatically detect outliers.

What is a Gaussian Mixture Model (GMM)?

A probabilistic model assuming data is generated from a mixture of several Gaussian distributions.

How does GMM differ from K-Means?

GMM uses soft clustering with probability distributions, while K-Means uses hard assignments based on distance.

What are the shapes that GMM can model?

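Spherical, diagonal (ellipsoids aligned with the axes), tied (all clusters share the same shape, size, and orientation), or full (each cluster has its own unconstrained ellipsoidal shape).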

How is anomaly detection performed using GMM?

By flagging instances that lie in low-density regions of the fitted mixture, typically those whose estimated density falls below a chosen threshold.

What criteria can be used to select the number of GMM components?

Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC).
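
Both criteria trade goodness of fit against model size, and the candidate with the lowest value is preferred. The sketch below uses the standard definitions, BIC = log(m) * p - 2 * log(L) and AIC = 2 * p - 2 * log(L), where m is the number of instances, p the number of learned parameters, and L the maximized likelihood. The log-likelihood values are made-up numbers for illustration only, not results from any real dataset.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: lower is better."""
    return math.log(n_samples) * n_params - 2 * log_likelihood

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

n_samples = 200
# Hypothetical fits: number of components -> (total log-likelihood, parameter count).
# Parameter counts assume 2-D data with full covariances: 6 * k - 1.
candidates = {1: (-520.0, 5), 2: (-450.0, 11), 3: (-445.0, 17)}

bic_by_k = {k: bic(ll, p, n_samples) for k, (ll, p) in candidates.items()}
aic_by_k = {k: aic(ll, p) for k, (ll, p) in candidates.items()}

best_k_bic = min(bic_by_k, key=bic_by_k.get)
best_k_aic = min(aic_by_k, key=aic_by_k.get)
```

With these made-up numbers, the third component improves the likelihood too little to offset its extra parameters, so both criteria select two components.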

What other clustering algorithms are mentioned in the presentation?

Agglomerative clustering, BIRCH, Mean-Shift, Affinity Propagation, and Spectral Clustering.

What unsupervised algorithms are used for anomaly detection?

PCA, Fast-MCD, Isolation Forest, Local Outlier Factor, and One-Class SVM.

Chapter 9 Flashcards

(26 cards)