All Flashcards
(21 cards)
What is unsupervised learning?
A type of machine learning where algorithms find patterns in data without labeled examples.
What are the main types of unsupervised learning?
Clustering, dimensionality reduction, and association rule learning.
How does K-means clustering work?
1) Place K random centroids, 2) Assign points to nearest centroid, 3) Move centroids to average of their points, 4) Repeat until convergence.
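The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters, and the choice to initialize centroids from randomly chosen data points are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids (an assumption;
    # other initialization schemes exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Tiny 1-D example with two obvious groups.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids, labels = kmeans(X, k=2)
print(centroids.ravel(), labels)
```

Because the starting centroids are random, different seeds can converge to different solutions on harder data, so K-means is commonly run several times and the best result kept.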
What is the “elbow method”?
A technique to find the optimal number of clusters by plotting K against inertia (within-cluster sum of squares) and looking for where the curve bends.
What does silhouette analysis measure?
How well separated the clusters are, by comparing how close each point is to its own cluster with how close it is to the nearest other cluster.
What is the Gap Statistic?
A method to find the optimal K by comparing clustering performance on real data versus random data with no cluster structure.
Why is a null distribution important in the Gap Statistic?
It provides a baseline of what clustering would look like in random data with no natural clusters.
What is dimensionality reduction?
Transforming high-dimensional data into a lower-dimensional representation while preserving important information.
What is the “curse of dimensionality”?
As dimensions increase, data becomes sparse, distances lose meaning, and algorithms become less effective.
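One way to see distances "losing meaning" is to measure how much farther the farthest point is than the nearest one as the dimension grows. The sketch below is only an illustration on uniform random data; the function name `distance_contrast` and the sample sizes are assumptions.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest neighbor of a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimension grows: "near" and "far" become hard to tell apart.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```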
What is Principal Component Analysis (PCA)?
A linear technique that reduces dimensions by projecting data onto directions of maximum variance.
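A minimal sketch of PCA through the singular value decomposition of the centered data, assuming NumPy. The function name `pca` and its return values are choices made for this example, not a reference to any particular library's API.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components.

    Returns (projected data, component directions, variance along each component).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    # Rows of Vt are the directions of maximum variance, strongest first.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance[:n_components]

# Points lying exactly on the line y = 2x: one component captures all the variance.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
projected, components, variance = pca(X, n_components=1)
print(components, variance)
```

The sign of each component is arbitrary, so a direction and its negation describe the same axis.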
How does t-SNE differ from PCA?
t-SNE is non-linear, focuses on preserving local structure, and is better for visualization but doesn’t preserve global relationships as well.
What are autoencoders?
Neural networks that compress data into fewer dimensions in the middle layer and then reconstruct the original data.
What is the main purpose of UMAP?
To reduce dimensionality while preserving both local and global structure; it generally retains more global structure than t-SNE and runs faster on large datasets.
How does association rule learning work?
It searches large datasets for variables or items that frequently occur together and reports them as rules (e.g., “customers who buy X often also buy Y”).
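A rule such as "X implies Y" is usually scored with support, confidence, and lift. These measures are standard but are not named on the card, so treat the sketch below as supplementary; the function name `rule_metrics` and the toy baskets are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(antecedent <= set(t) for t in transactions)        # baskets containing X
    n_c = sum(consequent <= set(t) for t in transactions)        # baskets containing Y
    n_both = sum((antecedent | consequent) <= set(t) for t in transactions)
    support = n_both / n                             # how often X and Y occur together
    confidence = n_both / n_a if n_a else 0.0        # estimate of P(Y | X)
    lift = confidence / (n_c / n) if n_c else 0.0    # above 1: Y is more likely given X
    return support, confidence, lift

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
print(rule_metrics(baskets, {"bread"}, {"butter"}))
```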
What’s the difference between hierarchical clustering and K-means?
Hierarchical clustering builds a tree of clusters, doesn’t require specifying K in advance, and can capture nested cluster structures; K-means produces a single flat partition into a number of clusters K chosen up front.
What does the “inertia” measure in K-means?
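The sum of squared distances between data points and their assigned cluster centroids (lower is better for a fixed K; it keeps shrinking as K grows, so it cannot be compared directly across different values of K).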
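Inertia can be computed directly from the points, their cluster labels, and the centroids. A minimal NumPy sketch, with the function name `inertia` chosen for this example:

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squared distances."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # centroids[labels] picks, for every point, the centroid it is assigned to.
    diffs = X - centroids[np.asarray(labels)]
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [10.0, 0.0]])
print(inertia(X, labels, centroids))  # 1 + 1 + 0 = 2.0
```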
How do you interpret silhouette scores?
Scores near +1 indicate well-defined clusters, scores near 0 indicate overlapping clusters, and scores near -1 suggest points may have been assigned to the wrong cluster.
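For a single point, the silhouette score is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below assumes Euclidean distance and at least two clusters, and follows the common convention of scoring singleton clusters as 0; the function name is chosen for this example.

```python
import numpy as np

def silhouette_samples(X, labels):
    """Silhouette score of every point (requires at least two clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score stays 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.0], [1.0], [10.0], [11.0]])
print(silhouette_samples(X, [0, 0, 1, 1]).mean())  # sensible grouping: close to +1
print(silhouette_samples(X, [0, 1, 0, 1]).mean())  # mixed-up grouping: negative
```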
When would you use DBSCAN instead of K-means?
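When clusters have irregular shapes, when the data contains noise or outliers, or when you don’t know the number of clusters in advance. Because DBSCAN uses a single density threshold, clusters with very different densities remain difficult for it.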
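A compact sketch of the DBSCAN idea: points with enough neighbors within a radius are "core" points, clusters grow outward from core points, and anything unreachable is labeled as noise. The parameter names `eps` and `min_samples` are the conventional ones; the implementation itself is a simplified illustration that builds a full distance matrix, so it only suits small datasets.

```python
from collections import deque

import numpy as np

def dbscan(X, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighborhood includes the point itself.
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue                      # already clustered, or not a core point
        labels[i] = cluster
        queue = deque([i])
        while queue:                      # grow the cluster outward from core points
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:        # border points join but do not expand
                        queue.append(q)
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]])
print(dbscan(X, eps=1.5, min_samples=3))  # two clusters, with the last point as noise
```

Note that the number of clusters is never passed in; it falls out of `eps` and `min_samples`.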
What is feature extraction?
Creating new features from original features to better represent underlying patterns (a form of dimensionality reduction).
What’s the difference between hard and soft clustering?
Hard clustering assigns each point to exactly one cluster; soft clustering (like fuzzy c-means) gives points membership degrees to multiple clusters.
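The contrast can be made concrete with the fuzzy c-means membership formula, in which a point's membership in a cluster is proportional to its distance to that centroid raised to the power -2/(m-1). The sketch below computes memberships for fixed centroids only (it is not the full iterative algorithm); the function name and the fuzzifier default `m=2.0` are assumptions for illustration.

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Soft membership of each point in each cluster; every row sums to 1."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # avoid dividing by zero at a centroid
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

X = np.array([[0.0], [5.0], [10.0]])
centroids = np.array([[0.0], [10.0]])
soft = fuzzy_memberships(X, centroids)
hard = soft.argmax(axis=1)                # hard clustering keeps only the best cluster
print(soft)
print(hard)
```

The middle point sits halfway between the two centroids, so its soft memberships are equal, while a hard assignment has to pick one cluster for it.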