07_unsupervised learning Flashcards
What is the goal for unsupervised problems?
find a transformation (T)
that builds a compact internal representation
of unlabeled data (x)
to unveil its internal structure
- in other words: we want a mathematical function (a model) that builds a representation of the data, i.e. an efficient way to represent the data
What is the difference between supervised and unsupervised learning?
in unsupervised learning…
- no labels are required, whereas those are crucial in supervised learning
- no specific task is learned (no classification, regression etc)
- train/val/test splits are not as common, but can still be useful depending on your goals
(split data sets still allow you to evaluate how well your model generalizes to unseen data)
Why use unsupervised learning?
unsupervised learning unveils structure in the data, which we want to exploit to make better use of it
- clustering
- dimensionality reduction
What is the goal of clustering in unsupervised learning?
reveal agglomerates in the data set that might indicate populations of samples with distinct properties
What is the goal of dimensionality reduction in unsupervised learning?
reduce the number of dimensions in the data for better interpretability and/or better performance of supervised learning approaches
What is Clustering?
is the task of grouping a set of objects
in such a way that objects in the same group (called a cluster)
are more similar (in some sense)
to each other than to those in other groups (clusters)
- identify latent (not observable) classes inherent to the data
- identify taxonomies in the underlying data
- identify noise/outliers
What are different approaches to clustering? (4)
- centroid-based: clusters are represented as mean vectors
- connectivity-based: clusters are defined based on distance connectivity
- distribution-based: clusters are defined by statistical distributions
- density-based: clusters are high density regions
What is a centroid-based clustering method?
k-means clustering
clusters are represented by vectors that point to their centers (centroids)
What is k-means clustering?
similarly to nearest-neighbour (NN) models and many other clustering methods, k-means clustering is based on a DISTANCE METRIC
e.g. the Euclidean distance metric
k-means does not see labels in a dataset
What are the steps to implement the k-means algorithm?
0) initialization: pick k random data points as cluster center locations
1) cluster assignment: compute distances between cluster centers and data points; assign each data point to the closest cluster
2) recompute cluster centers: recompute cluster center locations from all cluster members
3) reiterate steps 1 and 2
4) termination: terminate the algorithm once cluster assignments no longer change –> the clustering self-organizes (a minimal sketch of these steps follows below)
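A minimal numpy sketch of these steps (the function name and defaults are illustrative assumptions, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following steps 0)-4) above (numpy only)."""
    rng = np.random.default_rng(seed)
    # 0) initialization: pick k random data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # 1) cluster assignment: Euclidean distance to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # 4) termination: stop once assignments no longer change
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # 2)+3) recompute centers from all cluster members, then reiterate
        # (empty clusters are not handled in this sketch)
        centers = np.array([X[assignment == i].mean(axis=0) for i in range(k)])
    return centers, assignment

# usage (illustrative): X is an (n_samples, n_features) array
# centers, labels = kmeans(X, k=3)
```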
What happens if cluster colors are mismatched with k-means results?
the cluster indices k-means assigns are arbitrary, so the colors may not line up with the ground-truth labels; individual points assigned to the wrong cluster cannot be recovered from the clustering alone
–> the k-means clusters resemble the ground-truth labels very closely, but the mismatch in cluster indices is ignored: k-means is unsupervised
How do we choose the right value of k in k-means?
- qualitative approach: visual inspection and human intuition (works well on low-dimensional and well-behaved data)
- quantitative approach: BIC (see hierarchical clustering); a sketch follows below
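One way to make the BIC criterion concrete (a sketch assuming scikit-learn; k-means itself defines no likelihood, so a Gaussian mixture stands in here, and the helper name is hypothetical):

```python
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, k_range=range(1, 11), seed=0):
    """Fit a Gaussian mixture per candidate k; the lowest BIC wins."""
    bics = {k: GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
            for k in k_range}
    return min(bics, key=bics.get)
```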
What are limitations of the k-means method?
- relying on a distance metric, k-means intrinsically expects radially symmetric (“Gaussian”) clusters
- proper data scaling is crucial (see the sketch below)
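Because distances drive k-means, features on very different scales distort the clustering; a common remedy (a sketch assuming scikit-learn, with illustrative toy data) is to standardize the features first:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# toy data with wildly different feature scales (illustrative)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 300),      # feature on a ~unit scale
                     rng.normal(0, 1000, 300)])  # feature ~1000x larger

# standardize each feature to zero mean and unit variance,
# so no single feature dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```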
What is a distribution-based clustering method?
expectation maximization clustering
What is a big disadvantage of k-means/agglomerative clustering?
hard cluster assignment
–> each data point is assigned to a single cluster, no information on likelihood
how to compute likelihoods for each data point to belong to individual clusters?
–> EM (expectation maximization) clustering
How can we compute the likelihoods for each data point to belong to an individual cluster?
EM clustering
(expectation maximization clustering)
What is expectation maximization clustering?
the class-affiliation probability for each of the k clusters can be approximated by a Gaussian N(x; μ_i, Σ_i);
EM clustering is therefore a parametric clustering method
probability for data point x to belong to cluster i?
intuition: if |x − μ_i| is small, the probability that x belongs to cluster i is large;
if |x − μ_i| is large, the probability that x belongs to cluster i is small
we treat the data distribution as a mixture of Gaussians (Gaussian mixture model)
What are the steps to implement expectation maximization clustering?
0) initialize parameters: pick random data points for the centroids μ_i, and adopt default values for the covariances Σ_i
1) “expectation step”: for each data point j, calculate the probability that it belongs to cluster i
2) “maximization step”: recalculate the model parameters to better fit the previously derived probability distributions (maximize the likelihood)
–> changes the centroids and also the shape of the clusters
3) reiterate steps 1 and 2
4) termination: terminate the algorithm once cluster assignments no longer change (a minimal sketch follows below)
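A minimal sketch of EM clustering via scikit-learn's GaussianMixture (toy data and parameter choices are illustrative); fit() runs the expectation/maximization loop internally:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# toy data: one elongated and one round Gaussian blob (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], 200),
               rng.multivariate_normal([6, 0], [[1, 0], [0, 1]], 200)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # runs EM
print(gmm.means_)        # fitted centroids μ_i
print(gmm.covariances_)  # fitted covariances Σ_i
```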
What are the differences in the assumptions on the data between k-means clustering and expectation maximization clustering?
k-means:
distribution of radially symmetric (Gaussian) clusters
EM:
distributions of elongated (multivariate normal) clusters
How do k-means and expectation maximization clustering relate?
similar algorithms: an iterative two-step process for fitting the model
(k-means is actually a special case of EM)
with k-means, the covariances are fixed and only the centroids change
What is the difference in cluster assignments between k-means and expectation maximization clustering?
k-means:
hard assignment (each data point is assigned to exactly one cluster)
EM:
soft assignment (each data point has a probability associated with each cluster); the two are contrasted in the sketch below
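A side-by-side of the two assignment styles (a sketch assuming scikit-learn, with illustrative toy data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard[:3])  # e.g. [0 0 0]             -> one cluster index per point
print(soft[:3])  # e.g. [[0.99 0.01] ...]   -> one probability per cluster
```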
What is a connectivity-based clustering method?
hierarchical clustering
What is hierarchical clustering?
builds a hierarchy of clusters from the provided data set; the number of clusters found, k, varies at each level of the hierarchy and can be considered its hyperparameter.
hierarchical clustering is non-parametric
assumes that data points that are close to each other are more likely to be part of the same cluster
What are two different hierarchical clustering approaches that work very similarly?
- agglomerative clustering (bottom-up)
–> clusters are merged based on a distance function (see the sketch below)
- divisive clustering (top-down)
–> clusters are split based on a distance function
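A minimal sketch of agglomerative (bottom-up) clustering with scipy (the linkage method and the cut are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# build the full merge hierarchy bottom-up using average linkage
Z = linkage(X, method="average")

# cut the hierarchy to obtain k = 2 flat clusters;
# cutting at different levels yields different k (the hyperparameter)
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels))  # [1 2]
```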