Unsupervised Learning Flashcards
(43 cards)
What is unsupervised learning?
Learning patterns and relationships in data without any labelled ground truth to train on or evaluate against
Why would you want to use an unsupervised learning algorithm?
Segmentation, criminal activity identification, identifying new species, creating classes for a classification algorithm
What is clustering?
Grouping points together based on some distance function
What is community detection?
Taking a graph of nodes and determining communities of objects using metrics like similarity, distance, eigenvectors etc.
What is topic modelling?
A technique for detecting topics within a collection of documents by grouping together words that tend to occur together
What is an example of an unsupervised learning algorithm? (pick one)
K-means clustering, DBSCAN, hierarchical clustering (hard and soft clustering are categories of algorithm rather than algorithms themselves)
How does DBSCAN cluster objects together?
Finding core regions of high density, and expanding clusters from them
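The density-based expansion described in this card can be sketched as follows. This is a minimal illustration, not a reference implementation: the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point, counting the point itself) are assumptions borrowed from common usage, and neighbours are found by brute force.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Brute-force search: every point within eps of point i (including i).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1  # point i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also core, keep expanding
    return labels
```

For example, two tight groups of three points plus one distant point would be expected to give two clusters and one noise label.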
How does hierarchical clustering cluster objects together?
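Either by merging the closest clusters step by step (agglomerative), or by splitting clusters iteratively into two groups (divisive), until the desired level of grouping is reached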
What does it mean for a clustering algorithm to be hard?
Each object belongs in one cluster only
What does it mean for a clustering algorithm to be soft?
Each object may belong to multiple clusters at once, with corresponding probabilities
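The difference between hard and soft assignment can be illustrated with a small sketch. It assumes the clusters are summarised by centroids, and the soft weights use a normalised exp(-distance squared) rule chosen purely for illustration; real soft clustering algorithms define their probabilities in their own ways.

```python
import math

def hard_assign(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists))

def soft_assign(point, centroids):
    """Soft clustering: return one membership probability per centroid.

    The weighting rule is an illustrative assumption, not a specific algorithm.
    """
    sq = [math.dist(point, c) ** 2 for c in centroids]
    smallest = min(sq)
    # Subtracting the smallest value keeps the largest weight at exactly 1,
    # so the total can never underflow to zero.
    weights = [math.exp(-(d - smallest)) for d in sq]
    total = sum(weights)
    return [w / total for w in weights]
```

A point halfway between two centroids gets equal membership in both under `soft_assign`, while `hard_assign` still has to pick exactly one of them.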
How do we define where data should go in the clustering process? Name one method.
A metric like similarity or distance, such as Euclidean, Manhattan, Jaccard distance or Jaccard similarity
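The metrics named in this card can be written out directly. This is a minimal sketch using only the standard library; the function names are illustrative, and treating two empty sets as having similarity 1 is a convention assumed here.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(s & t) / len(s | t)

def jaccard_distance(s, t):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(s, t)
```

Euclidean and Manhattan distance apply to numeric vectors, while the Jaccard measures compare sets, for example the sets of words in two documents.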
What is hierarchical clustering?
Clustering by partitioning data into a hierarchy at different levels
What is a dendrogram?
A tree diagram that shows a hierarchy of clusters, where each node on the tree is a cluster
What are the two methods of creating a dendrogram?
Agglomerative and divisive
How can we create a dendrogram using an agglomerative method?
Starting with each item in its own cluster, we find the best pair to merge into a new cluster using a distance matrix, and repeat until all are fused together
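The agglomerative procedure can be sketched as a loop that keeps merging the best pair. This is a simplified illustration: it recomputes cluster distances on every pass instead of maintaining a distance matrix, and the default `closest_pair_distance` is just one assumed choice of cluster distance. The recorded merge history is the information a dendrogram would draw.

```python
import math
from itertools import combinations

def closest_pair_distance(a, b):
    """Cluster distance as the smallest point-to-point distance (assumed default)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, cluster_distance=closest_pair_distance):
    """Merge clusters pairwise until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start with each item in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance between them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Each entry of the returned list holds the two clusters that were fused and the distance at which they fused, which corresponds to the height of that join in the dendrogram.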
What is single (simple) linkage between two clusters in an agglomerative dendrogram?
Where we define cluster distance as the distance between the closest pair of members, one taken from each cluster
What is complete linkage between two clusters in an agglomerative dendrogram?
Where we define cluster distance as the distance between the farthest pair of members, one taken from each cluster
What is average linkage between two clusters in an agglomerative dendrogram?
Where we define cluster distance as the average distance over all pairs of members, one taken from each cluster
What is centroid linkage between two clusters in an agglomerative dendrogram?
Where we define cluster distance as the distance between the centroids of the two clusters
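The four linkage definitions covered so far (closest pair, farthest pair, average over all pairs, and centroids) can be compared side by side. This is a minimal sketch assuming clusters are lists of numeric tuples and that Euclidean distance is the underlying point-to-point metric; the function names are illustrative.

```python
import math

def single_linkage(a, b):
    """Distance between the closest pair, one point from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the farthest pair, one point from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean distance over every cross-cluster pair of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))
```

For the clusters [(0, 0), (2, 0)] and [(4, 0), (6, 0)], these give 2, 6, 4 and 4 respectively, which shows how the choice of linkage alone changes which clusters look "closest".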
What is Ward’s method between two clusters in an agglomerative dendrogram?
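Where, at each step, we merge the pair of clusters whose merger gives the smallest increase in total within-cluster variance (the sum of squared distances from points to their cluster centroid)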
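Ward's criterion is usually stated in terms of the within-cluster sum of squared distances to the centroid: the pair chosen for merging is the one whose merger increases that total the least. A minimal sketch of the merge cost, assuming clusters are lists of numeric tuples (the function names are illustrative):

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_increase(a, b):
    """How much the total within-cluster SSE grows if a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)
```

Merging never decreases the total, so the comparison is about which merge costs the least. The same quantity equals |A||B| / (|A| + |B|) times the squared distance between the two centroids.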
What is the downside to using single (simple) linkage between two clusters in an agglomerative dendrogram?
Favours long, chain-like clusters
What is the downside to using complete linkage between two clusters in an agglomerative dendrogram?
Breaks clusters into too many subclusters
What is the downside to using average linkage between two clusters in an agglomerative dendrogram?
Makes a large number of comparisons (every pair of members across the two clusters), so it is computationally slow
What is the downside to using centroid linkage between two clusters in an agglomerative dendrogram?
Biased towards spherical clusters