Unsupervised Learning Flashcards

(43 cards)

1
Q

What is unsupervised learning?

A

Learning structure and relationships in data without labelled ground truth to train on or evaluate against

2
Q

Why would you want to use an unsupervised learning algorithm?

A

Segmentation, criminal activity identification, identifying new species, creating classes for a classification algorithm

3
Q

What is clustering?

A

Grouping points together based on some distance function

4
Q

What is community detection?

A

Taking a graph of nodes and determining communities of objects using metrics like similarity, distance, eigenvectors etc.

5
Q

What is topic modelling?

A

A technique for detecting topics within a collection of documents, by grouping words that tend to occur together

6
Q

What is an example of an unsupervised learning algorithm? (pick one)

A

K-means clustering, DBSCAN, hierarchical clustering, Gaussian mixture models

7
Q

How does DBSCAN cluster objects together?

A

Finding core regions of high density, and expanding clusters from them

8
Q

How does hierarchical clustering cluster objects together?

A

Either merging the closest clusters step by step (agglomerative) or splitting clusters iteratively into two groups (divisive), until we have the desired groups

9
Q

What does it mean for a clustering algorithm to be hard?

A

Each object belongs in one cluster only

10
Q

What does it mean for a clustering algorithm to be soft?

A

Each object may belong to multiple clusters at once, with corresponding probabilities
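
The difference between soft and hard assignments can be sketched in a few lines. The membership probabilities below are made-up illustrative values, not the output of any particular algorithm; a hard assignment is obtained by keeping only the most probable cluster for each object.

```python
# Hypothetical soft memberships: each object has a probability per cluster.
soft = {
    "x1": {"A": 0.9, "B": 0.1},
    "x2": {"A": 0.4, "B": 0.6},
    "x3": {"A": 0.2, "B": 0.8},
}

# Hard assignment: each object goes to exactly one cluster (the most probable).
hard = {obj: max(probs, key=probs.get) for obj, probs in soft.items()}

print(hard)  # {'x1': 'A', 'x2': 'B', 'x3': 'B'}
```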

11
Q

How do we define where data should go in the clustering process? Name one method.

A

A metric like similarity or distance, such as Euclidean, Manhattan, Jaccard distance or Jaccard similarity
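
As a minimal sketch, the metrics named above can be written with the Python standard library alone. The function names are illustrative choices, and treating two empty sets as identical is an assumed convention.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(s & t) / len(s | t)

def jaccard_distance(s, t):
    """Jaccard distance is one minus Jaccard similarity."""
    return 1.0 - jaccard_similarity(s, t)

print(euclidean((0, 0), (3, 4)))                             # 5.0
print(manhattan((0, 0), (3, 4)))                             # 7
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```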

12
Q

What is hierarchical clustering?

A

Clustering that organises data into a nested hierarchy of clusters at different levels of granularity

13
Q

What is a dendrogram?

A

A tree diagram that shows a hierarchy of clusters, where each node on the tree is a cluster

14
Q

What are the two methods of creating a dendrogram?

A

Agglomerative and divisive

15
Q

How can we create a dendrogram using an agglomerative method?

A

Starting with each item in its own cluster, we find the best pair to merge into a new cluster using a distance matrix, and repeat until all are fused together
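
The merge loop described above can be sketched as follows. This is a deliberately naive version under stated assumptions: "best pair" is taken to mean the two clusters whose closest members are nearest (one possible rule among several), and distances are recomputed on every pass instead of being cached in a distance matrix. The function name and sample points are illustrative.

```python
import math

def agglomerative(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest closest-member distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (7.0,)]
for a, b, d in agglomerative(points):
    print(a, b, d)
# [0] [1] 1.0
# [2] [3] 2.0
# [0, 1] [2, 3] 4.0
```

The merge history, read top to bottom, is exactly the information a dendrogram draws: which clusters were fused and at what distance.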

16
Q

What is single linkage between two clusters in an agglomerative dendrogram?

A

Where we define cluster distance as the distance between the closest pair of members, one from each cluster

17
Q

What is complete linkage between two clusters in an agglomerative dendrogram?

A

Where we define cluster distance as the distance between the farthest pair of members, one from each cluster

18
Q

What is average linkage between two clusters in an agglomerative dendrogram?

A

Where we define cluster distance as the average distance over all pairs of members, one from each cluster

19
Q

What is centroid linkage between two clusters in an agglomerative dendrogram?

A

Where we define cluster distance as the distance between the centroids of the two clusters
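
To compare the linkage rules side by side, here is a small sketch that computes the closest-pair (single), farthest-pair (complete), average and centroid linkage distances between the same two clusters. Euclidean distance and the sample points are assumptions chosen for illustration.

```python
import math
from itertools import product

def single_linkage(A, B):
    """Distance between the closest pair, one member from each cluster."""
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    """Distance between the farthest pair, one member from each cluster."""
    return max(math.dist(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def centroid_linkage(A, B):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(A), centroid(B))

A = [(0, 0), (0, 4)]
B = [(3, 0), (3, 4)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 5.0
print(average_linkage(A, B))   # 4.0
print(centroid_linkage(A, B))  # 3.0
```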

20
Q

What is Ward’s method between two clusters in an agglomerative dendrogram?

A

Where, at each step, we join the pair of clusters whose merge gives the smallest increase in total within-cluster variance (sum of squared distances from the centroids)
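
As a rough sketch of the quantity Ward's method compares, the cost of a merge can be computed as the increase in the within-cluster sum of squared distances to the centroid. The function names and sample clusters are illustrative assumptions.

```python
def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    centre = [sum(c) / n for c in zip(*points)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centre)) for p in points)

def ward_cost(A, B):
    """Increase in total within-cluster SSE caused by merging A and B."""
    return sse(A + B) - sse(A) - sse(B)

A = [(0, 0), (2, 0)]
B = [(10, 0), (12, 0)]
C = [(3, 0), (5, 0)]
print(ward_cost(A, B))  # 100.0
print(ward_cost(A, C))  # 9.0
```

With these values, merging A with C is far cheaper than merging A with B, so A and C would be joined first.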

21
Q

What is the downside to using single linkage between two clusters in an agglomerative dendrogram?

A

Favours long, chain-like clusters

22
Q

What is the downside to using complete linkage between two clusters in an agglomerative dendrogram?

A

Tends to break large clusters into several smaller subclusters, and is sensitive to outliers

23
Q

What is the downside to using average linkage between two clusters in an agglomerative dendrogram?

A

Makes a large number of comparisons, so it is computationally slow

24
Q

What is the downside to using centroid linkage between two clusters in an agglomerative dendrogram?

A

Biased towards spherical clusters

25
What is the downside to using Ward's method between two clusters in an agglomerative dendrogram?
Biased towards spherical clusters
26
What are the two hyperparameters used in the DBSCAN algorithm?
Radius and minimum number of points
27
When is a point a core point in DBSCAN?
If a point has at least the minimum number of points within its radius (usually counting the point itself)
28
When is a point a border point in DBSCAN?
If a point has fewer than the minimum number of points within its radius, but is within the neighbourhood of a core point
29
When is a point a noise point in DBSCAN?
If a point is not a core point or border point
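
The three point types can be sketched as a small classifier. This is only the labelling step, not the full DBSCAN cluster expansion, and it assumes Euclidean distance, a neighbourhood that includes the point itself, and a threshold of "at least min_pts" neighbours; conventions vary between textbooks and libraries. The names eps, min_pts and classify_points are illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    def neighbours(i):
        # Indices within eps of point i, including i itself (assumed convention).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(neighbours(i)) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours(i)):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'border', 'core', 'border', 'noise']
```
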
30
What are the benefits to using DBSCAN?
It is resistant to noise, and can handle arbitrarily shaped clusters
31
What are the downsides to using DBSCAN?
It struggles with clusters of varying density, and is sensitive to the choice of its two hyperparameters
32
What are partitional clustering methods?
Methods that split our space into partitions, corresponding to clusters
33
What is K-means clustering?
A partitional clustering method that represents each of K clusters by a centroid and assigns every point to its nearest centroid
34
How does K-means clustering define clusters?
We initially choose K random data points as the cluster centroids, then repeat two steps until convergence: assign each point to its nearest centroid using some distance metric, and recompute each centroid as the mean of the points assigned to it
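
A minimal sketch of this loop in plain Python follows. To keep the example reproducible, the starting centroids are passed in explicitly instead of being drawn at random; Euclidean distance, the function name and the handling of empty clusters are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

For random initialisation, the starting centroids could be drawn with `random.sample(points, k)` from the standard library.
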
35
What is the downside to K-means clustering?
It assumes that clusters are of similar size and struggles when they are not. The results also depend on the starting centroids, so the algorithm is usually run from several initialisations, and the number of clusters K must be chosen in advance
36
What is the Gaussian Mixture Model (GMM)?
A soft clustering method where each cluster is represented by a Gaussian probability distribution, and each point gets a probability of belonging to each cluster
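
To make the probabilistic view concrete, here is a sketch of how membership probabilities would be computed for a one-dimensional mixture whose parameters are already known. It covers only that single step, not the fitting of the model, and the weights, means and standard deviations are made-up values.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def memberships(x, components):
    """Probability that x belongs to each component.

    components is a list of (weight, mean, std) tuples (hypothetical values).
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
print(memberships(2.0, components))  # [0.5, 0.5], the point is halfway between
print(memberships(0.0, components))  # almost entirely the first component
```
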
37
What is clustering validation?
Using some metric to determine how successful clustering was
38
What is an external validation method of clustering?
Measuring how clustering labels compare to externally supplied labels
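
One common external measure is the Rand index, used here as an illustrative choice (the card does not name a specific metric). It is the fraction of point pairs on which two labelings agree about "same cluster" versus "different cluster".

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0, cluster names do not matter
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
```
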
39
What is an internal validation method of clustering?
Measuring, from the data alone, whether points in the same cluster are close together and points in different clusters are far apart
40
What is a relative validation method of clustering?
Comparing one clustering to another, and seeing if they agree
41
What is cluster cohesion for clustering validation? Is it [internal/external/relative]?
A validation technique that measures the distance within a cluster. It is internal
42
What is cluster separation for clustering validation? Is it [internal/external/relative]?
A validation technique that measures the distance between clusters. It is internal
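
One common way to quantify cohesion and separation, assumed here for illustration, is with sums of squares: cohesion as the within-cluster sum of squared distances to each centroid, and separation as the size-weighted squared distance of each centroid from the overall mean. The function names and sample clusters are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cohesion(clusters):
    """Within-cluster sum of squares: lower means tighter clusters."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def separation(clusters):
    """Between-cluster sum of squares: higher means clusters lie further apart."""
    overall = centroid([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)

clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(cohesion(clusters))    # 4.0
print(separation(clusters))  # 100.0
```
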
43
What is the silhouette coefficient for clustering validation? Is it [internal/external/relative]?
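A validation technique that combines cohesion and separation. For each point, take the mean distance to the points in the nearest other cluster (b) minus the mean distance to the other points in its own cluster (a), divided by the larger of the two: (b - a) / max(a, b). Scores range from -1 to 1, with higher being better. It is internal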
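
A minimal sketch of the silhouette coefficient, averaged over all points, is shown below. It assumes Euclidean distance and at least two clusters, and it scores points in singleton clusters as 0 by convention; the function names and sample data are illustrative.

```python
import math

def mean_distance(i, cluster, points, labels):
    """Mean distance from point i to the other points labelled `cluster`."""
    ds = [math.dist(points[i], q)
          for j, q in enumerate(points) if labels[j] == cluster and j != i]
    return sum(ds) / len(ds) if ds else None

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i in range(len(points)):
        a = mean_distance(i, labels[i], points, labels)  # cohesion term
        if a is None:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        b = min(mean_distance(i, c, points, labels)      # separation term
                for c in set(labels) if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1])  # close to 1: tight, well separated
bad = silhouette(points, [0, 1, 0, 1])   # negative: points sit in the wrong cluster
print(good, bad)
```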