L13 - Hierarchical Structuring and DBSCAN Flashcards

1
Q

What is hierarchical clustering?

A
  • An unsupervised learning method for clustering similar data points.
  • The goal is to represent degrees of similarity within data.
  • Enables the partition of data at various levels.
2
Q

What is the structure of hierarchical clustering?

A
  • Clusters and sub-clusters.
  • Results in tree-like structure.
3
Q

What are Dendrograms?

A

Tree diagrams that show a hierarchy of clusters, where each node represents a cluster.

4
Q

In Dendrograms, what are leaves called?

A

Singletons

5
Q

What are the 2 approaches to creating a Dendrogram?

A
  1. Bottom up -> Agglomerative
  2. Top down -> Divisive
6
Q

How is a Dendrogram created using the Agglomerative method?

A
  1. Start with a set of individual data points (each its own cluster)
  2. Iteratively:
    1. Create a distance matrix between the current clusters
    2. Merge the closest ones
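The steps above can be sketched in plain Python. This is a minimal illustration only (single linkage on 1-D points; the function name `agglomerate` is made up here), not a production implementation — in practice a library such as scipy.cluster.hierarchy would be used.

```python
# Minimal agglomerative clustering sketch: single linkage, 1-D points.
def agglomerate(points):
    """Repeatedly merge the two closest clusters until one remains."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    merges = []                        # record of (merged cluster, distance)
    while len(clusters) > 1:
        best = None
        # distance-matrix step: scan every pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append((merged, d))
        del clusters[j]                # j > i, so deleting j first is safe
        clusters[i] = merged
    return merges

merges = agglomerate([1.0, 2.0, 9.0, 10.0])
# the two close pairs merge first, then the two resulting groups
```

The recorded merge order and distances are exactly the information a dendrogram draws.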
7
Q

How is a Dendrogram created using the Divisive method?

A
  • Start with a large set of clustered data points
  • Iteratively split the data points
8
Q

What is the time complexity of Hierarchical clustering?

A

O(N^3)

9
Q

When performing hierarchical clustering, what is the only thing we need to be able to create?

A

Distance matrix between points
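As a concrete illustration, a pairwise Euclidean distance matrix for a few 2-D points can be built in plain Python (in practice something like scipy.spatial.distance.pdist would be used; the toy points here are made up):

```python
import math

# Three 2-D points forming a 3-4-5 right triangle
points = [(0, 0), (3, 4), (0, 4)]
n = len(points)

# Symmetric distance matrix: dist[i][j] = Euclidean distance between i and j
dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
# diagonal entries are 0; dist[0][1] is the hypotenuse, 5.0
```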

10
Q

Why can’t we use brute force to calculate the Distance Matrix?

A

With a large number of points, the computational complexity becomes too high.

11
Q

What are the 5 methods we can use to calculate distance between clusters?

A
  1. Single Linkage -> Distance defined as the distance between the 2 closest data points of 2 separate clusters.
  2. Complete Linkage -> Distance defined as the distance between the 2 furthest data points of 2 separate clusters.
  3. Average Linkage -> Distance defined as the average distance between all members of each cluster.
  4. Centroid Linkage -> Distance defined as the distance between the centroids of each cluster.
  5. Ward's Method -> Join the pair of clusters that minimises the increase in total distance from the centroids.
12
Q

What is an issue with each of the 5 distance calculation methods?

A
  1. Single Linkage -> Can lead to a long chain of clusters
  2. Complete Linkage -> Often breaks large clusters into 2 or more
  3. Average Linkage -> High computational cost, since every pair of data points must be compared
  4. Centroid Linkage -> Biased towards spherical clusters
  5. Ward's Method -> Biased towards spherical clusters
13
Q

What method do we use to know if we have a good cluster count?

A

Elbow method

14
Q

What is DBSCAN?

A

Rather than comparing every point to every other point (as hierarchical clustering does), DBSCAN places a density threshold (a radius) around each point; if the area within that radius contains enough other data points, the point is considered a Dense Point.

15
Q

What are the 2 hyper parameters of DBSCAN?

A

Epsilon -> Radius of the density threshold

MinPts -> Minimum number of points needed within the threshold for a point to be considered a dense point

16
Q

How are points categorised in DBSCAN?

A

Core point -> A point with at least MinPts points within its Epsilon radius. These points are in the interior of a cluster.

Border point -> A point with fewer than MinPts points within its Epsilon radius, but within the neighbourhood of a core point.

Noise point -> Any point that is not a core or border point.
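This labelling step can be sketched in plain Python (the function name and toy points are made up; here a point's neighbourhood count includes the point itself, one common convention):

```python
import math

def label_points(points, eps, min_pts):
    """Label each point as core, border, or noise per the DBSCAN rules."""
    # neighbourhood of each point: everything within eps (itself included)
    neighbours = {
        p: [q for q in points if math.dist(p, q) <= eps] for p in points
    }
    labels = {}
    # core points: at least min_pts points inside the eps radius
    for p in points:
        if len(neighbours[p]) >= min_pts:
            labels[p] = "core"
    for p in points:
        if p in labels:
            continue
        # border: not dense itself, but inside some core point's radius
        if any(labels.get(q) == "core" for q in neighbours[p]):
            labels[p] = "border"
        else:
            labels[p] = "noise"
    return labels

pts = [(0, 0), (0, 1), (1, 0), (2, 0), (10, 10)]
labels = label_points(pts, eps=1.5, min_pts=3)
# the tight group is core, (2, 0) hangs off its edge, (10, 10) is isolated
```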

17
Q

What is the result of DBSCAN?

A

All points within a cluster can reach one another through steps of size Epsilon
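The reachability idea can be sketched as a breadth-first search over eps-sized hops (a simplification: full DBSCAN only expands the search through core points; the function name and points are made up):

```python
import math
from collections import deque

def reachable(points, start, goal, eps):
    """True if goal can be reached from start via hops of size <= eps."""
    seen, frontier = {start}, deque([start])
    while frontier:
        p = frontier.popleft()
        if p == goal:
            return True
        for q in points:
            if q not in seen and math.dist(p, q) <= eps:
                seen.add(q)
                frontier.append(q)
    return False

pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
# (2, 0) is reachable from (0, 0) via (1, 0); (10, 0) is not
```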

18
Q

What are the pros and cons of DBSCAN?

A

Pros -> Resistant to noise. Can handle clusters of different shapes and sizes.

Cons -> Eps and MinPts interact and can be hard to specify.

19
Q

What are 2 limitations of DBSCAN?

A
  • Struggles to find clusters of varying densities
  • Highly dependent on Eps and MinPts selection