Week 4: Clustering Flashcards

1
Q

Clustering

A

Clusters can be overlapping or non-overlapping regions. Generally, clusters are formed by finding groupings of instances that have small intra-cluster distances and larger inter-cluster distances.

2
Q

Dendrogram

A

Represents a hierarchy of clusters.

3
Q

K-Means Clustering

A

Instances are assigned to their closest centroid, and each centroid is then recomputed as the average of all members in its cluster; the two steps repeat until the assignments no longer change.
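A minimal sketch of this loop in NumPy (illustrative only: the choice of K, the random initialisation, and the stopping rule are assumptions, and empty clusters are not handled):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids with K randomly chosen instances.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each instance goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids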

4
Q

Density

A

The density of a data point x is the number of data points within a specified radius Eps of x, including x itself.

5
Q

Categorisation of Data Points

A

Core Point: its density is at least the threshold MinPts.
Border Point: not a core point itself, but lies within the Eps-neighbourhood of a core point.
Noise Point: neither a core point nor a border point.

6
Q

DBSCAN Algorithm

A

It’s a density-based clustering algorithm that identifies high-density regions separated by regions of low density. Noise points are ignored.

Finding suitable values for Eps and MinPts is a key issue. The elbow rule can be used: plot each point's distance to its i-th nearest neighbour, with points sorted by that distance, and choose Eps at the elbow of the curve.
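A usage sketch with scikit-learn (eps=0.15, min_samples=5 and the toy data are illustrative assumptions; the helper computes the k-distance values one would inspect for the elbow):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def kth_neighbour_distances(X, k=5):
    # Distance of every point to its k-th nearest neighbour, sorted.
    # Plotting these values and looking for the elbow suggests an Eps.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbour
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, -1])

X = np.random.rand(200, 2)                   # toy data
eps_candidates = kth_neighbour_distances(X, k=5)

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
# Noise points receive the label -1 and belong to no cluster.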

7
Q

Graph-based Algorithm

A

Given a dataset, construct a proximity matrix, build a sparse graph from it, and partition that graph; the partitions become the clusters.
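A hedged sketch of this pipeline using scikit-learn: a kNN graph plays the role of the sparsified proximity graph, and spectral partitioning stands in for a minimum k-cut style partitioner (n_neighbors=10 and n_clusters=3 are illustrative assumptions):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

X = np.random.rand(200, 2)                            # toy data

# Sparse graph: keep only each point's 10 nearest neighbours.
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
A = A.maximum(A.T)                                    # symmetrise the adjacency

# Partition the graph; the partitions become the clusters.
labels = SpectralClustering(n_clusters=3, affinity='precomputed').fit_predict(A)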

8
Q

Sparsification

A

Filter out edges of the proximity graph (e.g. keep only each node's connections to its most similar neighbours) so that the cluster structure is accentuated in the sparsified graph before it is partitioned.

9
Q

Graph Partitioning

A

Graph cutting algorithms can be utilised, with the goal of finding a subset of edges whose deletion cuts the graph into k connected components while minimising the total weight of the deleted edges (the minimum k-cut problem).

10
Q

Incremental Clustering

A

It processes instances one by one in an online manner, without needing simultaneous access to all instances.

Main points:
1. At each stage, the overall clustering forms a tree.
2. The root node represents the entire dataset.
3. Leaves correspond to individual instances.
4. Nodes with the same parent form a cluster.

Steps:
1. Initialise a tree containing only the root node.
2. When a new instance is added, either a leaf for the new instance is created, or the tree is restructured by merging/splitting two nodes. Decisions are based on the category utility metric, which measures the overall quality of a partition of instances into clusters (one common form of the metric is given after these steps).
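One common form of the category utility metric for nominal attributes (the Gluck and Corter formulation, stated here for reference rather than taken from these notes):

CU(C_1, \dots, C_K) = \frac{1}{K} \sum_{k=1}^{K} \Pr(C_k) \left[ \sum_{i} \sum_{j} \Pr(a_i = v_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} \Pr(a_i = v_{ij})^2 \right]

where a_i ranges over the attributes and v_{ij} over the possible values of attribute a_i.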

11
Q

Fuzzy Cluster

A

Clusters whose elements have degrees of membership.

12
Q

Distance Metrics

A

3 Main Types:

  1. Distance between the values of two different instances with respect to an individual attribute.
  2. Distance between two different instances, which aggregates all attributes.
  3. Distance between two different clusters, which aggregates all instances.

13
Q

Cluster Centroid for Numerical Data

A

Calculate the arithmetic mean for each attribute using all cluster points.

14
Q

Cluster Medoid for Nominal Data

A

The most representative actual point of a cluster, typically the member with the smallest total distance to all other members (used where a mean is undefined, as for nominal attributes).

15
Q

Cluster Diameter

A

The largest distance between any two points in a cluster.

16
Q

Single-Link/Single Linkage

A

The minimum distance between any two points in the two different clusters.

17
Q

Complete-Link/Complete-Linkage

A

The maximum distance between any two points in the two different clusters.

18
Q

Average-Link/Average-Linkage/Group Average

A

The average distance over all pairs of instances with one instance taken from each of the two clusters.

19
Q

Centroid-Link/Centroid-Linkage

A

The distance between the centroids of the two clusters.

20
Q

Within Cluster Score

A

A measure of cluster compactness: the sum of squared distances between each point and its cluster centroid.
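In symbols (a standard formulation, assuming \boldsymbol{c}_k denotes the centroid of cluster C_k):

WC = \sum_{k=1}^{K} \sum_{\boldsymbol{x}_i \in C_k} d(\boldsymbol{x}_i, \boldsymbol{c}_k)^2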

21
Q

Between Cluster Score

A

A measure of cluster separation: the sum of squared distances between each pair of cluster centroids.
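In symbols (a standard formulation, with \boldsymbol{c}_j and \boldsymbol{c}_k the centroids of clusters C_j and C_k):

BC = \sum_{j < k} d(\boldsymbol{c}_j, \boldsymbol{c}_k)^2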

22
Q

Calinski-Harabasz Index

A

Based on the sums of squared distances of points to centroids, both within and between clusters. It is high when the clusters are dense and well separated.

CH = \frac{BC}{WC} \times \frac{n-K}{K-1}

n = number of instances
K = number of clusters
BC = between cluster sum of squared distances
WC = within cluster sum of squared distances
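For computing it in practice, scikit-learn exposes the index directly (a usage sketch; X and labels are assumed to be the data matrix and the cluster assignments):

from sklearn.metrics import calinski_harabasz_score

score = calinski_harabasz_score(X, labels)  # higher indicates denser, better-separated clusters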

23
Q

Silhouette Coefficient

A

s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}

s_i close to 1: dense and well-separated clusters
s_i close to 0: overlapping clusters
s_i close to -1: the instance has likely been assigned to the wrong cluster

a_i = \frac{1}{m_k - 1} \sum_{i' \in C_k \setminus \{i\}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})

a_i is the mean distance between \boldsymbol{x}_i and all other data points in the same cluster C_k (of size m_k).

b_i = \frac{1}{m_{k'}} \sum_{i' \in C_{k'}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})

b_i is the mean distance between \boldsymbol{x}_i and all points in the nearest other cluster C_{k'} (of size m_{k'}).
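A usage sketch with scikit-learn (X and labels assumed as above; silhouette_score averages s_i over all instances, while silhouette_samples returns one value per instance):

from sklearn.metrics import silhouette_score, silhouette_samples

mean_s = silhouette_score(X, labels)         # mean s_i over all instances
per_point_s = silhouette_samples(X, labels)  # individual s_i values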

24
Q

Minimum Description Length (MDL)

A

The best model is the simplest. The fewer bits it takes to encode the model, the better. We want to maximise the posterior probability Pr(h \mid D), i.e. the chance that the model h is correct given the set of examples D.
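Via Bayes' rule this has an equivalent description-length reading (a standard derivation, added here for reference):

\arg\max_h \Pr(h \mid D) = \arg\max_h \Pr(D \mid h)\Pr(h) = \arg\min_h \left[ -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) \right]

where -\log_2 \Pr(h) is the number of bits needed to encode the model and -\log_2 \Pr(D \mid h) the number of bits needed to encode the data (the exceptions) given the model.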

25
Q

Outlier/Anomaly Detection

A

In supervised settings, human experts label both normal and anomalous instances, so detection can be handled as a classification problem.

In semi-supervised settings, collecting labelled outliers is difficult, but normal instances are more easily attainable, so a model of normal behaviour is learned and deviations from it are flagged.

In unsupervised settings, assumptions are needed to define when a data point counts as being far from the rest of the data.

26
Q

Model-Based Outlier

A

An observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.

27
Q

Proximity-Based Outlier

A

An object that has a very large distance to its k-th nearest neighbour. The outlier set is O = \{\boldsymbol{x}_i : d_k(\boldsymbol{x}_i) \ge T,\ 1 \le i \le n\}, where d_k(\boldsymbol{x}_i) is the distance from \boldsymbol{x}_i to its k-th nearest neighbour and T is a threshold.

28
Q

Density

A

The inverse of the average distance to the k nearest neighbours.

\delta_k(\boldsymbol{x}) = \left( \frac{1}{\lvert N_k(\boldsymbol{x}) \rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} d(\boldsymbol{x}, \boldsymbol{x}') \right)^{-1}

where N_k(\boldsymbol{x}) is the set of k nearest neighbours of \boldsymbol{x}.

29
Q

Relative Density

A

This helps find outliers even when density differs across different regions.

r_k(\boldsymbol{x}) = \frac{\delta_k(\boldsymbol{x})}{\frac{1}{\lvert N_k(\boldsymbol{x}) \rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} \delta_k(\boldsymbol{x}')}
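A sketch of both quantities with scikit-learn's NearestNeighbors (k=5 is an illustrative assumption; the code follows the two formulas above):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbour
    dists, idx = nn.kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]            # drop the self-neighbour
    density = 1.0 / dists.mean(axis=1)               # inverse of the mean kNN distance
    return density, idx

def relative_density(X, k=5):
    density, idx = knn_density(X, k)
    # Compare each point's density with the average density of its neighbours;
    # values well below 1 flag outlier candidates.
    return density / density[idx].mean(axis=1)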

30
Q

Cluster-Based Outlier

A

An object that doesn’t strongly belong to any cluster.

31
Q

Probabilistic Outlier

A

An object that has low probability with respect to a probabilistic model of the data.