Topic 21 Flashcards

(11 cards)

1
Q

Cluster Analysis

A

An unsupervised learning technique that aims to uncover inherent structure in the available data by partitioning the data into groups (or “clusters”) on the basis of similarity.

2
Q

Centre-Based Cluster

A

A set of points such that each point in a cluster is closer to the centre of that cluster than to the centre of any other cluster.

3
Q

Centroid / Medoid

A

Centroid: the average (mean) of all the points in a cluster / Medoid: the most representative point of a cluster (an actual data point)
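
To make the difference concrete, here is a minimal sketch in plain Python. It assumes that "most representative" means the member point with the smallest total Euclidean distance to the other points, which is one common reading and not the only one. The function names and the sample points are made up for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of the points (need not be a data point itself)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Member point with the smallest total distance to all points.

    Assumption: 'most representative' is taken to mean this criterion.
    """
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one far-away point
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

print(centroid(cluster))  # (3.25, 2.75), not one of the data points
print(medoid(cluster))    # (1.0, 1.0), always one of the data points
```

The centroid is pulled towards the far-away point, while the medoid stays on an actual member of the cluster.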

4
Q

Contiguity-Based Cluster

A

A set of points such that each point in the cluster is closer to one or more other points in that cluster than to any point not in the cluster.

5
Q

Partition Clustering

A

Division of data instances into non-overlapping subsets (clusters) such that each instance is in exactly one subset

6
Q

Hierarchical Clustering

A

A set of nested clusters organised as a hierarchical tree

7
Q

Quantifying Cluster Quality

A

We want to maximise the average between-cluster distance and minimise the average within-cluster distance.
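
One possible way to compute the two averages is sketched below in plain Python. It assumes Euclidean distance and averages over all pairs of points; other definitions (for example, distances between centroids) are also used. The function names and the toy clusters are illustrative only.

```python
import math
from itertools import combinations

def avg_within(clusters):
    """Mean distance over all pairs of points that share a cluster.

    Assumes at least one cluster has two or more points.
    """
    dists = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)

def avg_between(clusters):
    """Mean distance over all pairs of points from different clusters."""
    dists = [
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(dists) / len(dists)

# Hypothetical clustering: two tight groups that are far apart
clusters = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]

within = avg_within(clusters)    # 1.0
between = avg_between(clusters)  # roughly 10.02
print(within, between)
```

A good clustering under this criterion has a small `within` value and a large `between` value, as in the toy data here.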

8
Q

K-Means Clustering Algorithm

A

Step 1: Select k (the number of clusters).
Step 2: Randomly select k initial cluster centres (centroids).
Step 3: Calculate the distance from each data point to each centroid (e.g., Euclidean distance).
Step 4: Assign each data point to the closest centroid. For fixed centroids, this assignment minimises the within-cluster sum of squares (WCSS), i.e., the variance.
Step 5: Recalculate each centroid as the mean of the data points assigned to it in the previous step.
Repeat Steps 3-5 until a stopping condition is met (e.g., the assignments no longer change).
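
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance, picks the initial centroids from the data points, stops when assignments no longer change (or after `max_iter` rounds), and keeps the old centroid if a cluster becomes empty. The function name, parameters, and toy data are made up for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 2: randomly pick k distinct data points as initial centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Steps 3-4: compute distances and assign each point
        # to the index of its nearest centroid
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stopping condition (assumption): no assignment changed
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 5: recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments

# Step 1: choose k; the toy data has two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))
print(labels)
```

On this toy data the two centroids settle at the means of the two groups, near (0.33, 0.33) and (10.33, 10.33). On less clearly separated data, the result can depend on the random initial centroids.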

9
Q

Calculating Centroids (cluster’s mean point)

A

$m_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$, where $|C_j|$ is the number of data points in $C_j$
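
As a small worked instance of the formula, consider a cluster of three two-dimensional points (the values are made up for illustration):

```latex
C_j = \{(1,2),\ (3,4),\ (5,9)\}, \qquad |C_j| = 3

m_j = \frac{1}{3}\bigl[(1,2) + (3,4) + (5,9)\bigr]
    = \frac{1}{3}\,(9,\ 15)
    = (3,\ 5)
```

The sum is taken component by component, so the centroid has the same number of dimensions as the data points.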

10
Q

Strengths of K-Means Clustering

A

* Simple: easy to understand and to implement
* Efficient: scales linearly in the number of data points and the number of cluster centroids.

11
Q

Weaknesses of K-Means Clustering

A

* The algorithm is only applicable if the mean is defined (i.e., numeric feature dimensions and a suitable distance metric).
  * For categorical data, k-modes clustering can be used, where the centroid is represented by the most frequently occurring feature value.
* The user needs to specify k. The best choice of k is often very difficult to specify in advance.
* The algorithm is sensitive to outliers.
  * Outliers are data points that are very far away from other data points.
  * Outliers could be errors in data collection or special data points with very different values.
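
The sensitivity to outliers comes from the mean used to compute centroids. The following plain-Python sketch shows a single far-away point dragging a centroid away from the rest of its cluster. The helper function and the data values are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical tight cluster plus one outlier
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

clean = centroid(cluster)                # (1.5, 1.5)
shifted = centroid(cluster + [outlier])  # about (11.2, 11.2)

print(clean, shifted)
print(math.dist(clean, shifted))  # how far the centroid moved
```

With the outlier included, the centroid lies far outside the region occupied by the four original points, so it no longer describes them well.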
