Topic 21 Flashcards
(11 cards)
Cluster Analysis
An unsupervised learning technique, which aims to uncover inherent structure in the available data by partitioning the data into groups (or “clusters”) on the basis of similarity.
Centre-Based Cluster
A set of points such that a point in the cluster is closer to the centre of that cluster than to the centre of any other cluster
Centroid / Medoid
Centroid: the average (mean) of all the points in a cluster / Medoid: the most representative actual data point of a cluster
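A minimal NumPy sketch contrasting the two on a small hypothetical 2-D cluster (the points here are illustrative only):

```python
# Sketch contrasting centroid and medoid for a hypothetical 2-D cluster.
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
centroid = pts.mean(axis=0)  # average point; need not coincide with any data point
total_dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=2).sum(axis=1)
medoid = pts[total_dists.argmin()]  # actual data point with the smallest total distance to the others
```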
Contiguity-Based Cluster
A set of points such that a point in the cluster is closer to one or more other points in that cluster than to any point not in the cluster
Partition Clustering
Division of data instances into non-overlapping subsets (clusters) such that each instance is in exactly one subset
Hierarchical Clustering
A set of nested clusters organised as a hierarchical tree
Quantifying Cluster Quality
We want to maximise the average between-cluster distance and minimise the average within-cluster distance
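A minimal sketch of how these two quantities could be computed, assuming NumPy arrays of points, a cluster label per point, and one centroid per cluster (names are illustrative):

```python
# Sketch of average within-cluster and between-cluster distances.
import numpy as np

def cluster_quality(X, labels, centroids):
    # average within-cluster distance: each point to its own centroid
    within = np.mean([np.linalg.norm(x - centroids[l]) for x, l in zip(X, labels)])
    # average between-cluster distance: all pairs of distinct centroids
    k = len(centroids)
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(k) for j in range(i + 1, k)])
    return within, between  # good clusterings: small `within`, large `between`
```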
K-Means Clustering Algorithm
Step 1: select k (number of clusters)
Step 2: randomly select k initial cluster centers (or cluster centroids)
Step 3: calculate the distance from each data point to each cluster center (e.g., Euclidean distance)
Step 4: assign each data point to the closest cluster center (centroid) – this assignment minimises the within-cluster sum of squares (WCSS), i.e., the variance
Step 5: calculate new centroids as the mean of the data points assigned to each cluster in the previous step.
Repeat Steps 3-5 until a stopping condition is met (e.g., the centroids no longer change).
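A minimal sketch of these steps, assuming numeric data in a NumPy array X of shape (n_samples, n_features); this is an illustration, not a library implementation:

```python
# Minimal k-means sketch following Steps 1-5 above.
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Step 5: new centroid = mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `labels, centroids = k_means(X, k=3)` assigns each row of X to one of three clusters.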
Calculating Centroids (cluster’s mean point)
m_j = (1/|C_j|) * Σ_{x_i ∈ C_j} x_i, where |C_j| is the number of data points in C_j
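A quick check of the formula on a hypothetical cluster of three 2-D points:

```python
# Centroid of a hypothetical cluster C_j: the feature-wise mean of its points.
import numpy as np

C_j = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
m_j = C_j.sum(axis=0) / len(C_j)   # equals C_j.mean(axis=0) -> [3.0, 4.0]
```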
Strengths of K-Means Clustering
* Simple: easy to understand and to implement
* Efficient: scales linearly in the number of data points and the number of cluster centroids
Weaknesses of K-Means Clustering
* The algorithm is only applicable if the mean is defined (i.e., numeric feature dimensions and a suitable distance metric)
* For categorical data, k-modes clustering can be used, where the centroid is represented by the most frequently occurring feature value
* The user needs to specify k; the best choice of k is often very difficult to determine in advance
* The algorithm is sensitive to outliers
* Outliers are data points that are very far away from other data points
* Outliers could be errors in data collection or special data points with genuinely very different values