Topic 21 Flashcards
(11 cards)
Cluster Analysis
An unsupervised learning technique, which aims to uncover inherent structure in the available data by partitioning the data into groups (or “clusters”) on the basis of similarity.
Centre-Based Cluster
A set of points such that a point in the cluster is closer to the centre of that cluster than to the centre of any other cluster
Centroid / Medoid
Centroid: the average (mean) of all the points in a cluster / Medoid: the most representative actual data point of a cluster
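A minimal NumPy sketch contrasting the two on a small hypothetical 2-D cluster (the points here are illustrative only):

```python
# Sketch contrasting centroid and medoid for a hypothetical 2-D cluster.
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
centroid = pts.mean(axis=0)  # average point; need not coincide with any data point
total_dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=2).sum(axis=1)
medoid = pts[total_dists.argmin()]  # actual data point with the smallest total distance to the others
```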
Contiguity-Based Cluster
A set of points such that a point in the cluster is closer to one or more other points in that cluster than to any point not in the cluster
Partition Clustering
Division of data instances into non-overlapping subsets (clusters) such that each instance is in exactly one subset
Hierarchical Clustering
A set of nested clusters organised as a hierarchical tree
Quantifying Cluster Quality
We want to maximise the average between-cluster distance and minimise the average within-cluster distance
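A minimal sketch of how these two quantities could be computed, assuming NumPy arrays of points, a cluster label per point, and one centroid per cluster (names are illustrative):

```python
# Sketch of average within-cluster and between-cluster distances.
import numpy as np

def cluster_quality(X, labels, centroids):
    # average within-cluster distance: each point to its own centroid
    within = np.mean([np.linalg.norm(x - centroids[l]) for x, l in zip(X, labels)])
    # average between-cluster distance: all pairs of distinct centroids
    k = len(centroids)
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(k) for j in range(i + 1, k)])
    return within, between  # good clusterings: small `within`, large `between`
```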
K-Means Clustering Algorithm
Step 1: select k (number of clusters)
Step 2: randomly select k initial cluster centers (or cluster centroids)
Step 3: calculate the distance from each data point to each cluster center (e.g., Euclidean distance)
Step 4: assign each data point to the closest cluster center (centroid) – this assignment minimises the within-cluster sum of squares (WCSS), i.e., the variance
Step 5: calculate new centroids as the mean of the data points assigned to each cluster in the previous step.
Repeat Steps 3-5 until a stopping condition is met (e.g., the centroids no longer change).
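A minimal sketch of these steps, assuming numeric data in a NumPy array X of shape (n_samples, n_features); this is an illustration, not a library implementation:

```python
# Minimal k-means sketch following Steps 1-5 above.
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Step 5: new centroid = mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For example, `labels, centroids = k_means(X, k=3)` assigns each row of X to one of three clusters.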
Calculating Centroids (cluster’s mean point)
m_j = (1/|C_j|) * Σ_{x_i ∈ C_j} x_i, where |C_j| is the number of data points in C_j
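A quick check of the formula on a hypothetical cluster of three 2-D points:

```python
# Centroid of a hypothetical cluster C_j: the feature-wise mean of its points.
import numpy as np

C_j = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
m_j = C_j.sum(axis=0) / len(C_j)   # equals C_j.mean(axis=0) -> [3.0, 4.0]
```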
Strengths of K-Means Clustering
* Simple: easy to understand and to implement
* Efficient: scales linearly in the number of data points and the number of cluster centroids
Weaknesses of K-Means Clustering
* The algorithm is only applicable if the mean is defined (i.e., numeric feature dimensions and a suitable distance metric)
* For categorical data, k-modes clustering can be used, where the centroid is represented by the most frequently occurring feature value
* The user needs to specify k; the best choice of k is often very difficult to determine in advance
* The algorithm is sensitive to outliers
* Outliers are data points that are very far away from other data points
* Outliers could be errors in data collection or special data points with genuinely very different values