Clustering Flashcards
(13 cards)
What is clustering?
Unsupervised learning
Dimensionality reduction
Finds hidden structure in unlabelled data
Why cluster?
Detect outliers
Simplify data
Visualise data
What is the goal of clustering?
Maximise intra-cluster similarity
Minimise inter-cluster similarity
Clustering vs Classification
Classification: discriminate between known groups based on attributes
Clustering: discover the groups, and the attributes that discriminate them, from unlabelled data
What are the 2 types of clustering?
1. Partitional
2. Hierarchical
Partitional?
Division of data into non-overlapping clusters
Example: K-means clustering
Hierarchical?
Division of data into nested clusters, organised as a tree
Visualised as a dendrogram
Agglomerative - bottom up
Divisive - top down
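The bottom-up (agglomerative) idea can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the card: the data is 1-D, cluster distance is single linkage (distance between the closest members), and the function name `agglomerative` is hypothetical.

```python
def agglomerative(values):
    """Bottom-up single-linkage clustering of 1-D numbers; returns the merge history."""
    # Start with every value in its own cluster.
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters with their union.
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


history = agglomerative([1, 2, 10, 12, 30])
```

The merge history, read in order, is what a dendrogram draws from the leaves up to the root.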
What is K-means clustering?
Partitional clustering
Number of clusters K must be chosen in advance - disadvantage
Easy to implement and quick - advantage
What is the K-means algorithm?
Choose K initial centroids (e.g. K random points)
Repeat until convergence:
  For each point x:
    Find nearest centroid c - Euclidean distance
    Assign x to c
  For each cluster c:
    Recalculate centroid as average of all associated points
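The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the initialisation (K random data points as starting centroids), the iteration cap `max_iters`, the `seed` parameter, and the function name `kmeans` are all assumptions that are not on the card.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumed initialisation: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid (Euclidean).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so stop
        assignments = new_assignments

        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; which group gets label 0 depends on the random start.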
What are the convergence criteria?
No (or minimal) point reassignments
No (or minimal) change in centroids
No (or minimal) change in SSE
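One way to express these stopping checks in code is sketched below. The function name `has_converged`, the threshold `tol`, the `max_moves` allowance, and the choice to stop as soon as any one criterion holds are assumptions made for illustration.

```python
import math


def has_converged(old_labels, new_labels, old_centroids, new_centroids,
                  old_sse, new_sse, tol=1e-4, max_moves=0):
    """Return True if any of the three stopping criteria is met."""
    # 1. No (or at most max_moves) points changed cluster.
    moves = sum(a != b for a, b in zip(old_labels, new_labels))
    # 2. No centroid moved further than tol.
    shift = max(math.dist(a, b) for a, b in zip(old_centroids, new_centroids))
    # 3. SSE changed by no more than tol.
    sse_change = abs(old_sse - new_sse)
    return moves <= max_moves or shift <= tol or sse_change <= tol


stable = has_converged([0, 1], [0, 1], [(0, 0), (5, 5)], [(0, 0), (5, 5)], 2.0, 2.0)
moving = has_converged([0, 1], [1, 1], [(0, 0), (5, 5)], [(1, 1), (4, 4)], 10.0, 5.0)
```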
What is SSE?
Sum of squared errors
Calculates the sum of squared distances between points in a cluster and the centroid of said cluster
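The definition above can be written as a short function. This is a sketch that assumes Euclidean distance and a `labels` list giving each point's cluster index; the name `sse` and the toy data are illustrative only.

```python
import math


def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))


points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centroids = [(1, 0), (11, 0)]
labels = [0, 0, 1, 1]
total = sse(points, centroids, labels)  # every point is 1 unit from its centroid
```

Here each of the four points contributes a squared distance of 1, so the total is 4.0. A lower SSE means tighter clusters.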
Benefits of K-means?
Simple
Fast - O(TKN) for T iterations, K clusters, N points
Always converges (to a local optimum, not necessarily the global one)
Disadvantages of K-means?
Need to specify k in advance
Only applicable if mean is defined
(For categorical data, centroids can be represented by the mode)
Sensitive to outliers
Unsuitable for clusters that are not hyper-ellipsoids/hyper-spheres
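The sensitivity to outliers listed above follows from the centroid being a mean. A small sketch with made-up numbers (the helper name `centroid` is hypothetical):

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
without_outlier = centroid(cluster)                    # (2.0, 2.0)
with_outlier = centroid(cluster + [(100.0, 100.0)])    # (26.5, 26.5)
```

A single distant point drags the centroid far from the three points that actually form the group.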