04. K-Means Clustering Flashcards

1
Q

What is clustering?

A

Clustering is an unsupervised technique: labels are not prescribed in advance. It is exploratory analysis that does not predict; instead it determines the objects of interest and how best to group them. Clustering methods find similarities between objects according to their attributes and group similar objects into clusters.

2
Q

What does k-means clustering do?

A

K-means clustering partitions n objects into k clusters in which each object belongs to the cluster with the nearest mean.

3
Q

List use cases for k-means clustering

A

Image Processing, Medical Attributes, Customer Segmentation

4
Q

List the steps of the k-means algorithm

A
  1. Choose/calculate the number of clusters k
  2. Select k points at random as the initial cluster centres
  3. Calculate the distance between each object and every centre (in n dimensions)
  4. Create clusters by assigning each object to its closest centre
  5. Calculate the centroid (mean) of all objects in each cluster
  6. Repeat steps 3, 4 & 5 until the same points are assigned to each cluster in consecutive iterations
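
The steps listed in this card can be sketched in a few lines of R. This is an illustrative sketch only, not how R's built-in kmeans() is implemented; the function name simple_kmeans and the simulated points are assumptions made up for the example.

```r
# Minimal k-means sketch (illustrative; use stats::kmeans() for real work).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Pick k distinct rows at random as the initial centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every object to every centre (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
    # Assign each object to its closest centre
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when the assignments are the same as in the previous iteration
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recompute each centre as the mean of the objects assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centres)
}

# Hypothetical data: two well-separated groups of 20 points each
set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

Squared distances are used for the assignment because they rank the centres in the same order as the plain distances and avoid an unnecessary square root.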
5
Q

Name the distance measure used in k-means clustering and the theorem it is based on

A

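The Euclidean distance, which is based on the Pythagorean theorem. In two dimensions the distance is SQRT(x^2 + y^2), where x and y are the differences between the two points along each axis; in three dimensions this becomes SQRT(x^2 + y^2 + z^2), and so on for more dimensions.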
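
As a quick check of the formula, the distance can be computed directly in R. The helper name euclidean and the sample points are assumptions chosen for illustration.

```r
# Euclidean distance between two points given as numeric vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))

d2 <- euclidean(c(0, 0), c(3, 4))        # two dimensions: sqrt(9 + 16) = 5
d3 <- euclidean(c(0, 0, 0), c(1, 2, 2))  # three dimensions: sqrt(1 + 4 + 4) = 3

# Base R's dist() computes the same measure between the rows of a matrix
d_base <- dist(rbind(c(0, 0), c(3, 4)), method = "euclidean")
```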

6
Q

In R, what is the syntax for k-means clustering?

A

OutputDataFrame = kmeans(InputDataFrame, 3)
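
A runnable version of this call might look as follows. The use of the built-in iris data set is an assumption made for the example; any data frame of numeric columns would do.

```r
# iris ships with base R; columns 1 to 4 are numeric measurements
InputDataFrame <- iris[, 1:4]

# k-means starts from randomly chosen centres, so fix the seed for repeatability
set.seed(1)
OutputDataFrame <- kmeans(InputDataFrame, 3)

# Despite the variable name, the result is a model object of class "kmeans",
# not a data frame
class(OutputDataFrame)
```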

7
Q

In k-means clustering, what does WSS stand for?

A

Within Sum of Squares. WSS is used to assess whether it is worth increasing or decreasing the chosen number of clusters: a lower WSS means the objects sit closer to their cluster centres.
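
To make the definition concrete, WSS can be computed by hand and compared with the value R reports. This is a sketch that assumes the built-in iris data as example input.

```r
x <- as.matrix(iris[, 1:4])
set.seed(1)
fit <- kmeans(x, centers = 3)

# Squared distance from every object to the centre of its own cluster, summed
wss_by_hand <- sum((x - fit$centers[fit$cluster, ])^2)

# kmeans() reports the same total in $tot.withinss
all.equal(wss_by_hand, fit$tot.withinss)
```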

8
Q

In k-means clustering, when looking at a “Within Sum of Squares” (WSS) graph, how do you determine the optimal number of clusters?

A

At clusters = 1 the WSS is high. It falls rapidly as the number of clusters increases, but the slope then flattens out. Look for the elbow where the slope flattens.
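
One way to draw such a graph in R is sketched below. The iris data, the range of k from 1 to 10, and nstart = 20 are all assumptions chosen for illustration.

```r
x <- scale(iris[, 1:4])  # example data, standardised
set.seed(1)

# Total within-cluster sum of squares for each candidate number of clusters
ks <- 1:10
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# The elbow is where the curve stops dropping steeply
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total WSS")
```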

9
Q

Which three things remain as choices for the user to define for a k-means clustering?

A

What object attributes should be included in the analysis?
What unit of measure (for example, miles or kilometers) should be used for each attribute?
Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?
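
The effect of units and rescaling can be seen in a small R sketch. The customer data below is hypothetical and invented for illustration.

```r
# Hypothetical data: the same distances in kilometres and in miles
km    <- data.frame(distance = c(1, 2, 150, 160), age = c(25, 60, 30, 58))
miles <- data.frame(distance = km$distance * 0.621371, age = km$age)

# Unscaled, the pairwise distances depend on the unit chosen
dist(km)
dist(miles)

# scale() centres each column on 0 and divides by its standard deviation,
# so the choice of unit no longer matters
scaled_km    <- scale(km)
scaled_miles <- scale(miles)
all.equal(scaled_km, scaled_miles, check.attributes = FALSE)
```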

10
Q

How can you improve the underlying data set for a k-means clustering?

A

Reduce the number of attributes where possible. Check for strong correlations between attributes (draw a scatter plot and look for linear relationships) and either remove or combine the correlated attributes (for example, as a ratio).
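
A possible way to carry out this check in R is sketched here, assuming the built-in iris data. Whether to replace the two petal measurements with a ratio is a judgment call; the Petal.Ratio column is a made-up name for the example.

```r
x <- iris[, 1:4]

# Correlation matrix and scatter plot matrix of all attribute pairs
round(cor(x), 2)
pairs(x)

# Petal.Length and Petal.Width are strongly correlated in this data set,
# so one option is to combine them into a single ratio attribute
petal_cor <- cor(x$Petal.Length, x$Petal.Width)
reduced <- data.frame(
  Sepal.Length = x$Sepal.Length,
  Sepal.Width  = x$Sepal.Width,
  Petal.Ratio  = x$Petal.Length / x$Petal.Width
)
```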

11
Q

Why does R run k-means multiple times to answer one question?

A

The randomly chosen starting centroids can affect the answer, so you need to run the algorithm a number of times with different starting points and keep the best result.
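
The sensitivity to starting points can be explored with a short R sketch. The iris data, the 25 seeds, and nstart = 25 are assumptions chosen for illustration.

```r
x <- scale(iris[, 1:4])

# Run k-means 25 times, each from a different random start,
# and record the total within-cluster sum of squares of each run
runs <- sapply(1:25, function(s) {
  set.seed(s)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})
range(runs)  # a spread here means some starts ended in worse solutions

# The nstart argument of kmeans() does the same internally and keeps the best run
set.seed(1)
multi <- kmeans(x, centers = 3, nstart = 25)
multi$tot.withinss
```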

12
Q

What types of data does k-means handle well?

A

Numerical data, but not categorical data (ordinal, nominal, binomial)

13
Q

What is the R syntax for reviewing a k-means clustering model?

A

OutputDataFrame…
…$centers: cluster means
…$size: number of objects per cluster
…$cluster: vector giving the cluster each object belongs to
…$betweenss: the between-cluster sum of squares
…$withinss: the within-cluster sum of squares (one value per cluster)
…$totss: the total sum of squares (sum of withinss + betweenss)
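
The components can be inspected on a fitted model as in the sketch below, which assumes the built-in iris data and uses fit as a stand-in name for the model object.

```r
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

fit$centers        # cluster means, one row per cluster
fit$size           # number of objects in each cluster
head(fit$cluster)  # cluster number assigned to each object
fit$withinss       # within-cluster sum of squares, one value per cluster
fit$betweenss      # between-cluster sum of squares
fit$totss          # total sum of squares

# The total splits into the within and between parts
all.equal(fit$totss, sum(fit$withinss) + fit$betweenss)
```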
