Clustering Flashcards

1
Q

Clustering Criteria

A

1) Distance
2) conceptual (shared attributes)
3) density

2
Q

Applications of Clustering

A

Pattern recognition, spatial data analysis, image processing, economic science (market research), WWW

3
Q

Good clustering characteristics (optimized)

A

High intra-class similarity - minimize intra-cluster distance

Low inter-class similarity - maximize inter-cluster distance

4
Q

Major clustering approaches

A
  • partitioning algorithms (k-means)
  • hierarchical methods - cluster tree
  • density based
5
Q

Partitioning Clustering

A

Must define the number of clusters you want
Global optimum - exhaustively enumerates all possible partitions (intractable in practice)
Heuristic methods - k-means / k-medoids

6
Q

Hierarchical agglomerative clustering

A

Every point starts as its own cluster. At each consecutive step the two closest clusters merge, until only one big cluster is left at the top. It produces a dendrogram that you can cut at any height. The height of the bars indicates how close items are.

agglomerative = bottom up
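As a toy illustration of the bottom-up process, here is a brute-force single-linkage sketch on 1-D points (the function name and return format are my own choices, not a standard API; real code would use a library such as SciPy):

```python
def single_linkage(points):
    """Bottom-up single-linkage merges; returns (height, cluster_a, cluster_b) per merge."""
    clusters = [[p] for p in points]           # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the two clusters whose closest members are nearest (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i][:], clusters[j][:]))
        clusters[i] += clusters[j]             # merge j into i
        del clusters[j]
    return merges                              # merge heights are the dendrogram bar heights

merges = single_linkage([0, 1, 5, 6])
```

Cutting the dendrogram below the tallest merge here would leave two clusters, {0, 1} and {5, 6}.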

7
Q

Similarity measures in clustering

A

Distance based - Euclidean, Manhattan, Minkowski

Correlation distance - the degree to which variables are related
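Euclidean and Manhattan distance are both special cases of the Minkowski distance (p = 2 and p = 1 respectively); a small sketch (function name is illustrative):

```python
def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))  # Manhattan: 7.0
print(minkowski(a, b, 2))  # Euclidean: 5.0
```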

8
Q

Inter cluster similarity

A

Inter - between clusters.

Min, max, group average, distance between centroids, or some other novel measure

9
Q

Hierarchical clustering issues

A

Distinct clusters are not produced

Methods to cut the dendrogram into clusters exist, but they are somewhat arbitrary

If original data does not have a hierarchical structure, it may be the completely wrong fit

10
Q

K Means clustering algorithm

A

Step 0: Start with a random partition into k clusters (pick k data points as the starting cluster centers)

Step 1: Generate a new clustering by assigning each data point to its closest cluster center

Step 2: Compute new centroids

Step 3: Repeat steps 1 and 2 until there is no change in membership
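The steps above can be sketched in plain Python (the 2-D tuple representation, seeding, and function name are my own assumptions; real code would typically use a library implementation):

```python
import random

def kmeans(points, k, seed=0):
    """Sketch of Lloyd's algorithm on 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 0: k data points as initial centers
    assignment = None
    while True:
        # Step 1: assign each point to its closest cluster center
        new_assignment = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:       # Step 3: stop when membership is stable
            return centroids, new_assignment
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two tight blobs end up in separate clusters regardless of which points are drawn as initial centers.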

11
Q

K-means - cluster count optimization

A

Elbow plot

Plot the reduction in variation (total within-cluster variance) on the y-axis against the number of clusters on the x-axis. Each added cluster reduces variation by less and less; pick k at the "elbow," where the curve flattens out.

12
Q

Properties of K-means

A

Guaranteed to converge
Guaranteed to reach a local optimum, but not necessarily the global optimum

Pros
Low Complexity

Cons
Must specify K
Sensitive to noise / outliers
Clusters sensitive to initial random points chosen - can result in different clusters

13
Q

Density based clustering method

A

Local cluster criterion

Major features:
can discover clusters of arbitrary shape
handles noisy data well
needs only one scan of the data
needs density parameters as a termination condition

14
Q

DBSCAN Process

A

Define neighborhood distance N
Define minpts (the density required for a point to be a core point)

Categorize points as:
Core: has at least minpts points in its N-neighborhood
Border: has at least one core point in its neighborhood, but does not itself meet the minpts criterion
Noise: all other points

Do a DFS from each unassigned core point and assign every reachable point to the same cluster
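The categorization and DFS expansion could be sketched as follows (pure Python; `eps` stands in for the neighborhood distance N, and all names are illustrative rather than a standard API):

```python
def dbscan(points, eps, minpts):
    """Sketch of DBSCAN on points given as coordinate tuples."""
    def neighbors(i):
        # indices of all points within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    # Core points: at least minpts points in their eps-neighborhood
    core = {i for i in range(len(points)) if len(neighbors(i)) >= minpts}
    labels = [None] * len(points)              # None = unvisited; stays None for noise
    cluster = 0
    for i in core:
        if labels[i] is not None:
            continue
        stack = [i]                            # DFS expansion from an unlabeled core point
        while stack:
            j = stack.pop()
            if labels[j] is None:
                labels[j] = cluster
                if j in core:                  # only core points expand the cluster;
                    stack.extend(neighbors(j)) # border points are absorbed but don't expand
        cluster += 1
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)], eps=1.5, minpts=3)
```

In the example call, the three points near the origin form one dense cluster, while the remaining points lack enough neighbors to be core or border and stay labeled as noise.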

15
Q

Density Clustering Drawbacks

A

Very sensitive to user choices of N and minpts
