Clustering FINAL Flashcards

(123 cards)

1
Q

Cluster analysis divides data into

A

groups (clusters) that are meaningful, useful, or both

2
Q

Objects in a cluster

A

share the same characteristics

3
Q

What fields is clustering used in?

A

A variety of fields, e.g., health and medicine, business, and computer networking

4
Q

How is clustering used in health and medicine?

A

Cluster patients according to symptoms presented upon evaluation

5
Q

How is clustering used in business?

A

Cluster stores according to sales/customer satisfaction/…

6
Q

How is clustering used in computer networking?

A

Cluster traffic according to application type

7
Q

What does it mean that clusters are meaningful?

A

Clusters should capture the natural structure of the data

8
Q

What does it mean that clusters are useful?

A

Using the cluster prototype, clusters can be used as a starting point for data summarization, compression (vector quantization), classification (nearest neighbor)

9
Q

Clustering groups data objects based on

A

information found in the data that describes the objects and their relationships

10
Q

What type of learning is clustering

A

Unsupervised learning

11
Q

Clustering goal

A

Objects within a cluster should be similar to one another, but different from objects in other clusters

12
Q

Is the notion of a cluster well defined?

A

No

13
Q

Does clustering have to know exactly what it is sorting

A

No; it can sort things like pennies, nickels, and dimes without knowing how much each is worth

14
Q

Partitional clustering

A

Divide objects into non-overlapping clusters such that each object belongs to exactly one cluster

15
Q

Hierarchical clustering

A

Clusters can have subclusters

16
Q

Exclusive Clustering

A

1:1 relationship between object and cluster

17
Q

Why do we need a hyperparameter for clustering?

A

It tells us how many clusters we expect from the dataset

18
Q

Overlapping clustering

A

1:n relationship between object and cluster; an object can belong to > 1 cluster

19
Q

Fuzzy clustering

A

n:n relationship, all objects belong to all clusters with a certain probability (or membership weight)

20
Q

In fuzzy clustering, each object’s probability of belonging to all clusters should sum to

A

1.0

21
Q

Complete clustering

A

Assign every point to at least one cluster

22
Q

Partial clustering

A

Some objects may not be assigned to any cluster

23
Q

What might some objects not assigned to a cluster represent?

A

Noise or outliers

24
Q

Well-separated clusters

A

Each point is closer to all of the points in its cluster than to any point in another cluster

25
Center-based clusters
Each point is closer to the center of its cluster than to the center of any other cluster
26
Density based clusters
identifies clusters as regions of high data point density separated by regions of low density. It is useful for discovering clusters of arbitrary shapes and can handle noise effectively.
27
Conceptual clusters
Points in a cluster share some general property that derives from the entire set of points
28
K-means clustering definition
A prototype-based, partitional clustering technique that finds patterns in the data to create K clusters
29
Hierarchical clustering
Build a hierarchy of clusters starting from singleton clusters
30
DBSCAN
Density based clustering
31
K-means clustering process
Takes n observations and partitions them into k clusters
32
In k means clustering, relationship between k and n
k < n (fewer clusters than observations)
33
What type of clustering scheme is K-means
Prototype-based
34
How to use supervised techniques from unsupervised learning
Derive labels from unsupervised learning, then create a data set with the labels
35
Meaning of prototype based clustering scheme
It has a notion of a centroid (called a prototype in the literature), and we want every point to be closer to its own cluster's centroid than to any other centroid
36
How difficult is k means clustering?
Computationally difficult (NP-hard)
37
Prerequisites of K-means algorithm
Attributes must be numeric; attributes must be standardized (scaled)
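As a concrete illustration of the standardization prerequisite, a minimal sketch assuming scikit-learn is available; the feature matrix X is a made-up example.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[170.0, 65.0], [180.0, 90.0], [160.0, 50.0]])  # hypothetical numeric attributes
    X_scaled = StandardScaler().fit_transform(X)                 # each column rescaled to mean 0, std 1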
38
K-means algorithm
1. Select K points as initial centroids
2. Form K clusters by assigning each point to its closest centroid
3. Recompute the centroid of each cluster
4. Repeat steps 2-3 until the centroids do not change
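A minimal NumPy sketch of these four steps, assuming Euclidean distance and a numeric, standardized data matrix X; it is an illustration of the loop, not a production implementation (empty clusters, for example, are not handled).

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Select K points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Assign each point to its closest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute the centroid of each cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Stop when the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids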
39
How to compute distances for k-means clustering
Euclidean distance or Manhattan distance
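For illustration, both distance measures on a pair of made-up points:

    import numpy as np

    p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    euclidean = np.sqrt(((p - q) ** 2).sum())   # sqrt(3^2 + 4^2) = 5.0
    manhattan = np.abs(p - q).sum()             # |3| + |4| = 7.0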
40
When do we stop algorithm for K means clustering
When an iteration leaves the SSE unchanged from the previous iteration, we know we have converged and the SSE has been minimized
41
Does K-means always converge?
Yes, it will find a minimum (perhaps not the global one) because the SSE never increases
42
How to choose initial centroids?
Random selection, or selecting a point in each natural cluster
43
Random selection of initial centroid problem
May lead to suboptimal clustering
44
Best way to initially choose centroids
Choose well-separated centroids, ideally one within each natural cluster of the space
45
Bisecting K-means goal
Form best (optimal) clusters
46
Bisecting K-means idea
To obtain K clusters, split the set of all points into two clusters, select one of the clusters to split, and so on until you have k clusters
47
Which cluster to split?
The largest one or the one with largest error (SSE)
48
Bisecting K-means algorithm
1. Initialize the list of clusters to contain the cluster consisting of all points
2. Remove a cluster from the list of clusters
3. Bisect the selected cluster using basic K-means
4. Select the two clusters from the bisection with the lowest total SSE and add them to the list of clusters
5. Repeat until the list of clusters contains K clusters
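A rough sketch of this loop under stated assumptions: Euclidean SSE, splitting the cluster with the largest SSE, and a kmeans(X, k, seed) function like the earlier sketch; the name bisecting_kmeans and the trial count are illustrative.

    import numpy as np

    def cluster_sse(points):
        # Sum of squared distances of the points to their own centroid
        return ((points - points.mean(axis=0)) ** 2).sum()

    def bisecting_kmeans(X, k, trials=5):
        clusters = [X]                                        # start with one cluster of all points
        while len(clusters) < k:
            i = int(np.argmax([cluster_sse(c) for c in clusters]))
            to_split = clusters.pop(i)                        # remove the cluster with the largest SSE
            best_parts, best_sse = None, np.inf
            for t in range(trials):                           # bisect with basic 2-means, keep lowest total SSE
                labels, _ = kmeans(to_split, 2, seed=t)
                parts = [to_split[labels == 0], to_split[labels == 1]]
                total = sum(cluster_sse(p) for p in parts if len(p) > 0)
                if total < best_sse:
                    best_parts, best_sse = parts, total
            clusters.extend(best_parts)                       # add the two resulting clusters to the list
        return clusters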
49
How to choose the k in K-means?
We want the total within cluster sum of squares to be minimal
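One common way to act on this is an elbow plot: compute the total within-cluster sum of squares for a range of K values and pick the K where the curve bends. A sketch assuming scikit-learn and matplotlib, with X a scaled data matrix as in the earlier sketches.

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    ks = range(1, 11)
    wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, wss, marker="o")    # look for the "elbow" where the decrease levels off
    plt.xlabel("K")
    plt.ylabel("Total within-cluster SS")
    plt.show()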
50
Clustering should exhibit
Low intra-cluster SS and high inter-cluster SS
51
Low intra-cluster SS
Cohesion
52
High inter-cluster SS
Separation
53
Within cluster SS
Sum of squared distances from each point in the cluster to the cluster centroid
54
Total SS
Sum of squared distances from each point in the dataset to the global mean of the data
55
Between SS
Total SS - total within SS
56
Variance of clustering
Ratio of between SS/Total SS (between 0.0 and 1.0)
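A sketch of how these quantities relate, computed from a fitted K-means model (assuming scikit-learn and a NumPy data matrix X):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()    # SS of every point to the global mean
    within_ss = km.inertia_                         # SS of every point to its cluster centroid
    between_ss = total_ss - within_ss               # Between SS = Total SS - Within SS
    ratio = between_ss / total_ss                   # between 0.0 and 1.0; higher = more variance explained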
57
What does a high ratio of between SS/Total SS mean

More of the variance is explained by the clusters
58
The higher the variance
the better, because that percentage of the variance is explained by the clustering
59
How does K-means handle outliers
Not gracefully, it will try to include these in a cluster
60
What can help K-means
Outlier detection and removal prior to clustering
61
in Kmeans how many points are assigned to a cluster?
Every one
62
Kmeans only creates what type of clusters
Globular clusters (round)
63
K-means may result in an
empty cluster
64
How to handle empty cluster problem
Choose the point that is farthest away from any centroid (essentially assigning an outlier to the empty cluster), or choose a replacement centroid at random from the cluster that has the highest SSE and split it
65
K means has problems when clusters are of differing
sizes, densities, or non-globular shapes
66
K-means complexity
Space: O((m+K)n); Time: O(I*K*m*n), where m = number of points, n = number of attributes, K = number of clusters, and I = number of iterations
67
What might outliers lead to in k means
Higher SSE and non-representative cluster centroids
68
What are hierarchical clustering inspired by?
The area of taxonomy, where hierarchical structures are common and elements under the same hierarchy automatically constitute a cluster
69
What is not required of hierarchical clustering unlike k means
Choosing K a priori. You look at the hierarchy it creates and then choose a K according to your understanding of the problem
70
Is not having to choose a k a-prior an advantage or disadvantage?
Both
71
Two approaches to hierarchical clustering
Agglomerative (bottom-up) and divisive (top-down)
72
Agglomerative
Each point starts off as individual cluster, and at each step, merge closest pairs of clusters (Need cluster proximity metric)
73
Divisive
All points in one cluster, at each step, split a cluster until singleton clusters of individual points remain (Which cluster to split and how to do the splitting)
74
How is hierarchical clustering depicted
Depicted as a tree-like structure called a dendrogram
75
Dendrogram y-axis and x-axis
The y-axis is labeled with the proximity between clusters; the x-axis shows the labels of the data points
76
How to interpret dendrogram
Look for the closest pair of clusters (in terms of squared distance) and join them. Continue until one cluster is left
77
How to pick clusters in dendrogram
Draw horizontal line that intersects k lines, k being the number of clusters you want
78
Hierarchical clustering algorithm
1. Compute the proximity matrix, if necessary
2. Merge the closest two clusters
3. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
4. Repeat until only one cluster remains
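In practice the agglomerative procedure is usually run through a library. A sketch assuming SciPy and a data matrix X: linkage builds the merge history from the proximities, dendrogram draws the tree, and fcluster cuts it into a chosen number of clusters (the horizontal-line idea from the earlier card).

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from scipy.spatial.distance import pdist

    D = pdist(X, metric="euclidean")                 # pairwise proximities (condensed distance matrix)
    Z = linkage(D, method="average")                 # repeatedly merge the closest clusters; also "single", "complete"
    dendrogram(Z)                                    # y-axis shows the proximity at which clusters merge
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters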
79
3 ways for merge to be done in hierarchical clustering algorithm
MIN (single link), MAX (complete link), Group average
80
MIN (single link)
Defines the distance between two clusters as the shortest distance between any pair of points in the two clusters
81
MAX (complete Link)
Measures the distance between two clusters as the greatest distance between any pair of points in the two clusters
82
Group Average
Defines the distance between two clusters as the average of all pairwise distances between points in the two clusters
83
Which merging is sensitive to outliers, which one is not sensitive to outliers
Sensitive: MIN (single link); not sensitive: MAX (complete link)
84
Hierarchical clustering complexity
Space: O(m^2) (proximity matrix); Time: O(m^2 log m) using a heap to find the closest pair of clusters
85
Does scaling matter in hierarchical clustering
Yes, if attributes have widely differing ranges
86
What dissimilarity measure should be used in HC?
Euclidean or Manhattan
87
What linkage to be used in HC?
Same as before: MIN, MAX, or group average, depending on whether your data is susceptible to outliers
88
Unlike k-means, merging decisions are
final; observations cannot be reassigned to a different cluster later
89
Even though clustering in general is considered an unsupervised learning method,
it can be used in supervised learning mode
90
Clusters indicate a _ and each object in a certain cluster _
class label, belongs to that class
91
Error measures for supervised clustering are
the same as classification
92
What is DBSCAN?
A density-based clustering algorithm parametrized by a radius (neighborhood) and a number of neighboring points
93
e-Neighbourhood
Objects within a radius e from a source object
94
Density (DBSCAN)
If e-neighborhood of a source point (object) contains at least MinPts other points (object), then the source point is in a "high-density" area
95
DBSCAN divides all points into
Core points, border points, and noise points
96
Core point
A point that has at least MinPts points within its e-neighborhood
97
Border point
A point that is not a core point, but is in the neighborhood of a core point
98
Noise point
Points that are neither core nor border points
99
Core, border, and noise points: the idea
Pick a point and draw a circle of radius e around it. If at least MinPts points fall within the circle, the picked point is a core point. A point that falls within a core point's circle but is not itself a core point is a border point. Points that are neither are noise points.
100
Density-based clustering: dbscan algorithm steps
1. Label all points as core, border, or noise points
2. Eliminate all noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points
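A sketch using scikit-learn's DBSCAN with illustrative parameter values; eps is the radius and min_samples plays the role of MinPts.

    import numpy as np
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=0.5, min_samples=4).fit(X)   # eps = neighborhood radius, min_samples = MinPts
    labels = db.labels_                          # cluster index per point; -1 marks noise points
    is_core = np.zeros(len(X), dtype=bool)
    is_core[db.core_sample_indices_] = True      # True for core points; remaining non-noise points are border points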
101
How to pick epsilon
Graph the distance of each point in the data set to its nearest 5 neighbors, across all points in the data set. Sort these distances and look at the curve; choose the epsilon value where the curve starts to rise sharply.
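A sketch of that plot, assuming scikit-learn and matplotlib, using the 5 nearest neighbors as in the card:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    k = 5
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # +1: each point is its own closest neighbor
    k_dist = dists[:, 1:].mean(axis=1)          # average distance to the 5 nearest neighbors
    plt.plot(np.sort(k_dist))                   # sorted curve; choose epsilon where it starts to rise sharply
    plt.ylabel("avg distance to 5 nearest neighbors")
    plt.show()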
102
dbscan complexity
Space: O(m); Time: O(m^2) worst case, O(m log m) in low-dimensional spaces
103
What happens if neighborhood (MinPts) is too small
Then a sparse cluster may be erroneously labeled as noise
104
What happens if neighborhood (MinPts) is too large
Then dense clusters may be merged together, and small clusters may be labeled as noise
105
How many MinPts suffice for 2 dimensions
4
106
For greater than 2 dimensions, how many MinPts
2*dimensions
107
How many MinPts for large and noisy datasets
large
108
What to do if clusters are too large
Decrease epsilon
109
Too much noise
increase epsilon
110
How to choose epsilon concise
Calculate the average distance of every point to its k nearest neighbors; this k-dist will be large for noise points. Sort the values and look for the value where the curve starts rising.
111
Why validate clustering?
To avoid finding random patterns in the data, to compare two clusterings, and to compare two clustering algorithms
112
Types of validations
Internal, external, and relative
113
Internal validation
Used to measure the goodness of a clustering structure without respect to external information (SSE for example)
114
External Validation
Match cluster result with external results (class labels)
115
Relative Validation
Used to compare two different clustering algorithms or clusters
116
Two measures of internal validation
Cohesion and separation
117
Cohesion
Measures how close objects in the same cluster are (measured by within-cluster SSE)
118
Separation
Measures how well separated the clusters are
119
Silhouette value
Measures the degree of confidence in the clustering assignment of a particular observation
120
Silhouette formula
si = (bi - ai) / max(ai, bi)
121
Silhouette width
Ranges over [-1, 1]. A large si means the observation is well clustered; a small si means the observation lies between two clusters; a negative si means the observation was probably placed in the wrong cluster
122
what is ai
the average distance from the ith point to all the other points in the same cluster as the ith point
123
what is bi
the average distance from the ith point to all the points in the nearest other cluster
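A small sketch computing silhouette values with scikit-learn, assuming a data matrix X and labels from any clustering (K-means is used here only as an example):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)        # per-observation si = (bi - ai) / max(ai, bi)
    avg_width = silhouette_score(X, labels)  # mean silhouette width over all observations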