Clustering FINAL Flashcards

(123 cards)

1
Q

Cluster analysis divides data into

A

groups (clusters) that are meaningful, useful, or both

2
Q

Objects in a cluster

A

share the same characteristics

3
Q

What fields is clustering used in?

A

A variety of fields, e.g., health and medicine, business, and computer networking

4
Q

How is clustering used in health and medicine?

A

Cluster patients according to symptoms presented upon evaluation

5
Q

How is clustering used in business?

A

Cluster stores according to sales/customer satisfaction/…

6
Q

How is clustering used in computer networking?

A

Cluster traffic according to application type

7
Q

What does it mean that clusters are meaningful?

A

Clusters should capture the natural structure of the data

8
Q

What does it mean that clusters are useful?

A

Using the cluster prototype, clusters can be used as a starting point for data summarization, compression (vector quantization), classification (nearest neighbor)

9
Q

Clustering groups data objects based on

A

information found in the data that describes the objects and their relationships

10
Q

What type of learning is clustering

A

Unsupervised learning

11
Q

Clustering goal

A

Objects within a cluster should be similar to one another, but different from objects in other clusters

12
Q

Is the notion of a cluster well defined?

A

No

13
Q

Does clustering have to know exactly what it is sorting

A

No; it can sort things like pennies, nickels, and dimes without knowing how much each is worth

14
Q

Partitional clustering

A

Divide objects into non-overlapping clusters such that each object belongs to exactly one cluster

15
Q

Hierarchical clustering

A

Clusters can have subclusters

16
Q

Exclusive Clustering

A

1:1 relationship between object and cluster

17
Q

Why do we need a hyperparameter for clustering?

A

It tells us how many clusters we expect from the dataset

18
Q

Overlapping clustering

A

1:n relationship between object and cluster; an object can belong to > 1 cluster

19
Q

Fuzzy clustering

A

n:n relationship, all objects belong to all clusters with a certain probability (or membership weight)

20
Q

In fuzzy clustering, each object’s probability of belonging to all clusters should sum to

A

1.0

21
Q

Complete clustering

A

Assign every point to at least one cluster

22
Q

Partial clustering

A

Some objects may not be assigned to any cluster

23
Q

What might some objects not assigned to a cluster represent?

A

Noise or outliers

24
Q

Well-separated clusters

A

Each point is closer to all of the points in its cluster than to any point in another cluster

25
Center-based clusters
Each point is closer to the center of its cluster than to the center of any other cluster
26
Density based clusters
identifies clusters as regions of high data point density separated by regions of low density. It is useful for discovering clusters of arbitrary shapes and can handle noise effectively.
27
Conceptual clusters
Points in a cluster share some general property that derives from the entire set of points
28
K-means clustering definition
A prototype-based, partitional clustering technique that finds patterns in the data to create K clusters
29
Hierarchical clustering
Build a hierarchy of clusters starting from singleton clusters
30
DBSCAN
Density based clustering
31
K-means clustering process
Takes n observations and partitions them into k clusters
32
In k means clustering, relationship between k and n
k < n (fewer clusters than observations)
33
What type of clustering scheme is K-means
Prototype-based
34
How to use supervised techniques from unsupervised learning
Derive labels from unsupervised learning, then create a data set with the labels
35
Meaning of prototype based clustering scheme
It has a notion of a centroid (called a prototype in the literature), and we want every point to be closer to its own cluster's centroid than to any other centroid
36
How difficult is k means clustering?
Computationally difficult (NP-hard)
37
Prerequisites of K-means algorithm
Attributes must be numeric; attributes must be standardized (scaled)
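As a concrete illustration of the standardization prerequisite, a minimal sketch assuming scikit-learn is available; the feature matrix X is a made-up example.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[170.0, 65.0], [180.0, 90.0], [160.0, 50.0]])  # hypothetical numeric attributes
    X_scaled = StandardScaler().fit_transform(X)                 # each column rescaled to mean 0, std 1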
38
K-means algorithm
1. Select K points as initial centroids
2. Form K clusters by assigning each point to its closest centroid
3. Recompute the centroid of each cluster
4. Repeat steps 2-3 until the centroids do not change
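A minimal NumPy sketch of these four steps, assuming Euclidean distance and a numeric, standardized data matrix X; it is an illustration of the loop, not a production implementation (empty clusters, for example, are not handled).

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Select K points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Assign each point to its closest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute the centroid of each cluster
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Stop when the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids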
39
How to compute distances for k-means clustering
Euclidean distance or Manhattan distance
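For illustration, both distance measures on a pair of made-up points:

    import numpy as np

    p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    euclidean = np.sqrt(((p - q) ** 2).sum())   # sqrt(3^2 + 4^2) = 5.0
    manhattan = np.abs(p - q).sum()             # |3| + |4| = 7.0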
40
When do we stop algorithm for K means clustering
When an iteration leaves the SSE unchanged from the previous iteration, we know we have converged and the SSE has been minimized
41
Does K-means always converge?
Yes, it will find a minimum (perhaps not the global one) because the SSE never increases
42
How to choose initial centroids?
Random selection, or selecting a point in each natural cluster
43
Random selection of initial centroid problem
May lead to suboptimal clustering
44
Best way to initially choose centroids
Choose well-separated centroids, ideally one within each natural cluster of the space
45
Bisecting K-means goal
Form best (optimal) clusters
46
Bisecting K-means idea
To obtain K clusters, split the set of all points into two clusters, select one of the clusters to split, and so on until you have k clusters
47
Which cluster to split?
The largest one or the one with largest error (SSE)
48
Bisecting K-means algorithm
1. Initialize the list of clusters to contain the cluster consisting of all points
2. Remove a cluster from the list of clusters
3. Bisect the selected cluster using basic K-means
4. Select the two clusters from the bisection with the lowest total SSE and add them to the list of clusters
5. Repeat until the list of clusters contains K clusters
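A rough sketch of this loop under stated assumptions: Euclidean SSE, splitting the cluster with the largest SSE, and a kmeans(X, k, seed) function like the earlier sketch; the name bisecting_kmeans and the trial count are illustrative.

    import numpy as np

    def cluster_sse(points):
        # Sum of squared distances of the points to their own centroid
        return ((points - points.mean(axis=0)) ** 2).sum()

    def bisecting_kmeans(X, k, trials=5):
        clusters = [X]                                        # start with one cluster of all points
        while len(clusters) < k:
            i = int(np.argmax([cluster_sse(c) for c in clusters]))
            to_split = clusters.pop(i)                        # remove the cluster with the largest SSE
            best_parts, best_sse = None, np.inf
            for t in range(trials):                           # bisect with basic 2-means, keep lowest total SSE
                labels, _ = kmeans(to_split, 2, seed=t)
                parts = [to_split[labels == 0], to_split[labels == 1]]
                total = sum(cluster_sse(p) for p in parts if len(p) > 0)
                if total < best_sse:
                    best_parts, best_sse = parts, total
            clusters.extend(best_parts)                       # add the two resulting clusters to the list
        return clusters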
49
How to choose the k in K-means?
We want the total within cluster sum of squares to be minimal
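One common way to act on this is an elbow plot: compute the total within-cluster sum of squares for a range of K values and pick the K where the curve bends. A sketch assuming scikit-learn and matplotlib, with X a scaled data matrix as in the earlier sketches.

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    ks = range(1, 11)
    wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, wss, marker="o")    # look for the "elbow" where the decrease levels off
    plt.xlabel("K")
    plt.ylabel("Total within-cluster SS")
    plt.show()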
50
Clustering should exhibit
Low intra-cluster SS and high inter-cluster SS
51
Low intra-cluster SS
Cohesion
52
High inter-cluster SS
Separation
53
Within cluster SS
Sum of squared distances from each point in the cluster to the cluster centroid
54
Total SS
Sum of squared distances from each point in the dataset to the global mean of the data
55
Between SS
Total SS - total within SS
56
Variance of clustering
Ratio of between SS/Total SS (between 0.0 and 1.0)
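A sketch of how these quantities relate, computed from a fitted K-means model (assuming scikit-learn and a NumPy data matrix X):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()    # SS of every point to the global mean
    within_ss = km.inertia_                         # SS of every point to its cluster centroid
    between_ss = total_ss - within_ss               # Between SS = Total SS - Within SS
    ratio = between_ss / total_ss                   # between 0.0 and 1.0; higher = more variance explained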
57
What does a high ratio of between SS/Total SS mean

More of the variance is explained by the clusters
58
The higher the variance
the better, because that percentage of the variance is explained by the clustering
59
How does K-means handle outliers
Not gracefully, it will try to include these in a cluster
60
What can help K-means
Outlier detection and removal prior to clustering
61
in Kmeans how many points are assigned to a cluster?
Every one
62
Kmeans only creates what type of clusters
Globular clusters (round)
63
K-means may result in an
empty cluster
64
How to handle empty cluster problem
Choose the point that is farthest away from any centroid (essentially assigning an outlier to the empty cluster), or choose a replacement centroid at random from the cluster that has the highest SSE and split it
65
K means has problems when clusters are of differing
sizes, densities, or non-globular shapes
66
K-means complexity
Space: O((m+K)n); Time: O(I*K*m*n), where m = number of points, n = number of attributes, K = number of clusters, and I = number of iterations
67
What might outliers lead to in k means
Higher SSE and non-representative cluster centroids
68
What are hierarchical clustering inspired by?
The area of taxonomy, where hierarchical structures are common and elements under the same hierarchy automatically constitute a cluster
69
What is not required of hierarchical clustering unlike k means
Choosing K a priori. You look at the hierarchy it creates and then choose a K according to your understanding of the problem
70
Is not having to choose a k a-prior an advantage or disadvantage?
Both
71
Two approaches to hierarchical clustering
Agglomerative (bottom-up) and divisive (top-down)
72
Agglomerative
Each point starts off as individual cluster, and at each step, merge closest pairs of clusters (Need cluster proximity metric)
73
Divisive
All points in one cluster, at each step, split a cluster until singleton clusters of individual points remain (Which cluster to split and how to do the splitting)
74
How is hierarchical clustering depicted
Depicted as a tree-like structure called a dendrogram
75
Dendrogram y-axis and x-axis
The y-axis is labeled with the proximity between clusters; the x-axis shows the labels of the data points
76
How to interpret dendrogram
Look for the closest pair of clusters (in terms of squared distance) and join them. Continue until one cluster is left
77
How to pick clusters in dendrogram
Draw horizontal line that intersects k lines, k being the number of clusters you want
78
Hierarchical clustering algorithm
1. Compute the proximity matrix, if necessary
2. Merge the closest two clusters
3. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
4. Repeat until only one cluster remains
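In practice the agglomerative procedure is usually run through a library. A sketch assuming SciPy and a data matrix X: linkage builds the merge history from the proximities, dendrogram draws the tree, and fcluster cuts it into a chosen number of clusters (the horizontal-line idea from the earlier card).

    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from scipy.spatial.distance import pdist

    D = pdist(X, metric="euclidean")                 # pairwise proximities (condensed distance matrix)
    Z = linkage(D, method="average")                 # repeatedly merge the closest clusters; also "single", "complete"
    dendrogram(Z)                                    # y-axis shows the proximity at which clusters merge
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters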
79
3 ways for merge to be done in hierarchical clustering algorithm
MIN (single link), MAX (complete link), Group average
80
MIN (single link)
Defines the distance between two clusters as the shortest distance between any pair of points in the two clusters
81
MAX (complete Link)
Measures the distance between two clusters as the greatest distance between any pair of points in the two clusters
82
Group Average
Defines the distance between two clusters as the average of all pairwise distances between points in the two clusters
83
Which merging is sensitive to outliers, which one is not sensitive to outliers
Sensitive: MIN (single link); not sensitive: MAX (complete link)
84
Hierarchical clustering complexity
Space: O(m^2) (proximity matrix); Time: O(m^2 log m) using a heap to find the closest pair of clusters
85
Does scaling matter in hierarchical clustering
Yes, if attributes have widely differing ranges
86
What dissimilarity measure should be used in HC?
Euclidean or Manhattan
87
What linkage to be used in HC?
Same as before: MIN, MAX, or group average, depending on whether your data is susceptible to outliers
88
Unlike k-means, merging decisions are
final; observations cannot be reassigned to a different cluster later
89
Even though clustering in general is considered an unsupervised learning method,
it can be used in supervised learning mode
90
Clusters indicate a _ and each object in a certain cluster _
class label, belongs to that class
91
Error measures for supervised clustering are
the same as classification
92
What is DBSCAN?
A density-based clustering algorithm parametrized by a radius (neighborhood) and a number of neighboring points
93
e-Neighbourhood
Objects within a radius e from a source object
94
Density (DBSCAN)
If e-neighborhood of a source point (object) contains at least MinPts other points (object), then the source point is in a "high-density" area
95
DBSCAN divides all points into
Core points, border points, and noise points
96
Core point
A point that has at least MinPts points within its e-neighborhood
97
Border point
A point that is not a core point, but is in the neighborhood of a core point
98
Noise point
Points that are neither core nor border points
99
Core, border, and noise points: the idea
Pick a point and draw a circle of radius e around it. If at least MinPts points fall within the circle, the picked point is a core point. A point that falls within a core point's circle but is not itself a core point is a border point. Points that are neither are noise points.
100
Density-based clustering: dbscan algorithm steps
1. Label all points as core, border, or noise points
2. Eliminate all noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points
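A sketch using scikit-learn's DBSCAN with illustrative parameter values; eps is the radius and min_samples plays the role of MinPts.

    import numpy as np
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=0.5, min_samples=4).fit(X)   # eps = neighborhood radius, min_samples = MinPts
    labels = db.labels_                          # cluster index per point; -1 marks noise points
    is_core = np.zeros(len(X), dtype=bool)
    is_core[db.core_sample_indices_] = True      # True for core points; remaining non-noise points are border points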
101
How to pick epsilon
Graph the distance of each point in the data set to its nearest 5 neighbors, across all points in the data set. Sort these distances and look at the curve; choose the epsilon value where the curve starts to rise sharply.
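A sketch of that plot, assuming scikit-learn and matplotlib, using the 5 nearest neighbors as in the card:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    k = 5
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # +1: each point is its own closest neighbor
    k_dist = dists[:, 1:].mean(axis=1)          # average distance to the 5 nearest neighbors
    plt.plot(np.sort(k_dist))                   # sorted curve; choose epsilon where it starts to rise sharply
    plt.ylabel("avg distance to 5 nearest neighbors")
    plt.show()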
102
dbscan complexity
Space: O(m); Time: O(m^2) worst case, O(m log m) in low-dimensional spaces
103
What happens if neighborhood (MinPts) is too small
Then a sparse cluster may be erroneously labeled as noise
104
What happens if neighborhood (MinPts) is too large
Then dense clusters may be merged together, and small clusters may be labeled as noise
105
How many MinPts suffice for 2 dimensions
4
106
For greater than 2 dimensions, how many MinPts
2*dimensions
107
How many MinPts for large and noisy datasets
large
108
What to do if clusters are too large
Decrease epsilon
109
Too much noise
increase epsilon
110
How to choose epsilon concise
Calculate the average distance of every point to its k nearest neighbors; this k-dist will be large for noise points. Sort the values and look for the value where the curve starts rising.
111
Why validate clustering?
To avoid finding random patterns in the data, to compare two clusterings, and to compare two clustering algorithms
112
Types of validations
Internal, external, and relative
113
Internal validation
Used to measure the goodness of a clustering structure without respect to external information (SSE for example)
114
External Validation
Match cluster result with external results (class labels)
115
Relative Validation
Used to compare two different clustering algorithms or clusters
116
Two measures of internal validation
Cohesion and separation
117
Cohesion
Measures how close objects in the same cluster are (measured by within-cluster SSE)
118
Separation
Measures how well separated the clusters are
119
Silhouette value
Measures the degree of confidence in the clustering assignment of a particular observation
120
Silhouette formula
si = (bi - ai) / max(ai, bi)
121
Silhouette width
Ranges over [-1, 1]. A large si means the observation is well clustered; a small si means the observation lies between two clusters; a negative si means the observation was probably placed in the wrong cluster
122
what is ai
the average distance from the ith point to all the other points in the same cluster as the ith point
123
what is bi
the average distance from the ith point to all the points in the nearest other cluster
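A small sketch computing silhouette values with scikit-learn, assuming a data matrix X and labels from any clustering (K-means is used here only as an example):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)        # per-observation si = (bi - ai) / max(ai, bi)
    avg_width = silhouette_score(X, labels)  # mean silhouette width over all observations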