Data Mining Flashcards

1
Q

What are the seven main types of clustering algorithms?

A

Pattern-Based
Projected
Partitioning/Representative
Density
Hierarchical
Bi-Clustering
Correlation

2
Q

Which algorithms are Pattern-Based?

A

p-Cluster
MaPle
EDSC

3
Q

Which algorithms are Projection-based?

A

PROCLUS: PROjected CLUStering
MDS (Multi-Dimensional Scaling)
Isomap
t-SNE

4
Q

Which algorithms are partitioning/representative?

A

k-Means
k-Medoids

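As an illustration of the partitioning idea behind k-Means, here is a minimal 1-D sketch of Lloyd's iteration (not the lecture's implementation; all names are illustrative):

```python
# Minimal 1-D k-Means (Lloyd's algorithm) -- an illustrative sketch.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0]))
```

k-Medoids follows the same loop but restricts each representative to be an actual data point, which makes it more robust to outliers.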
5
Q

Which algorithms are Density based?

A

CLIQUE
DBSCAN
OPTICS
OP-Cluster

6
Q

Which algorithms are hierarchical?

A

DiSH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CURE (Clustering Using Representatives)

7
Q

Which algorithms use bi-clustering?

A

delta-bicluster

8
Q

What is the main goal in clustering?

A

To find meaningful features

9
Q

What are four strategies to deal with high-dimensional data?

A

1) dimensionality reduction (PCA, MDS, SNE)
2) regularization (L1, L2)
3) ensemble methods
4) projected clustering (MDS, SNE)

10
Q

What function describes the probability that a random value will be <= a given value?

A

Cumulative Distribution Function (CDF)

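The empirical version of this function is easy to sketch: F(x) is the fraction of observed samples that are <= x (an illustration, not from the deck):

```python
def ecdf(samples, x):
    """Empirical CDF: fraction of observed values <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)

data = [2, 5, 5, 7, 9]
print(ecdf(data, 5))  # 3 of the 5 values are <= 5, so 0.6
```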
11
Q

What types of kernels can be used to estimate density?

A

Discrete
Gaussian
Multivariate

12
Q

What process splits data into cells where all the points are closest to the seed point?

A

Voronoi tessellation (parcelling)

13
Q

What process compares the local density of a point to the local density of its k-nearest-neighbors?

A

LOF (local outlier factor)

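A simplified 1-D sketch of this comparison (it uses plain k-NN distances rather than LOF's reachability distances, so it only approximates the real score; all names are illustrative):

```python
# Simplified LOF sketch in 1-D with k = 2. It uses plain k-NN distances
# instead of reachability distances, so it only approximates the real LOF.
def knn(points, i, k):
    ranked = sorted((abs(points[i] - points[j]), j)
                    for j in range(len(points)) if j != i)
    return [j for _, j in ranked[:k]]

def density(points, i, k):
    # Inverse of the average distance to the k nearest neighbors.
    return k / sum(abs(points[i] - points[j]) for j in knn(points, i, k))

def lof(points, i, k=2):
    # Ratio of the neighbors' densities to the point's own density:
    # values well above 1 flag the point as an outlier.
    neigh = knn(points, i, k)
    return sum(density(points, j, k) for j in neigh) / (k * density(points, i, k))

pts = [1.0, 1.1, 1.2, 8.0]
print(lof(pts, 3), lof(pts, 1))  # the isolated 8.0 scores far above 1
```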
14
Q

What are masking and swamping?

A

Masking is when an outlier goes undetected because it gets included in a cluster (e.g. hidden by nearby outliers). Swamping is when the model is distorted so that inliers appear as outliers

15
Q

What is the silhouette score?

A

A measure of how well a data point matches its own cluster relative to the other clusters; it ranges from -1 to 1

16
Q

What is the silhouette score used for?

A

To evaluate the performance of an algorithm and/or to decide on the number of clusters to set as a parameter

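For a single point, the score is s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. A small sketch with made-up 1-D data:

```python
# Silhouette of a single point (1-D sketch): a = mean distance to its own
# cluster, b = mean distance to the nearest other cluster.
def silhouette(point, own, other):
    a = sum(abs(point - p) for p in own) / len(own)
    b = sum(abs(point - p) for p in other) / len(other)
    return (b - a) / max(a, b)

# A point tight within its cluster and far from the other scores near 1.
print(silhouette(1.0, [1.1, 0.9], [9.0, 10.0]))
```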
17
Q

What types of norms are there?

A

Euclidean
Manhattan
Max norm
Weighted Euclidean
Quadratic

18
Q

What is an outlier?

A

Arouses suspicion that it was generated by a different mechanism
Appears to deviate markedly from the sample
Is inconsistent with the dataset

19
Q

Why do outliers occur?

A

measurement/transmission errors, data input/processing errors, attacks/fraud

20
Q

What’s the difference between a local and a global outlier?

A

Local outlier: instance that is very different from the instances around it
Global outlier: very different from entire dataset

21
Q

What are Arthur’s main challenges in dealing with High Dimensional Data?

A

1) “concentration effect”: the curse of dimensionality
2) discrimination vs. ranking of values
3) combinatorial issues and subspace selection
4) hubness

22
Q

What is hubness?

A

Phenomenon where some instances in a dataset (hubs) occur as the nearest neighbors of many other instances, more than expected by chance

23
Q

What is the definition of “concentration of distances”/curse of dimensionality?

A

The ratio of the variance to the mean of the length of any point vector converges to zero with increasing data dimensionality

24
Q

What is a shrinking hypersphere?

A

A method used in density-based clustering (e.g. DBSCAN) to find clusters of similar data in a high-dimensional space

25
Q

What else can be useful when absolute distance is not?

A

Distance rankings

26
Q

What is the central limit theorem?

A

A result in probability theory: the mean (or sum) of a large number of independent samples is approximately normally distributed, regardless of the underlying distribution (relevant to the curse of dimensionality)

27
Q

What are 8 problems with High Dimensional Data?

A

1) curse of dimensionality
2) noise
3) circular dependency: you need the right neighbors to choose the right subspace, and the right subspace to find the correct neighbors
4) bias of scores (toward high-dimensional subspaces)
5) scores appearing identical
6) exponential number of subspaces
7) data-snooping bias
8) hubness

28
Q

What are two methods of subspace traversal?

A

Top-down and bottom-up

29
Q

What does projected clustering do?

A

Partitions the data into disjoint clusters

30
Q

What does subspace clustering do?

A

Finds all clusters in all subspaces (possibly with overlap)

31
Q

How does Apriori work?

A

1) Scan the DB once for each length of a transaction pattern
2) Count occurrences of the candidates
3) Eliminate candidates that are not frequent from the next round of longer combinations ("pruning") => THINK ALPHABET COMBOS

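The scan/count/prune loop can be sketched as a toy implementation (not the lecture's code; `min_support` here is an absolute count):

```python
from itertools import combinations

# Tiny Apriori sketch: one DB scan per candidate length, counting support
# and pruning infrequent itemsets ("alphabet combos" of items).
def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent, k = [], 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = [c for c, n in counts.items() if n >= min_support]
        frequent += survivors
        k += 1
        # Candidates of length k: unions of survivors, all of whose subsets
        # also survived (downward closure / monotonicity).
        current = sorted({a | b for a in survivors for b in survivors
                          if len(a | b) == k},
                         key=sorted)
        current = [c for c in current
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))]
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print([sorted(f) for f in apriori(db, 3)])
```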
32
Q

What is the downward closure property?

A

AKA "monotonicity": every subset of a frequent itemset must also be frequent. Basis for Apriori's pruning.

33
Q

Are subspace and projected clustering usually bottom-up or top-down?

A

Subspace clustering is usually bottom-up; projected clustering is usually top-down

34
Q

What is bottom-up clustering?

A

Starting with each point as its own cluster and successively merging pairs of clusters

35
Q

What is top-down clustering?

A

Starts with all points as a single cluster and recursively divides it into smaller clusters

36
Q

What is bi-clustering?

A

Simultaneously clustering the rows and columns of a data matrix

37
Q

How does PCA work?

A

1) Take a local selection of points, then build the covariance matrix to derive the eigensystem
2) Define a hyperplane using the strong eigenvectors (largest eigenvalues)
3) The weakest eigenvectors (smallest eigenvalues) define the directions of minimum variance

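A global (non-local) numeric sketch of the eigensystem step, on synthetic 2-D data with one strong direction (variable names are illustrative):

```python
import numpy as np

# PCA sketch: covariance matrix -> eigensystem -> strong vs. weak eigenvectors.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
# y is roughly 2*x, so the data concentrate along one direction.
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

C = np.cov(X, rowvar=False)        # covariance matrix of the points
vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
strong = vecs[:, np.argmax(vals)]  # eigenvector of the largest eigenvalue
print(vals, strong)                # strong direction ~ (1, 2)/sqrt(5), up to sign
```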
38
Q

What are some non-axis-parallel approaches to clustering?

A

Pattern-based clustering
Bi-/co-clustering
Correlation clustering
Density-based clustering (usually)

39
Q

Why use a non-axis-parallel approach over an axis-parallel one?

A

It can handle irregularly shaped clusters that do not follow a ball- or ellipsoid-shaped space

40
Q

What do pattern-based clustering algorithms find?

A

Pairwise linear dependencies (simple positive correlations)

41
Q

What characterizes axis-parallel algorithms?

A

They break the feature space down into hyperrectangles that form the clusters

42
Q

What does correlation clustering find?

A

Linear dependencies (more general correlations)

43
Q

What are the three main problems for clustering ensembles?

A

1) the number of clusters can vary among different solutions
2) cluster labels are symbolic (not classes)
3) it is harder to see how to achieve diversity

44
Q

What is used to combine the results of ensemble clustering?

A

A consensus function

45
Q

What are the names of some correlation clustering methods?

A

4C: Computing Clusters of Correlation Connected Objects
Cluster model: Deriving Quantitative Models for Correlation Clusters
COPAC
CASH (Hough-transform based)
ERiC: Explaining Relationships Among Correlation Clusters

46
Q

Which algorithm requires two parameters and what are they?

A

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): eps (neighborhood radius) and MinPts (minimum number of points within that radius)

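A minimal 1-D sketch showing how the two parameters drive the algorithm (illustrative, not the reference implementation):

```python
# Minimal DBSCAN sketch (1-D): eps is the neighborhood radius, min_pts the
# minimum neighborhood size (including the point itself) for a core point.
def dbscan_1d(points, eps, min_pts):
    labels = {}      # index -> cluster id, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if i in labels:
            continue
        neigh = [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1             # provisionally noise
            continue
        cluster += 1
        seeds = list(neigh)
        while seeds:                   # expand the cluster from core points
            j = seeds.pop()
            if labels.get(j, -1) >= 0:
                continue               # already in a cluster
            labels[j] = cluster
            jn = [m for m in range(len(points)) if abs(points[j] - points[m]) <= eps]
            if len(jn) >= min_pts:     # j is a core point: grow further
                seeds += [m for m in jn if m not in labels]
    return [labels[i] for i in range(len(points))]

pts = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.9]
print(dbscan_1d(pts, eps=0.3, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```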
47
Q

Which type of clustering usually adopts the bottom-up strategy and which one usually adopts the top-down strategy? What do they rely on?

A

Bottom-up: subspace clustering
Top-down: projected clustering
Both rely on the assumption that subspaces are axis-parallel

48
Q

What are the two flavors of correlation clustering based on?

A

PCA and the Hough Transform

49
Q

What is a fundamental efficiency technique for finding the localized region near a point as an outlier detection mechanism?

A

Approximate neighborhoods

50
Q

What are three ways to implement approximate neighborhoods?

A

Random projection, locality-sensitive hashing, space-filling curves

51
Q

What are five major efficiency techniques used in top-N outlier detection?

A

Approximate neighborhoods
RBRP (Recursive Binning and Re-Projection)
PINN (Projection-Indexed Nearest Neighbors)
Space-filling curves (very good)
Neighborhood approximations

52
Q

What is the goal of subspace outlier detection?

A

To find outliers in relevant subspaces that are not outliers in the full-dimensional space

53
Q

What is data snooping bias?

A

When the selection of a model or set of parameters is influenced by the data it is being run against (which can lead to an overfit model that doesn't generalize well)

54
Q

What are three methods of subspace outlier detection?

A

OutRank
SOD
Correlation Outlier

55
Q

What is another term for similarity?

A

Distance measure

56
Q

What is the main challenge with similarity?

A

Defining a suitable distance

57
Q

What are some common evaluation methods?

A

F1, precision, recall, silhouette score

58
Q

What is conditional entropy?

A

A measure of the amount of uncertainty in one variable given knowledge of another random variable: H(X|Y), where X and Y are the two variables being considered

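H(X|Y) can be computed directly from joint and marginal counts (a small sketch; the data values are made up):

```python
from math import log2
from collections import Counter

# Conditional entropy H(X|Y) from a list of (x, y) observations -- a sketch.
def conditional_entropy(pairs):
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (x, y) pairs
    py = Counter(y for _, y in pairs)       # marginal counts of y
    return -sum((c / n) * log2((c / n) / (py[y] / n))
                for (x, y), c in joint.items())

# X fully determined by Y -> no remaining uncertainty.
print(conditional_entropy([(0, 0), (1, 1), (0, 0), (1, 1)]))
# Y tells us nothing about a fair X -> one full bit of uncertainty.
print(conditional_entropy([(0, 0), (1, 0), (0, 0), (1, 0)]))
```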
59
Q

What is pair counting?

A

A technique used to measure the correlation between two variables or sets of data. It involves counting the number of data points that have a specific relationship or fall within a certain range, and can be used to compute correlation.

60
Q

What are two solutions to evaluating results?

A

Mapping sets of objects and pair counting

61
Q

What does pairwise-stable mean?

A

When different subsets of the data are used, the same inputs are consistently grouped together; a measure of the robustness of an algorithm

62
Q

What is variance?

A

The degree of dispersion of the data points within a cluster (how far data points are from each other and the centroid)

63
Q

What is the sample count in the Apriori algorithm called?

A

Support

64
Q

What does ORCLUS use for dimensionality reduction?

A

SVD (Singular Value Decomposition)

65
Q

What are four consensus methods mentioned in lecture?

A

HGPA (Hypergraph Partitioning)
CSPA (Cluster-based Similarity Partitioning)
Maximum likelihood
Voting

66
Q

What is SUBCLU's three-step process?

A

1) All 1-dimensional subspaces are clustered; all clusters in higher-dimensional subspaces must be subsets of these clusters, producing (k+1)-dimensional candidate subspaces
2) After pruning noise, DBSCAN is run on the candidate subspaces to see if they contain clusters
3) If a subspace does, it is used in the next combination of subspaces

67
Q

What is single-linkage?

A

Merging based on the minimum distance between any two observations

68
Q

What is complete linkage?

A

Merging based on the maximum distance between any two observations

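The two linkage criteria differ only in the aggregate they take over pairwise distances (a 1-D sketch with made-up clusters):

```python
# Single vs. complete linkage between two clusters (1-D sketch):
# single = minimum pairwise distance, complete = maximum pairwise distance.
def single_linkage(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    return max(abs(a - b) for a in c1 for b in c2)

A, B = [1.0, 2.0], [4.0, 6.0]
print(single_linkage(A, B), complete_linkage(A, B))  # 2.0 5.0
```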
69
Q

What are the three phases of PROCLUS?

A

1) Initialization of cluster medoids
2) Iterative phase (refining the medoids from Phase 1)
3) Refinement (reassigning subspaces to medoids)

70
Q

How does SVD (Singular Value Decomposition) work?

A

1) Construct a covariance matrix from the data
2) Find the eigenvectors and eigenvalues of that matrix
3) Choose the strongest (largest) eigenvalues to retain the most useful information

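A quick numeric sketch of the idea that the strongest components retain most of the information (the data values are made up):

```python
import numpy as np

# SVD sketch: the strongest singular values/components carry most of the
# information, so a low-rank reconstruction stays close to the data.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
Xc = X - X.mean(axis=0)                    # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# Rank-1 reconstruction from the strongest component only.
X1 = s[0] * np.outer(U[:, 0], Vt[0])
print(s, np.abs(Xc - X1).max())            # one dominant singular value
```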
71
Q

What are the three major steps in ORCLUS?

A

1) Assign: assign each point to the seed with the lowest distance, calculate new seeds, and repeat
2) Find vectors: construct the covariance matrix, find the eigenvectors and eigenvalues, and pick the lowest
3) Merge: use SVD on each pair of clusters, use Energy as a metric for how well they might combine, merge the best-fitting ones, and recalculate

72
Q

What is the quality measure of clusters called in ORCLUS?

A

Energy

73
Q

What are the inputs to ORCLUS?

A

k: number of clusters
k0: number of seeds to start with
l: number of dimensions to project clusters onto

74
Q

What does pairwise stable mean?

A

The between-cluster distance dominates the within-cluster distance