Data Mining Flashcards
What are the seven main types of clustering algorithms?
Pattern-Based
Projected
Partitioning/Representative
Density
Hierarchical
Bi-Clustering
Correlation
Which algorithms are Pattern-Based?
p-Cluster
MaPle
EDSC
Which algorithms are Projection-based?
PROCLUS: PROjected CLUStering
MDS (multidimensional scaling)
Isomap
t-SNE
Which algorithms are partitioning/representative?
kMeans
kMedoid
Which algorithms are Density based?
CLIQUE
DBSCAN
OPTICS
OP-Cluster
Which algorithms are hierarchical?
DiSH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CURE (Clustering Using Representatives)
Which algorithms use bi-clustering?
delta-bicluster
What is the main goal in clustering?
To find meaningful groups in the data, so that similar points share a cluster and dissimilar points do not
What are four strategies to deal with high-dimensional data?
1) dimensionality reduction (PCA, MDS, SNE)
2) regularization (L1, L2)
3) ensemble methods
4) projected clustering (MDS, SNE)
What function describes the probability that a random value will be <= a given value?
Cumulative distribution function (CDF)
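As a small illustration, the CDF can be estimated from a sample by counting the fraction of values at or below a threshold. This is a minimal sketch; the function name `empirical_cdf` and the sample values are made up for the example.

```python
def empirical_cdf(sample, x):
    """Fraction of sample values that are <= x, an estimate of P(X <= x)."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for v in sample if v <= x) / len(sample)


# Hypothetical sample: three of the five values are <= 3.5.
sample = [2.0, 3.5, 3.5, 5.0, 8.0]
print(empirical_cdf(sample, 3.5))  # 0.6
```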
What types of kernels can be used to estimate density?
Discrete
Gaussian
Multivariate
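For the Gaussian case, a one-dimensional kernel density estimate can be sketched as an average of Gaussian bumps centred on the sample points. The function name `gaussian_kde` and the `bandwidth` parameter are assumptions chosen for this example, not taken from any particular library.

```python
import math


def gaussian_kde(sample, x, bandwidth):
    """Estimate the density at x by averaging Gaussian kernels placed on each sample point."""
    if not sample or bandwidth <= 0:
        raise ValueError("need a non-empty sample and a positive bandwidth")
    # Normalising constant so that the estimate integrates to 1.
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)


# Hypothetical data: the estimate is higher near the sample points than far from them.
data = [1.0, 1.2, 0.8, 5.0]
near = gaussian_kde(data, 1.0, bandwidth=0.5)
far = gaussian_kde(data, 10.0, bandwidth=0.5)
print(near > far)  # True
```

A smaller bandwidth gives a spikier estimate and a larger one a smoother estimate.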
What process splits data into cells where all the points are closest to the seed point?
Voronoi tessellation (also called a Voronoi diagram or Voronoi partition)
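Membership in a Voronoi cell can be sketched as a nearest-seed lookup. The helper name `voronoi_cell` and the coordinates below are hypothetical, and Euclidean distance is assumed.

```python
import math


def voronoi_cell(point, seeds):
    """Return the index of the seed closest to `point`; that index identifies the cell."""
    return min(range(len(seeds)), key=lambda i: math.dist(point, seeds[i]))


# Hypothetical seeds and points in 2D.
seeds = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = [(1.0, 1.0), (3.0, 0.5), (0.5, 3.5)]
cells = [voronoi_cell(p, seeds) for p in points]
print(cells)  # [0, 1, 2]
```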
What process compares the local density of a point to the local density of its k-nearest-neighbors?
LOF (local outlier factor)
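The LOF computation can be sketched in plain Python as follows. This is a simplified version under stated assumptions: Euclidean distance, exactly k neighbors per point (ties at the k-distance are not expanded), and no duplicate points. The function name `lof_scores` is made up for the example.

```python
import math


def lof_scores(points, k):
    """Local outlier factor per point; values near 1 are inliers, larger values are outliers."""
    n = len(points)
    # k nearest neighbours of each point (excluding itself) as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append(dists[:k])
    # Distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))  # assumes sum(reach) > 0, i.e. no duplicates
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four corners of a unit square plus one isolated point.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(pts, k=2)
print(scores)  # the four corners score about 1, the isolated point scores well above 1
```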
What are masking and swamping?
Masking is when an outlier goes undetected because other outliers distort the model, for example by being absorbed into a cluster. Swamping is when outliers distort the model so that inliers appear as outliers.
What is the silhouette score?
A measure of how well a data point fits its own cluster compared with the nearest other cluster. It ranges from -1 (likely in the wrong cluster) to 1 (well matched to its cluster).
What is the silhouette score used for?
To evaluate the performance of an algorithm and/or to decide on the number of clusters to set as a parameter
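A minimal sketch of the per-point silhouette value s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to any other cluster. It assumes Euclidean distance and at least two clusters; the function name `silhouette` and the data are hypothetical.

```python
import math


def silhouette(points, labels, i):
    """Silhouette value of point i given a cluster label for every point."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:
        return 0.0  # common convention for a cluster with a single member
    # a: mean distance to the other members of the point's own cluster.
    a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
    # b: smallest mean distance to the members of any other cluster.
    b = math.inf
    for other in set(labels) - {labels[i]}:
        members = [j for j, lab in enumerate(labels) if lab == other]
        mean_dist = sum(math.dist(points[i], points[j]) for j in members) / len(members)
        b = min(b, mean_dist)
    return (b - a) / max(a, b)


# Hypothetical 1D data with one sensible labelling and one poor labelling.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
mean_good = sum(silhouette(data, good, i) for i in range(len(data))) / len(data)
mean_bad = sum(silhouette(data, bad, i) for i in range(len(data))) / len(data)
print(mean_good > mean_bad)  # True
```

Averaging the value over all points for several candidate cluster counts, and keeping the count with the highest mean, is one way to choose the number of clusters.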
What types of norms are there?
Euclidean
Manhattan
Max norm
Weighted Euclidean
Quadratic
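The norms above can be sketched as distances between two vectors, that is, the norm of their difference. The function names, the weights `w`, and the matrix `m` are assumptions for this example; the quadratic form is only a proper distance when `m` is positive semi-definite.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def max_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def weighted_euclidean(x, y, w):
    # Each squared coordinate difference is scaled by its weight.
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))


def quadratic(x, y, m):
    # sqrt(d^T M d) with d = x - y; a diagonal M reduces to the weighted Euclidean case.
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d[i] * m[i][j] * d[j] for i in range(len(d)) for j in range(len(d))))


x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
print(max_norm(x, y))   # 4.0
```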
What is an outlier?
Arouses suspicion that it was generated by a different mechanism
Appears to deviate markedly from the sample
Is inconsistent with the dataset
Why do outliers occur?
measurement/transmission errors, data input/processing errors, attacks/fraud
What’s the difference between a local and a global outlier?
Local outlier: instance that is very different from the instances around it
Global outlier: very different from entire dataset
What are Arthur’s main challenges in dealing with High Dimensional Data?
1) “concentration effect”: curse of dimensionality
2) discrimination vs. ranking of values
3) combinatorial issues and subspace selection
4) Hubness
What is hubness?
Phenomenon where some instances in a dataset (hubs) occur as the nearest neighbors of many other instances, more than expected by chance
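Hubness is often measured with the k-occurrence count: how many times a point appears in the k-nearest-neighbor lists of other points. A minimal sketch, assuming Euclidean distance; the function name `k_occurrence` and the toy data are hypothetical.

```python
import math
from collections import Counter


def k_occurrence(points, k):
    """For each point, count how often it appears among other points' k nearest neighbours."""
    counts = Counter()
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        # sorted() is stable, so ties in distance are broken by the lower index.
        order = sorted(others, key=lambda j: math.dist(p, points[j]))
        counts.update(order[:k])
    return [counts[i] for i in range(len(points))]


# Hypothetical data: a centre point surrounded by four points at distance 1.
pts = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
occ = k_occurrence(pts, k=1)
print(occ)  # [4, 1, 0, 0, 0]
```

The centre point is the nearest neighbor of all four surrounding points, so it acts as a hub in this toy example.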
What is the definition of “concentration of distances”/curse of dimensionality?
The variance of point-vector lengths, relative to their expected length, converges to zero as data dimensionality increases, so the distances to the nearest and farthest points become hard to tell apart
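The concentration effect can be observed with a small simulation. This sketch assumes points drawn uniformly from the unit hypercube and measures the variance of vector lengths after dividing each length by the mean length; the function name `relative_length_variance`, the sample size, and the seed are arbitrary choices for the example.

```python
import math
import random
import statistics


def relative_length_variance(dim, n_points=200, seed=0):
    """Variance of ||x|| / mean(||x||) for random points in [0, 1]^dim."""
    rng = random.Random(seed)
    lengths = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    mean_length = statistics.fmean(lengths)
    return statistics.pvariance([length / mean_length for length in lengths])


low = relative_length_variance(2)
high = relative_length_variance(1000)
print(low > high)  # True: the relative spread shrinks as dimensionality grows
```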
What is a shrinking hypersphere?
A method used in density-based clustering (e.g. DBSCAN) to find clusters of similar data points in a high-dimensional space