Data_to_Insights Flashcards
Rand Index
The Rand index[1] or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. From a mathematical standpoint, Rand index is related to the accuracy, but is applicable even when class labels are not used.
Rand Index
Measures like the Rand Index are called external evaluations,
because they require outside information about a ground truth clustering.
Ggobi
GGobi is an open source visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics such as tours, as well as familiar graphics such as the scatterplot, barchart and parallel coordinates plots. Plots are interactive and linked with brushing and identification.
Amazon Mechanical turk
https://www.mturk.com/media/intro/mainbanner.gif
Clustering
Grouping data according to similarity
k-means
K-means clustering is typically very sensitive to outliers.
k-medoids
s based on centroids (or medoids) calculating by minimizing the absolute distance between the points and the selected centroid, rather than minimizing the square distance. As a result, it’s more robust to noise and outliers than k-means.
PCA
Principal Component Analysis
It is most often used when each data point contains a lot of measurements and
not all of those may be meaningful.
Or there’s a lot of covariance in the measurements.
Clustering: ground truth
we believe that the ground truth is
that there’s exactly one true group that the data point belongs to.
Feature vs Cluster
the underlying structure is a feature allocation instead of a clustering.
Note that this is a different use of the word feature than we saw on
previous videos.
A similar idea is to say that the data points exhibit mixed membership.
feature allocation, admixture, mixed membership
capture the idea that data points can belong to multiple groups simultaneously.
Eigenvector
It is most often used when each data point contains a lot of measurements and
not all of those may be meaningful.
Or there’s a lot of covariance in the measurements.
Eigenvector
It is most often used when each data point contains a lot of measurements and
not all of those may be meaningful.
Or there’s a lot of covariance in the measurements.
The eigenvectors with the largest eigenvalues are the principle components.
Volume
number of edges in a cluster
LaPlacian
If there’s an edge, the entry is -1, otherwise it’s 0.
The entries on the diagonal are the degrees of the nodes.
The degree of a node is the number of edges that it meets.