Data_to_Insights Flashcards

1
Q

Rand Index

A

The Rand index or Rand measure (named after William M. Rand), in statistics and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is called the adjusted Rand index. From a mathematical standpoint, the Rand index is related to accuracy, but it is applicable even when class labels are not used.
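The pair-counting definition is short enough to sketch directly; this is a minimal pure-Python illustration (the function name is our own):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    either both clusterings put the pair together, or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Relabeling the clusters does not change the score, which is why the index needs no class labels.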

2
Q

Rand Index

A

Measures like the Rand Index are called external evaluations,
because they require outside information about a ground truth clustering.

3
Q

Ggobi

A

GGobi is an open-source visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics such as tours, as well as familiar graphics such as scatterplots, bar charts, and parallel coordinate plots. Plots are interactive and linked with brushing and identification.

4
Q

Amazon Mechanical turk

A

Amazon Mechanical Turk (mturk.com) is a crowdsourcing marketplace where requesters post small tasks to be completed by a distributed human workforce. In data analysis it is commonly used to collect human labels, such as ground-truth annotations for training or evaluating models.

5
Q

Clustering

A

Grouping data according to similarity

6
Q

k-means

A

K-means partitions the data into k clusters by assigning each point to the nearest centroid and moving each centroid to the mean of its assigned points. Because the mean minimizes squared distance, k-means clustering is typically very sensitive to outliers.
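The two alternating steps can be sketched in a few lines of NumPy; this is a minimal illustration (the naive first-k-points initialization is our simplification, real implementations initialize randomly or with k-means++):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    centers = X[:k].astype(float).copy()   # naive init: first k points
    for _ in range(iters):
        # assign each point to the nearest centroid (squared distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated blobs of three points each.
X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans(X, 2)
```

On separated blobs the assignment recovers the blobs; an outlier added to either blob would drag its mean, which is the source of the sensitivity noted above.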

7
Q

k-medoids

A

K-medoids is based on medoids: representative data points chosen by minimizing the absolute distance between the points and the selected medoid, rather than minimizing the squared distance. As a result, it is more robust to noise and outliers than k-means.
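The robustness is easy to see in one dimension; a small sketch (the function name is our own) comparing the medoid with the mean in the presence of an outlier:

```python
import numpy as np

def medoid(points):
    """The data point minimizing total absolute distance to all others."""
    dists = np.abs(points[:, None] - points[None, :]).sum(axis=1)
    return points[dists.argmin()]

x = np.array([1.0, 2.0, 3.0, 100.0])   # one extreme outlier
```

The medoid stays at 2.0, a central data point, while the mean is dragged to 26.5 by the outlier.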

8
Q

PCA

A

Principal Component Analysis.
It is most often used when each data point contains many measurements, not all of which may be meaningful, or when there is a lot of covariance among the measurements.
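A minimal NumPy sketch of the idea, via an eigendecomposition of the covariance matrix (the function name is our own):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the directions of greatest variance."""
    Xc = X - X.mean(axis=0)                # center each measurement
    cov = np.cov(Xc, rowvar=False)         # covariance between measurements
    vals, vecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    top = vecs[:, ::-1][:, :n_components]  # top-variance directions
    return Xc @ top

# Two perfectly correlated measurements collapse onto a single component.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Z = pca(X, 1)
```

Here both measurements carry the same information, so one principal component preserves all the variance.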

9
Q

Clustering: ground truth

A

In clustering, we believe that the ground truth is that each data point belongs to exactly one true group.

10
Q

Feature vs Cluster

A

The underlying structure is a feature allocation instead of a clustering.
Note that this is a different use of the word "feature" than we saw in previous videos.
A similar idea is to say that the data points exhibit mixed membership.

11
Q

feature allocation, admixture, mixed membership

A

capture the idea that data points can belong to multiple groups simultaneously.

12
Q

Eigenvector

A

An eigenvector of a square matrix A is a nonzero vector v whose direction is unchanged by A: Av = λv for some scalar λ, called the eigenvalue.
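The defining property Av = λv is easy to check numerically; a small sketch using NumPy's symmetric eigensolver:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)   # symmetric solver, ascending eigenvalues

# Each column v of `vecs` satisfies A @ v = lambda * v.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)
```

For this matrix the eigenvalues are 1 and 3, and applying A to either eigenvector only rescales it.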

13
Q

Eigenvector

A

In PCA, eigenvectors are used when each data point contains many measurements, not all of which may be meaningful, or when there is a lot of covariance among the measurements.
The eigenvectors of the covariance matrix with the largest eigenvalues are the principal components.

14
Q

Volume

A

number of edges in a cluster

15
Q

Laplacian

A

The entries on the diagonal are the degrees of the nodes; the degree of a node is the number of edges that meet it.
For an off-diagonal entry, if there is an edge between the two nodes, the entry is -1; otherwise it is 0.
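This is the standard L = D - A construction; a minimal sketch building it from an edge list (the function name is our own):

```python
import numpy as np

def graph_laplacian(n, edges):
    """L = D - A: node degrees on the diagonal, -1 per edge off it."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1          # off-diagonal: -1 where an edge exists
        L[j, i] -= 1
        L[i, i] += 1          # diagonal: each edge raises both degrees
        L[j, j] += 1
    return L

# A triangle (0-1-2) with a pendant node 3 attached to node 2.
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
```

Every row sums to zero, so the all-ones vector is always an eigenvector with eigenvalue 0.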

16
Q

TF-IDF format.

A

This stands for term frequency, inverse document frequency.
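A minimal sketch of the weighting with the common tf * log(N / df) formula (pure Python; the function name is our own, and real systems add smoothing and normalization):

```python
import math

def tf_idf(docs):
    """Weight term t in document d by tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}                              # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "mat"]]
weights = tf_idf(docs)
```

A term that appears in every document ("the") gets weight zero, while rare, distinctive terms are up-weighted.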

17
Q

Quantile regressions are great for

A

Quantile regressions are great for
determining the factors that affect the outcomes in the tails.
For example, in risk management,
we might predict the extremal conditional percentiles of Y using the information X.
This type of prediction is called the conditional value-at-risk analysis.
In medicine, we could be interested in how smoking and
other controllable factors X affect very low percentiles of infant birth weights Y.
In supply chain management, we could try to predict the inventory level for
a product that is able to meet the 90th percentile of demand Y
given the economic conditions described by X.
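Quantile regression fits the pinball (quantile) loss, whose minimizer is the q-th quantile rather than the mean; a small NumPy sketch on toy demand data (the function name is our own):

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Asymmetric loss whose minimizer is the q-th quantile of y."""
    e = y - pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

y = np.arange(1.0, 101.0)                              # toy outcomes 1..100
loss_p90 = pinball_loss(y, np.quantile(y, 0.9), 0.9)   # predict the 90th pct
loss_med = pinball_loss(y, np.quantile(y, 0.5), 0.9)   # predict the median
```

Under the q = 0.9 loss, under-prediction is penalized nine times as heavily as over-prediction, so the 90th-percentile prediction beats the median; that asymmetry is what steers the fit toward the tails.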

18
Q

p value

A

The p-value measures the strength of evidence against the null hypothesis:

the larger the p-value, the weaker the evidence against the null hypothesis, i.e. the more consistent the data are with it.

19
Q

p - value

A

The p-value is naturally a random variable, as it is the probability of
observing a value smaller than the observed sample average.
Since the sample average is a random variable, so
is the probability of seeing a value below it.
What is the distribution of this random variable?
Well, there is a magic answer.
The distribution of the p-value is uniform on the interval between 0 and 1 when
the distribution corresponding to the measurement is continuous.
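This uniformity is easy to see by simulation; a sketch that repeatedly draws data under a true null (mean 0, sd 1) and computes each one-sided p-value from the normal CDF (the setup is our own toy example):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n_trials, n = 2000, 30
pvals = np.empty(n_trials)
for t in range(n_trials):
    x = rng.standard_normal(n)               # data drawn under the null
    z = x.mean() * sqrt(n)                   # z-statistic (known sd = 1)
    pvals[t] = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
```

A histogram of `pvals` comes out flat: under a continuous null, every value in [0, 1] is equally likely.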

20
Q

Type 1 error

A

The error of rejecting a true hypothesis is called a type 1 error.

21
Q

Type 2 error

A

A type 2 error corresponds to the case of accepting, by mistake, a false hypothesis (failing to reject it).

22
Q

Perceptron

A

A perceptron is a function that has several inputs and one output: it computes a weighted sum of its inputs and passes it through a threshold.
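A minimal sketch of that weighted-sum-and-threshold computation (pure Python; the function name and weights are our own):

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of several inputs, thresholded to one 0/1 output."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

# Hand-picked weights make this perceptron compute logical AND:
# the sum exceeds the -1.5 bias only when both inputs are 1.
outputs = [perceptron([a, b], [1.0, 1.0], -1.5)
           for a in (0, 1) for b in (0, 1)]
```

In learning settings, the weights and bias are not hand-picked but adjusted from labeled examples.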

23
Q

Belief Matching

A

A popular approach to do so is the belief propagation algorithm.

24
Q

spectral clustering

A

The goal is to find a projection of the data before clustering.

25
Q

Sparse methods

A

The goal is simply to keep only the most relevant side information.

26
Q

Locality-sensitive hashing

A

Part of the broader family of approximate nearest neighbor methods.
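One classic scheme is random-hyperplane hashing for cosine similarity: a vector's bucket is the pattern of its signs against random hyperplanes, so vectors at a small angle usually collide. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def lsh_signature(x, planes):
    """Hash a vector to the pattern of its signs against random hyperplanes."""
    return tuple(int(v > 0) for v in planes @ x)

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 3))    # 8 random hyperplanes in R^3
a = np.array([1.0, 2.0, 3.0])
sig = lsh_signature(a, planes)
```

The signature depends only on direction: rescaling a vector leaves it unchanged, while negating the vector flips every bit.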

27
Q

Bipartite graph

A

A graph is called bipartite if the nodes can be partitioned into two sets, so
that there are only edges between the two sets and not within a set.
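Bipartiteness can be checked by trying to 2-color the graph with a breadth-first search; a self-contained sketch (the function name is our own):

```python
from collections import deque

def is_bipartite(n, edges):
    """Try to 2-color the nodes so no edge stays within one color class."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # put v in the other set
                    queue.append(v)
                elif color[v] == color[u]:
                    return False              # edge within one set
    return True
```

An even cycle is bipartite; any odd cycle, such as a triangle, is not.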