Data_to_Insights Flashcards

1
Q

Rand Index

A

The Rand index or Rand measure (named after William M. Rand), in statistics and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is called the adjusted Rand index. From a mathematical standpoint, the Rand index is related to accuracy, but it is applicable even when class labels are not used.
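The pair-counting definition is short enough to sketch directly; this is a minimal pure-Python illustration (the function name is our own):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    either both clusterings put the pair together, or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Relabeling the clusters does not change the score, which is why the index needs no class labels.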

2
Q

Rand Index

A

Measures like the Rand Index are called external evaluations,
because they require outside information about a ground truth clustering.

3
Q

Ggobi

A

GGobi is an open-source visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics such as tours, as well as familiar graphics such as scatterplots, bar charts, and parallel coordinate plots. Plots are interactive and linked with brushing and identification.

4
Q

Amazon Mechanical turk

A

Amazon Mechanical Turk (mturk.com) is a crowdsourcing marketplace where requesters post small tasks to be completed by a distributed human workforce. In data analysis it is commonly used to collect human labels, such as ground-truth annotations for training or evaluating models.

5
Q

Clustering

A

Grouping data according to similarity

6
Q

k-means

A

K-means partitions the data into k clusters by assigning each point to the nearest centroid and moving each centroid to the mean of its assigned points. Because the mean minimizes squared distance, k-means clustering is typically very sensitive to outliers.
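The two alternating steps can be sketched in a few lines of NumPy; this is a minimal illustration (the naive first-k-points initialization is our simplification, real implementations initialize randomly or with k-means++):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    centers = X[:k].astype(float).copy()   # naive init: first k points
    for _ in range(iters):
        # assign each point to the nearest centroid (squared distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated blobs of three points each.
X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans(X, 2)
```

On separated blobs the assignment recovers the blobs; an outlier added to either blob would drag its mean, which is the source of the sensitivity noted above.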

7
Q

k-medoids

A

K-medoids is based on medoids: representative data points chosen by minimizing the absolute distance between the points and the selected medoid, rather than minimizing the squared distance. As a result, it is more robust to noise and outliers than k-means.
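The robustness is easy to see in one dimension; a small sketch (the function name is our own) comparing the medoid with the mean in the presence of an outlier:

```python
import numpy as np

def medoid(points):
    """The data point minimizing total absolute distance to all others."""
    dists = np.abs(points[:, None] - points[None, :]).sum(axis=1)
    return points[dists.argmin()]

x = np.array([1.0, 2.0, 3.0, 100.0])   # one extreme outlier
```

The medoid stays at 2.0, a central data point, while the mean is dragged to 26.5 by the outlier.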

8
Q

PCA

A

Principal Component Analysis.
It is most often used when each data point contains many measurements, not all of which may be meaningful, or when there is a lot of covariance among the measurements.
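A minimal NumPy sketch of the idea, via an eigendecomposition of the covariance matrix (the function name is our own):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the directions of greatest variance."""
    Xc = X - X.mean(axis=0)                # center each measurement
    cov = np.cov(Xc, rowvar=False)         # covariance between measurements
    vals, vecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    top = vecs[:, ::-1][:, :n_components]  # top-variance directions
    return Xc @ top

# Two perfectly correlated measurements collapse onto a single component.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
Z = pca(X, 1)
```

Here both measurements carry the same information, so one principal component preserves all the variance.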

9
Q

Clustering: ground truth

A

In clustering, we believe that the ground truth is that each data point belongs to exactly one true group.

10
Q

Feature vs Cluster

A

The underlying structure is a feature allocation instead of a clustering.
Note that this is a different use of the word "feature" than we saw in previous videos.
A similar idea is to say that the data points exhibit mixed membership.

11
Q

feature allocation, admixture, mixed membership

A

capture the idea that data points can belong to multiple groups simultaneously.

12
Q

Eigenvector

A

An eigenvector of a square matrix A is a nonzero vector v whose direction is unchanged by A: Av = λv for some scalar λ, called the eigenvalue.
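The defining property Av = λv is easy to check numerically; a small sketch using NumPy's symmetric eigensolver:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)   # symmetric solver, ascending eigenvalues

# Each column v of `vecs` satisfies A @ v = lambda * v.
for lam, v in zip(vals, vecs.T):
    assert np.allclose(A @ v, lam * v)
```

For this matrix the eigenvalues are 1 and 3, and applying A to either eigenvector only rescales it.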

13
Q

Eigenvector

A

In PCA, eigenvectors are used when each data point contains many measurements, not all of which may be meaningful, or when there is a lot of covariance among the measurements.
The eigenvectors of the covariance matrix with the largest eigenvalues are the principal components.

14
Q

Volume

A

number of edges in a cluster

15
Q

Laplacian

A

The entries on the diagonal are the degrees of the nodes; the degree of a node is the number of edges that meet it.
For an off-diagonal entry, if there is an edge between the two nodes, the entry is -1; otherwise it is 0.
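This is the standard L = D - A construction; a minimal sketch building it from an edge list (the function name is our own):

```python
import numpy as np

def graph_laplacian(n, edges):
    """L = D - A: node degrees on the diagonal, -1 per edge off it."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1          # off-diagonal: -1 where an edge exists
        L[j, i] -= 1
        L[i, i] += 1          # diagonal: each edge raises both degrees
        L[j, j] += 1
    return L

# A triangle (0-1-2) with a pendant node 3 attached to node 2.
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
```

Every row sums to zero, so the all-ones vector is always an eigenvector with eigenvalue 0.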

16
Q

TF-IDF format.

A

This stands for term frequency, inverse document frequency.
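A minimal sketch of the weighting with the common tf * log(N / df) formula (pure Python; the function name is our own, and real systems add smoothing and normalization):

```python
import math

def tf_idf(docs):
    """Weight term t in document d by tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = {}                              # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [
        {t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
        for doc in docs
    ]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "mat"]]
weights = tf_idf(docs)
```

A term that appears in every document ("the") gets weight zero, while rare, distinctive terms are up-weighted.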

17
Q

Quantile regressions are great for

A

Quantile regressions are great for
determining the factors that affect the outcomes in the tails.
For example, in risk management,
we might predict the extremal conditional percentiles of Y using the information X.
This type of prediction is called the conditional value-at-risk analysis.
In medicine, we could be interested in how smoking and
other controllable factors X affect very low percentiles of infant birth weights Y.
In supply chain management, we could try to predict the inventory level for
a product that is able to meet the 90th percentile of demand Y
given the economic conditions described by X.
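Quantile regression fits the pinball (quantile) loss, whose minimizer is the q-th quantile rather than the mean; a small NumPy sketch on toy demand data (the function name is our own):

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Asymmetric loss whose minimizer is the q-th quantile of y."""
    e = y - pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

y = np.arange(1.0, 101.0)                              # toy outcomes 1..100
loss_p90 = pinball_loss(y, np.quantile(y, 0.9), 0.9)   # predict the 90th pct
loss_med = pinball_loss(y, np.quantile(y, 0.5), 0.9)   # predict the median
```

Under the q = 0.9 loss, under-prediction is penalized nine times as heavily as over-prediction, so the 90th-percentile prediction beats the median; that asymmetry is what steers the fit toward the tails.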

18
Q

p value

A

The p-value measures the strength of evidence against the null hypothesis:

the larger the p-value, the weaker the evidence against the null hypothesis, i.e. the more consistent the data are with it.

19
Q

p - value

A

The p-value is naturally a random variable, as it is the probability of
observing a value smaller than the observed sample average.
Since the sample average is a random variable, so
is the probability of seeing a value below it.
What is the distribution of this random variable?
Well, there is a magic answer.
The distribution of the p-value is uniform on the interval between 0 and 1 when
the distribution corresponding to the measurement is continuous.
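This uniformity is easy to see by simulation; a sketch that repeatedly draws data under a true null (mean 0, sd 1) and computes each one-sided p-value from the normal CDF (the setup is our own toy example):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n_trials, n = 2000, 30
pvals = np.empty(n_trials)
for t in range(n_trials):
    x = rng.standard_normal(n)               # data drawn under the null
    z = x.mean() * sqrt(n)                   # z-statistic (known sd = 1)
    pvals[t] = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
```

A histogram of `pvals` comes out flat: under a continuous null, every value in [0, 1] is equally likely.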

20
Q

Type 1 error

A

The error of rejecting a true hypothesis is called a type 1 error.

21
Q

Type 2 error

A

A type 2 error corresponds to the case of accepting, by mistake, a false hypothesis (failing to reject it).

22
Q

Perceptron

A

A perceptron is a function that has several inputs and one output: it computes a weighted sum of its inputs and passes it through a threshold.
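A minimal sketch of that weighted-sum-and-threshold computation (pure Python; the function name and weights are our own):

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of several inputs, thresholded to one 0/1 output."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

# Hand-picked weights make this perceptron compute logical AND:
# the sum exceeds the -1.5 bias only when both inputs are 1.
outputs = [perceptron([a, b], [1.0, 1.0], -1.5)
           for a in (0, 1) for b in (0, 1)]
```

In learning settings, the weights and bias are not hand-picked but adjusted from labeled examples.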

23
Q

Belief Matching

A

A popular approach to do so is the belief propagation algorithm.

24
Q

spectral clustering

A

The goal is to find a projection of the data before clustering.

25
Q

Sparse methods

A

The goal is simply to keep only the most relevant side information.

26
Q

Locality-sensitive hashing

A

Part of the broader family of approximate nearest neighbor methods.
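One classic scheme is random-hyperplane hashing for cosine similarity: a vector's bucket is the pattern of its signs against random hyperplanes, so vectors at a small angle usually collide. A minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def lsh_signature(x, planes):
    """Hash a vector to the pattern of its signs against random hyperplanes."""
    return tuple(int(v > 0) for v in planes @ x)

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 3))    # 8 random hyperplanes in R^3
a = np.array([1.0, 2.0, 3.0])
sig = lsh_signature(a, planes)
```

The signature depends only on direction: rescaling a vector leaves it unchanged, while negating the vector flips every bit.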

27
Q

Bipartite graph

A

A graph is called bipartite if the nodes can be partitioned into two sets, so
that there are only edges between the two sets and not within a set.
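Bipartiteness can be checked by trying to 2-color the graph with a breadth-first search; a self-contained sketch (the function name is our own):

```python
from collections import deque

def is_bipartite(n, edges):
    """Try to 2-color the nodes so no edge stays within one color class."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]   # put v in the other set
                    queue.append(v)
                elif color[v] == color[u]:
                    return False              # edge within one set
    return True
```

An even cycle is bipartite; any odd cycle, such as a triangle, is not.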