Overview Flashcards

Question 1

Q

What is statistics? (informal)

Answer

A

It is a field that takes data from the world and transforms it into information that can be used to make decisions.

Question 2

Q

What is a Scatter Plot? Give a reason to use it.

Answer

A

In this kind of graph, each data point is plotted as a dot on a Cartesian plane.

A scatter plot lets us see whether the data has a linear relationship or not.

Question 3

Q

What are outliers?

Answer

A

They are data points that deviate a lot from the expectation and can distort the mean of the data set.

Question 4

Q

What is a bar chart? Give a reason to use it.

Answer

A

In a bar chart, we choose intervals on the x axis and aggregate the data points that fall in each interval, drawing a bar whose height is the mean of their values.

That way we get rid of the noise and get a better understanding of the global trend of the data.
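
As a minimal sketch of the aggregation described above: the (age, salary) pairs and the bucket width of 10 are hypothetical values chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (age, salary) data points, invented for illustration.
points = [(22, 30_000), (27, 42_000), (29, 48_000),
          (33, 50_000), (38, 70_000), (41, 64_000)]
bucket_width = 10  # assumed width of each x-axis interval

# Group the y values by the x-axis interval their x value falls into.
groups = defaultdict(list)
for x, y in points:
    groups[(x // bucket_width) * bucket_width].append(y)

# Each bar's height is the mean of the values in its interval.
bar_heights = {start: sum(ys) / len(ys) for start, ys in sorted(groups.items())}

for start, height in bar_heights.items():
    print(f"{start} to {start + bucket_width}: {height}")
```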

Question 5

Q

What is a histogram?

Answer

A

A histogram is a special case of a Bar Chart.

While Bar Charts look at 2D data, histograms look at 1D data.

In histograms, the X axis is the data we are seeing (e.g. salary) and on the Y axis we see the frequency (or count) of how many data points fall into the interval defined in the X axis.

Ex: from $120,000 to $130,000 salary, how many employees fall into that bucket?
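
The counting behind a histogram can be sketched as follows. The salaries and the $10,000 bucket width are made-up values, and each bucket is assumed to include its lower edge and exclude its upper edge.

```python
from collections import Counter

# Hypothetical salaries, invented for illustration.
salaries = [95_000, 121_000, 124_500, 129_999, 130_000, 138_000, 151_000]
bucket_width = 10_000  # assumed bucket size


def bucket_start(value, width):
    """Return the lower edge of the bucket [start, start + width) holding value."""
    return (value // width) * width


# Count how many data points fall into each bucket.
histogram = Counter(bucket_start(s, bucket_width) for s in salaries)

for start in sorted(histogram):
    print(f"${start:,} to ${start + bucket_width:,}: {histogram[start]}")
```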

Question 6

Q

What is Simpson's Paradox, also known as the reversal paradox?

Answer

A

It is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.

It was observed in the UC Berkeley gender bias study.
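
A small sketch of the reversal, using invented admission counts (these are NOT the Berkeley figures; the department names and all numbers are hypothetical):

```python
# Hypothetical (admitted, applicants) counts, invented for illustration only.
admissions = {
    "Dept X": {"men": (80, 100), "women": (9, 10)},
    "Dept Y": {"men": (2, 10), "women": (30, 100)},
}


def rate(admitted, applicants):
    """Admission rate for one group."""
    return admitted / applicants


def combined_rate(group):
    """Admission rate for a group after pooling all departments."""
    admitted = sum(dept[group][0] for dept in admissions.values())
    applicants = sum(dept[group][1] for dept in admissions.values())
    return rate(admitted, applicants)


# Within each department, women have the higher admission rate...
for name, dept in admissions.items():
    print(name, {g: rate(*counts) for g, counts in dept.items()})

# ...but once the departments are combined, the trend reverses.
print("Combined", {g: combined_rate(g) for g in ("men", "women")})
```

The reversal happens because most women in this made-up data apply to the department with the lower admission rate.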

Question 7

Q

What is probability theory?

Answer

A

Probability theory is the branch of mathematics that deals with probability. Probability measures how likely an event is to occur.

Question 8

Q

What is the first law of probability?

Answer

A

The sum of the probabilities (P) of all possible mutually exclusive outcomes is always 1.

Question 9

Q

What is the probability that multiple independent events all happen?

Answer

A

It is the product of their probabilities.
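
For example, assuming a fair coin (P(heads) = 0.5, an assumption made here for illustration), the probability of three heads in a row can be sketched as:

```python
from math import prod

p_heads = 0.5  # assumed fair coin

# Three independent flips: multiply the individual probabilities.
p_three_heads = prod([p_heads] * 3)

print(p_three_heads)
```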

Question 10

Q

What is a conditional probability? Give an example.

Answer

A

It is the probability that an event happens given that another event, on which it depends, has happened.

Ex: the probability of a cancer test being positive given that the patient has cancer.

Question 11

Q

What is the notation of the conditional probability of outcome A given an outcome B has happened?

Answer

A

P( Outcome A | Outcome B )

Question 12

Q

Given an event A that depends on event B, what is the total probability of A? Use as examples of A and B a cancer test being positive and the patient having cancer.

Answer

A

P(Positive) = P(Positive|Cancer)*P(Cancer) + P(Positive|!Cancer)*P(!Cancer)

Question 13

Q

What is Bayes Rule?

Answer

A

It is a method for finding the probability of an event given the outcome of another event that depends on it (e.g. the probability of having cancer given that a test result is positive).

Question 14

Q

Write an example of Bayes Rule calculation for having cancer given the test was positive. Show which terms are the Posterior, the Joint and the Prior probabilities.

Answer

A

P(Cancer) - prior probability

P(Positive | Cancer) * P(Cancer) - joint probability of a positive test and cancer

P(Positive) = P(Positive | Cancer) * P(Cancer) + P(Positive | !Cancer) * P(!Cancer) - total probability (normalizer)

P(Cancer | Positive) = P(Positive | Cancer) * P(Cancer) / P(Positive) - posterior probability
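
A numeric sketch of the calculation above. The three input probabilities are hypothetical values picked for illustration, not figures from any real test.

```python
# Hypothetical inputs, chosen only for illustration.
p_cancer = 0.01               # prior: P(Cancer)
p_pos_given_cancer = 0.9      # P(Positive | Cancer)
p_pos_given_no_cancer = 0.2   # P(Positive | !Cancer)

# Joint probabilities: P(Positive, Cancer) and P(Positive, !Cancer).
joint_cancer = p_pos_given_cancer * p_cancer
joint_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)

# Total probability of a positive test (the normalizer).
p_positive = joint_cancer + joint_no_cancer

# Posterior: P(Cancer | Positive).
posterior = joint_cancer / p_positive

print(f"P(Positive) = {p_positive:.4f}")
print(f"P(Cancer | Positive) = {posterior:.4f}")
```

With these assumed inputs the posterior stays small even after a positive test, because the prior is low and false positives are common.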

Question 15

Q

In a continuous probability distribution, what is the probability of a specific data point?

Answer

A

Zero. In a continuous distribution, only intervals have a non-zero probability.

Question 16

Q

What is a density function?

Answer

A

It is a function that describes a continuous probability distribution.

Question 17

Q

How do I calculate probability given a density function for a continuous distribution?

Answer

A

It is the integral of the density function over the desired interval.
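
As a sketch, the integral can be approximated numerically. The standard normal density and the trapezoid rule are choices made here for illustration; the helper names are hypothetical.

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution (used here as an example density)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


# Probability of falling within one standard deviation of the mean.
p_within_one_std = integrate(normal_pdf, -1.0, 1.0)

# An interval of zero width has zero probability.
p_single_point = integrate(normal_pdf, 0.5, 0.5)

print(p_within_one_std, p_single_point)
```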

Question 18

Q

What are estimators?

Answer

A

They are techniques that, applied to a sequence of outcomes (e.g. coin flips), estimate the probability of each outcome (e.g. the probability of getting heads).

Question 19

Q

What is the Maximum Likelihood Estimator? Use as examples calculating P(heads) from coin flips.

Answer

A

We can estimate P(heads) empirically with the Maximum Likelihood Estimator:

SUM(X1…Xn) / N, where each Xi is 1 for heads and 0 for tails (the number of times we see heads divided by the number of flips)
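
A minimal sketch of this estimator, with a made-up sequence of flips encoded as 1 for heads and 0 for tails:

```python
def maximum_likelihood_estimate(outcomes):
    """Fraction of outcomes equal to 1 (heads)."""
    return sum(outcomes) / len(outcomes)


# Hypothetical coin flips: 1 = heads, 0 = tails.
flips = [1, 0, 1, 1, 0, 1, 1, 0]

p_heads = maximum_likelihood_estimate(flips)
print(p_heads)
```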

Question 20

Q

What is a confounding variable?

Answer

A

It is a variable that influences two or more other variables, making them appear correlated with each other.

Question 21

Q

What is the disadvantage of the maximum likelihood Estimator?

Answer

A

It is very sensitive to the amount of data available. It can go to extremes (for example, an estimate of exactly 0 or 1) when little data is available.

Question 22

Q

Explain the Laplacian Estimator.

Answer

A

It is calculated like the maximum likelihood estimator but we add fake neutral data to the data set in order to smooth the result.
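
A sketch of the idea for coin flips, assuming the common add-one form: one fake observation is added for each of the two possible outcomes. The function name and the parameters `k` and `num_outcomes` are illustrative.

```python
def laplace_estimate(outcomes, k=1, num_outcomes=2):
    """Estimate P(heads) after adding k fake observations of each outcome."""
    return (sum(outcomes) + k) / (len(outcomes) + k * num_outcomes)


# With a single observed heads (1 = heads), the maximum likelihood
# estimate would be 1.0; the smoothed estimate is less extreme.
print(laplace_estimate([1]))

# With no data at all, the estimate falls back to the neutral value.
print(laplace_estimate([]))
```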

Question 23

Q

What are the three Ms of statistics?

Answer

A

Mean, Median, Mode

Question 24

Q

What is the Mean of a dataset? What is it good for?

Answer

A

SUM(Xi) / N

It gives a single-number characterization of a group of data. However, the mean is very sensitive to outliers.

Question 25

Q

What is the Median of a dataset? How do you find it?

Answer

A

It is the exact middle data point of the data set. To find it, sort the dataset and take the middle point. With an even number of points there are two middle points; choose either one (a common convention is to average them).

Question 26

Q

What is the Mode of a dataset? How do I find it?

Answer

A

It is the most frequent datapoint.

To find it, count how often each value occurs and take the most frequent one. In case of a tie, pick any one at random.
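
The three Ms can be computed with Python's standard `statistics` module. The dataset below is hypothetical (salaries in thousands) and includes one outlier to show how much more the mean moves than the median or mode.

```python
import statistics

# Hypothetical salaries in thousands; 500 is a deliberate outlier.
data = [30, 32, 32, 35, 40, 45, 500]

mean = statistics.mean(data)      # pulled far upward by the outlier
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```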

Question 27

Q

What is a multimodal dataset?

Answer

A

It is a dataset whose values concentrate around several different points, that is, it has more than one mode.

Question 28

Q

What information is not captured by any of the 3 Ms of statistics?

Answer

A

The spread of the data in the dataset

Question 29

Q

Which information captures the spread of a dataset?

Answer

A

The variance.

Question 30

Q

What is the formula of the variance?

Answer

A

SUM((Xi - Mean)^2) / N

Question 31

Q

What is the standard deviation?

Answer

A

The square root of the variance.
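
A direct sketch of the variance formula (dividing by N) and the standard deviation, on a made-up dataset:

```python
import math


def variance(data):
    """Mean of the squared deviations from the mean (divides by N)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


def standard_deviation(data):
    """Square root of the variance."""
    return math.sqrt(variance(data))


# Hypothetical dataset with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(data), standard_deviation(data))
```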

Question 32

Q

What happens to the mean and the variance if you add a constant C to every datapoint?

Answer

A

The mean will increase by C

The variance will stay the same

Question 33

Q

What happens to the mean and the variance if you multiply every datapoint by a factor (percentage for ex)?

Answer

A

The mean will also be multiplied by the factor, and the standard deviation by the absolute value of the factor.

The variance will be multiplied by the square of the factor.
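
Both rules (adding a constant, multiplying by a factor) can be checked on a small made-up dataset. The constant 10 and the factor 3 are arbitrary choices.

```python
import statistics

# Hypothetical dataset: mean 5, variance 4, standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

c = 10      # arbitrary constant added to every datapoint
factor = 3  # arbitrary factor multiplying every datapoint

shifted = [x + c for x in data]
scaled = [x * factor for x in data]

# Adding a constant moves the mean but leaves the variance unchanged.
print(statistics.mean(shifted), statistics.pvariance(shifted))

# Multiplying scales the mean and standard deviation by the factor
# and the variance by the factor squared.
print(statistics.mean(scaled), statistics.pstdev(scaled), statistics.pvariance(scaled))
```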

Question 34

Q

What is the formula of the Standard Score of a datapoint? What does it mean?

Answer

A

Z = (X - Mean) / Std.deviation

It represents how far a specific datapoint is from the mean, measured in standard deviations of the dataset.
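
A small sketch of the formula; the mean of 5 and standard deviation of 2 are assumed values for illustration.

```python
def standard_score(x, mean, std):
    """Signed distance of x from the mean, in units of standard deviations."""
    return (x - mean) / std


# Assumed dataset statistics: mean 5, standard deviation 2.
print(standard_score(9, mean=5, std=2))  # above the mean
print(standard_score(4, mean=5, std=2))  # below the mean
```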

Question 35

Q

What are outliers? What are their possible origins?

Answer

A

Outliers are datapoints that don’t accurately represent the real world. They should usually be discarded from the analysis of the dataset.

Possible origins can be:

corruption of the database
error during data capture (ex: typos)
exceptions in the real world

Question 36

Q

What are quartiles? How do you calculate them?

Answer

A

Quartiles are a way of setting aside the lower and upper ends of the data.

You split the dataset at the median, then split each half at its own median (these two points are the quartiles).

1 2 3 3 4 4 5 5 6 7 8

3: lower quartile
4: median
6: upper quartile

The data between the lower and upper quartiles is called the interquartile range.

This works cleanly for datasets of a certain size (4n + 3), which keeps the split symmetric.
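
The median-of-halves procedure above can be sketched as follows. The function name is illustrative, and the sketch only handles dataset sizes of the form 4n + 3.

```python
def quartiles(data):
    """Return (lower quartile, median, upper quartile) for a dataset of size 4n + 3."""
    values = sorted(data)
    if len(values) % 4 != 3:
        raise ValueError("this sketch only handles datasets of size 4n + 3")
    mid = len(values) // 2
    lower_half = values[:mid]      # points below the median
    upper_half = values[mid + 1:]  # points above the median
    return (
        lower_half[len(lower_half) // 2],
        values[mid],
        upper_half[len(upper_half) // 2],
    )


print(quartiles([1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8]))
```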

Question 37

Q

What are percentiles?

Answer

A

Similar to quartiles, percentiles split off a percentage of the data.

The k-th percentile is the value below which the bottom k% of the data falls.

Question 38

Q

For n coin flips, how many outcomes have k heads?

Answer

A

n! / ( (n-k)! * k! )
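
As a sanity check, the formula can be compared with a brute-force count over every possible sequence of flips. The values n = 5 and k = 2 are arbitrary.

```python
from itertools import product
from math import comb, factorial


def count_outcomes(n, k):
    """Number of n-flip sequences with exactly k heads: n! / ((n-k)! * k!)."""
    return factorial(n) // (factorial(n - k) * factorial(k))


n, k = 5, 2  # arbitrary example values

# Enumerate all 2**n sequences and count those with exactly k heads.
brute_force = sum(1 for flips in product("HT", repeat=n) if flips.count("H") == k)

print(count_outcomes(n, k), comb(n, k), brute_force)
```

`math.comb` from the standard library computes the same quantity directly.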

Question 39

Q

For a sequence of events, ex: coin flips, we might be interested in the probability of a certain kind of outcome, ex: number of heads = k. Which kind of distribution models that situation?

Answer

A

The binomial distribution.
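
A sketch of the binomial probability of exactly k heads in n flips. The card does not state the formula; the standard one is used here, and p (the probability of heads on a single flip) is an assumed parameter.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 5, 0.5  # assumed: five flips of a fair coin

for k in range(n + 1):
    print(k, binomial_pmf(k, n, p))
```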

Question 40

Q

For a very large number of events, the binomial distribution can be approximated by which function?

Answer

A

The normal (Gaussian) distribution.

Question 41

Q

What is the distribution of the sum of a sequence of independent events that each follow a normal distribution (ex: a sequence of golf strokes)?

Answer

A

It will be a normal distribution whose mean is the sum of the means and whose variance is the sum of the variances.
