Overview Flashcards

Recall all important concepts presented in the course

1
Q

What is statistics? (informal)

A

It is a field that takes data from the world and transforms it into information that can be used to make decisions.

2
Q

What is a Scatter Plot? Give a reason to use it.

A

In this kind of graph, each data point is plotted as a dot on a Cartesian plane.

A scatter plot lets us see whether or not the data has a linear relationship.

3
Q

What are outliers?

A

They are data points that deviate a lot from the expectation and can distort the mean of the data set.

4
Q

What is a bar chart? Give a reason to use it.

A

In a bar chart, we choose intervals on the x axis and aggregate the values of the data points that fall in each interval, drawing a bar whose height is the mean of those values.

That way we get rid of the noise and get a better understanding of the global trend of the data.

5
Q

What is a histogram?

A

A histogram is a special case of a Bar Chart.

While Bar Charts look at 2D data, histograms look at 1D data.

In histograms, the X axis is the data we are seeing (e.g. salary) and on the Y axis we see the frequency (or count) of how many data points fall into the interval defined in the X axis.

Ex: how many employees fall into the bucket of salaries from $120,000 to $130,000?

6
Q

What is Simpson's paradox, also known as the reversal paradox?

A

It is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.

Seen in the UC Berkeley gender bias study.
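
A small numeric sketch of the reversal may help. The admission counts below are invented for illustration (they are not the Berkeley figures): women have the higher admission rate in each department, yet the lower rate once the departments are pooled.

```python
# Hypothetical (admitted, applicants) counts per department and group.
data = {
    "dept_x": {"men": (80, 100), "women": (18, 20)},
    "dept_y": {"men": (2, 20), "women": (20, 100)},
}


def rate(admitted, applicants):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


# Within each department, women are admitted at a higher rate.
per_dept = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in data.items()
}

# Pooling the departments reverses the trend.
pooled = {}
for group in ("men", "women"):
    admitted = sum(data[dept][group][0] for dept in data)
    applicants = sum(data[dept][group][1] for dept in data)
    pooled[group] = rate(admitted, applicants)

print(per_dept)
print(pooled)
```

The reversal comes from the group sizes: most women in this made-up example apply to the department with the low admission rate.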

7
Q

What is probability theory?

A

Probability theory is the branch of math that deals with probability. Probability measures how likely an event is to occur.

8
Q

What is the first law of probability?

A

The probabilities (P) of all possible mutually exclusive outcomes always sum to 1.

9
Q

What is the probability that multiple independent events all happen?

A

It is the product of their individual probabilities.
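
As a quick illustration, assuming three fair coin flips (the 0.5 values are an assumption, not part of the card), the probability that all three land heads is:

```python
import math

# Assumed probabilities of three independent events:
# three fair coin flips each landing heads.
probabilities = [0.5, 0.5, 0.5]

# Independent events: multiply the individual probabilities.
p_all = math.prod(probabilities)
print(p_all)  # 0.125
```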

10
Q

What is a conditional probability? Give an example.

A

It is the probability that an event happens given that another event, which it depends on, has happened.

Ex: the probability of a cancer test being positive given that the patient has cancer.

11
Q

What is the notation of the conditional probability of outcome A given an outcome B has happened?

A

P( Outcome A | Outcome B )

12
Q

Given an event A that depends on an event B, what is the total probability of A? Use as examples of A and B a cancer test being positive and the patient having cancer.

A

P(Positive) = P(Positive | Cancer) * P(Cancer) + P(Positive | !Cancer) * P(!Cancer)

13
Q

What is Bayes Rule?

A

It is a method to find the probability of an event given the outcome of another event that depends on it (e.g. the probability of having cancer given that a test result is positive).

14
Q

Write an example of Bayes Rule calculation for having cancer given the test was positive. Show which terms are the Posterior, the Joint and the Prior probabilities.

A

P(Cancer) - prior probability

P(Positive | Cancer) * P(Cancer) = P(Positive, Cancer) - joint probability

P(Positive) = P(Positive | Cancer) * P(Cancer) + P(Positive | !Cancer) * P(!Cancer) - total probability (normalizer)

P(Cancer | Positive) = P(Positive | Cancer) * P(Cancer) / P(Positive) - posterior probability
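
The calculation can be sketched numerically. The three input probabilities below are assumed values chosen for illustration, not figures from the course:

```python
# Assumed inputs (hypothetical numbers).
p_cancer = 0.01             # prior: P(Cancer)
p_pos_given_cancer = 0.9    # P(Positive | Cancer)
p_pos_given_healthy = 0.1   # P(Positive | !Cancer)

# Joint probability of testing positive and having cancer.
joint = p_pos_given_cancer * p_cancer

# Total probability of a positive test (the normalizer).
p_positive = joint + p_pos_given_healthy * (1 - p_cancer)

# Posterior: P(Cancer | Positive).
posterior = joint / p_positive
print(round(posterior, 4))  # 0.0833
```

Even with a fairly accurate test, the posterior stays low under these assumptions because the prior is so small.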

15
Q

In a continuous distribution of probability, what is the probability of a specific data point?

A

Zero

16
Q

What is a density function?

A

It is a function that describes a continuous probability distribution.

17
Q

How do I calculate probability given a density function for a continuous distribution?

A

It is the integral of the density function over the desired interval.
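
A minimal sketch of that integral, assuming a standard normal density and a simple midpoint rule (both are choices made for this example; `normal_pdf` and `integrate` are illustrative helper names):

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability that a standard normal value falls between -1 and 1.
p = integrate(normal_pdf, -1.0, 1.0)
print(round(p, 4))  # 0.6827
```

Integrating over a zero-width interval gives 0, which matches the earlier card on the probability of a single point.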

18
Q

What are estimators?

A

They are techniques that, applied to a sequence of outcomes (e.g. coin flips), estimate the probability of each outcome (e.g. the probability of getting heads).

19
Q

What is the Maximum Likelihood Estimator? Use as an example calculating P(heads) from coin flips.

A

We can calculate P(heads) empirically with the

Maximum Likelihood Estimator:

SUM(X1…Xn) / N (the number of times we see heads divided by the number of flips)
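
In code, with heads encoded as 1 and tails as 0 (the flip sequence is made up for illustration):

```python
def mle(outcomes):
    """Maximum likelihood estimate of P(heads): heads count / number of flips."""
    return sum(outcomes) / len(outcomes)


# Hypothetical flips: 1 = heads, 0 = tails.
flips = [1, 0, 1, 1, 0, 1, 0, 1]
p_heads = mle(flips)
print(p_heads)  # 0.625
```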

20
Q

What is a confounding variable?

A

It is a variable that influences two or more other variables, causing them to show a correlation with each other.

21
Q

What is the disadvantage of the maximum likelihood Estimator?

A

It is very sensitive to the amount of data available. It can go to extremes if little data is available.

22
Q

Explain the Laplacian Estimator.

A

It is calculated like the maximum likelihood estimator but we add fake neutral data to the data set in order to smooth the result.
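
One common form, assumed here, adds k fake observations of each possible outcome (k = 1 is often called add-one smoothing). The function name and parameters are illustrative:

```python
def laplace_estimate(outcomes, k=1, num_classes=2):
    """Smoothed estimate of P(heads).

    Adds k fake observations per class (assumed form of the
    'fake neutral data'); outcomes use 1 = heads, 0 = tails.
    """
    heads = sum(outcomes)
    return (heads + k) / (len(outcomes) + k * num_classes)


# With a single observed heads, the maximum likelihood estimate would be 1.0;
# the smoothed estimate is pulled toward 0.5.
print(laplace_estimate([1]))
# With no data at all, the estimate is the neutral 0.5.
print(laplace_estimate([]))  # 0.5
```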

23
Q

What are the three Ms of statistics?

A

Mean, Median, Mode

24
Q

What is the Mean of a dataset? What is it good for?

A

SUM(X1…Xn) / N

It gives a good single-number characterization of a group of data; however, the mean is very sensitive to outliers.
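
A short sketch of that sensitivity, using made-up values and the standard library `statistics` module. The median is printed alongside for comparison:

```python
import statistics

# Hypothetical dataset, then the same data with one extreme value added.
data = [30, 32, 35, 31, 33]
with_outlier = data + [1000]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)

# statistics.median averages the two middle values for even-sized data.
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # 32.2 193.5
print(median_before, median_after)  # 32 32.5
```

A single extreme value moves the mean far more than the median.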

25
Q

What is the Median of a dataset? How do you find it?

A

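It is the middle data point of the sorted dataset. To find it, sort the dataset and take the middle point. If the number of points is even there are two middle points; choose either one (another common convention is to average the two).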

26
Q

What is the Mode of a dataset? How do I find it?

A

It is the most frequent value in the dataset.

To find it, count how often each value occurs and take the most frequent one. In case of a tie, pick any one of the tied values.
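
A minimal sketch of the counting step on a made-up dataset with a tie, using `collections.Counter` and, for comparison, `statistics.multimode` (Python 3.8+), which returns every tied value:

```python
from collections import Counter
import statistics

# Hypothetical dataset in which 2 and 3 are tied as most frequent.
data = [1, 2, 2, 3, 3, 4]

counts = Counter(data)
highest = max(counts.values())

# All values that reach the highest count.
modes = [value for value, count in counts.items() if count == highest]
print(modes)                       # [2, 3]
print(statistics.multimode(data))  # [2, 3]
```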

27
Q

What is a multimodal dataset?

A

It is a dataset whose values concentrate around several different points, that is, a dataset with more than one mode.

28
Q

What information is not captured by any of the 3 Ms of statistics?

A

The spread of the data in the dataset

29
Q

Which information captures the spread of a dataset?

A

The variance.

30
Q

What is the formula of the variance?

A

SUM((Xi - Mean)^2) / N

31
Q

What is the standard deviation?

A

The square root of the variance.
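
Both quantities can be sketched on a small made-up dataset, dividing by N as in the variance formula of the previous card:

```python
import math

# Hypothetical dataset.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n

# Variance: average squared distance to the mean (divides by N).
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```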

32
Q

What happens to the mean and the variance if you add a constant C to every datapoint?

A

The mean will increase by C

The variance will stay the same

33
Q

What happens to the mean and the variance if you multiply every datapoint by a factor (e.g. a percentage)?

A

The mean will also be multiplied by the factor, and the standard deviation by its absolute value.

The variance will be multiplied by the square of the factor.
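
The effect of shifting (previous card) and scaling (this card) can be checked on a made-up dataset. The constant 10 and the factor 3 are arbitrary choices:

```python
import statistics

# Hypothetical dataset with mean 5, variance 4, standard deviation 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

shifted = [x + 10.0 for x in data]  # add a constant C = 10
scaled = [x * 3.0 for x in data]    # multiply by a factor of 3


def summary(values):
    """Mean, population variance and population standard deviation."""
    return (
        statistics.mean(values),
        statistics.pvariance(values),
        statistics.pstdev(values),
    )


print(summary(data))     # mean 5, variance 4, std deviation 2
print(summary(shifted))  # mean 15, variance 4, std deviation 2
print(summary(scaled))   # mean 15, variance 36, std deviation 6
```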

34
Q

What is the formula of the Standard Score of a datapoint? What does it mean?

A

Z = (X - Mean) / Std.deviation

It represents the distance of a specific datapoint from the mean of its dataset, measured in standard deviations.
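
A brief sketch on a made-up dataset; `z_score` is an illustrative helper name:

```python
import statistics

# Hypothetical dataset with mean 5 and standard deviation 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation


def z_score(x):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / std_dev


print(z_score(9.0))  # 2 standard deviations above the mean
print(z_score(2.0))  # 1.5 standard deviations below the mean
```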

35
Q

What are outliers? What are their possible origins?

A

Outliers are datapoints that don't accurately represent the typical behavior of the real world. They are usually discarded from the analysis of the dataset.

Possible origins can be:

  • corruption of the database
  • error during data capture (ex: typos)
  • exceptions in the real world
36
Q

What are quartiles? How do you calculate them?

A

Quartiles are a way of setting aside the lower and upper ends of the data.

You split the dataset at the median, then split each half at its own median; these two medians are the lower and upper quartiles.

1 2 3 3 4 4 5 5 6 7 8

3: lower quartile
4: median
6: upper quartile

The data between the lower and upper quartiles is called the interquartile range.

This works cleanly for datasets of size 4n + 3, which keeps the split symmetric.
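
The split described above can be sketched as follows for the example dataset. The function name is illustrative, and it only accepts sizes of the form 4n + 3:

```python
def quartiles(data):
    """Return (lower quartile, median, upper quartile).

    Uses the median-of-halves method and requires len(data) = 4n + 3
    so that every split lands on an actual datapoint.
    """
    values = sorted(data)
    n = len(values)
    if n % 4 != 3:
        raise ValueError("dataset size must be of the form 4n + 3")
    mid = n // 2
    lower_half = values[:mid]       # points below the median
    upper_half = values[mid + 1:]   # points above the median
    return (
        lower_half[len(lower_half) // 2],
        values[mid],
        upper_half[len(upper_half) // 2],
    )


print(quartiles([1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8]))  # (3, 4, 6)
```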

37
Q

What are percentiles?

A

Similar to quartiles, percentiles let us ignore a percentage of the data.

The k-th percentile is the value below which the bottom k% of the data falls.

38
Q

For n coin flips, how many outcomes have k heads?

A

n! / ( (n-k)! * k! )
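
The formula can be checked against the standard library's `math.comb` (Python 3.8+). The values n = 5 and k = 2 are arbitrary:

```python
import math


def outcomes_with_k_heads(n, k):
    """Number of n-flip sequences with exactly k heads: n! / ((n-k)! * k!)."""
    return math.factorial(n) // (math.factorial(n - k) * math.factorial(k))


print(outcomes_with_k_heads(5, 2))  # 10
print(math.comb(5, 2))              # 10
```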

39
Q

For a sequence of events, ex: coin flips, we might be interested in the probability of a certain kind of outcome, ex: number of heads = k. Which kind of distribution models that situation?

A

The binomial distribution.
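
A minimal sketch of the binomial probability of k heads in n flips, where p is the probability of heads on a single flip. The example values are arbitrary:

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k heads in n independent flips with P(heads) = p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 5 fair flips.
print(binomial_pmf(2, 5, 0.5))  # 0.3125

# The probabilities over all possible k sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
print(total)
```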

40
Q

For a very large number of events, the binomial distribution can be approximated by which function?

A

Gaussian
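
The approximation can be checked numerically. The sketch below assumes the standard matching of parameters, mean n*p and variance n*p*(1-p), and uses n = 1000 fair flips as an arbitrary example:

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(
        2 * math.pi * variance
    )


n, p = 1000, 0.5
mean = n * p                # 500.0
variance = n * p * (1 - p)  # 250.0

# Compare the exact binomial probability with the Gaussian density.
for k in (480, 500, 520):
    print(k, binomial_pmf(k, n, p), gaussian_pdf(k, mean, variance))
```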

41
Q

What is the distribution of the sum of a sequence of independent events that each follow a normal distribution (ex: a sequence of golf strokes)?

A

It will be a normal distribution whose mean is the sum of the means and whose variance is the sum of the variances.
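
A small simulation can illustrate this, assuming two independent normal variables with made-up parameters. The seed is fixed only to make the run repeatable:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical parameters of two independent normal variables.
mean_a, std_a = 1.0, 2.0
mean_b, std_b = 3.0, 1.5

# Draw many samples of the sum of the two variables.
totals = [
    random.gauss(mean_a, std_a) + random.gauss(mean_b, std_b)
    for _ in range(50_000)
]

expected_mean = mean_a + mean_b          # sum of the means
expected_variance = std_a**2 + std_b**2  # sum of the variances

print(statistics.mean(totals), expected_mean)
print(statistics.pvariance(totals), expected_variance)
```

The sample mean and variance should land close to the expected values, with small differences due to sampling noise.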