Data Science Final Flashcards

(33 cards)

1
Q

SD of the sample mean

A

The SD of the sample mean measures how much, on avg, a sample mean is expected to vary from the true pop mean, i.e. how accurate our estimate of the mean is based on the sample size.

SD of sample mean = population SD / square root of sample size

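The formula can be sketched in NumPy (the population SD of 15 and sample size of 100 are made-up numbers for illustration):

```python
import numpy as np

# Made-up numbers: population SD 15, sample size 100
pop_sd = 15
n = 100

# SD of the sample mean (the standard error): population SD / sqrt(sample size)
se = pop_sd / np.sqrt(n)

# Quadrupling the sample size halves the SD of the sample mean
se_4n = pop_sd / np.sqrt(4 * n)
```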
2
Q

Which formula calculates the minimum sample size needed and/or the width of your confidence interval?

A

Width of a 95% confidence interval ≈ 4 × SD / √(sample size), i.e. 2 SDs of the sample mean on each side.

To find the minimum sample size for a desired width, solve for n: n = (4 × SD / desired width)².
3
Q

What is z?

A

z is the number of SDs away from the mean needed for your confidence level.

4
Q

An example of how confidence intervals and SD’s connect

A

Since 95% of all sample means fall within 2 SEs of the true mean, we can flip it around and say:
“There’s a 95% chance the true mean is within 2 SEs of my sample mean.”

5
Q

What is Chebyshev’s inequality?

A

States that for any data set, no matter its shape, at least 1 − 1/z² of the values lie within z SDs of the mean. Used for worst-case bounds.

It’s more conservative than the normal curve and not used for precise confidence intervals.

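A quick empirical check of the bound, using deliberately non-normal (exponential) data as a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)  # deliberately non-normal data

mean, sd = data.mean(), data.std()
z = 2
within = np.mean(np.abs(data - mean) < z * sd)

# Chebyshev: at least 1 - 1/z^2 of ANY data lies within z SDs of the mean
bound = 1 - 1 / z**2  # 0.75 for z = 2
```

For this skewed data, the actual proportion within 2 SDs is well above 0.75, which is why Chebyshev is called conservative.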
6
Q

What is the relationship between confidence interval and p-value?

A

If the null value is outside the confidence interval, the p-value is smaller than the cutoff — so you reject the null.

This only holds if the confidence level matches the p-value cutoff (e.g., a 95% CI with a 5% cutoff). If not, you can’t draw a conclusion from the interval alone.

7
Q

Matching confidence levels and p-value cutoffs

A

90% CI ↔ 10% cutoff (α = 0.10)
95% CI ↔ 5% cutoff (α = 0.05)
99% CI ↔ 1% cutoff (α = 0.01)

8
Q

Confidence Interval equation

A

Confidence interval = sample mean ± (number of SDs × SD of the sample mean)

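A sketch of the equation on simulated data (the sample and the choice of z = 2 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=400)  # made-up sample of 400 values

sample_mean = sample.mean()
# SD of the sample mean, estimated from the sample itself
se = sample.std() / np.sqrt(len(sample))

z = 2  # about 95% confidence
lower = sample_mean - z * se
upper = sample_mean + z * se
```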
9
Q

What is the probability that ‘at least one’ of x occurs?

A

P(at least one) = 1 − P(none)

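A worked example (the chance of at least one six in four rolls of a fair die):

```python
# P(at least one) = 1 - P(none)
# Example: chance of at least one six in four rolls of a fair die
p_none = (5 / 6) ** 4
p_at_least_one = 1 - p_none  # about 0.518
```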
10
Q

Correlation coefficient

A

r. Only measures linear association.

  1. convert both variables to standard units
  2. multiply each pair of values
  3. take the mean of the products
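The three steps can be sketched as functions (the x and y arrays are made up; y is perfectly linear in x, so r comes out to 1):

```python
import numpy as np

def standard_units(a):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (a - a.mean()) / a.std()

def correlation(x, y):
    """r: the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # perfectly linear, so r = 1
```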
11
Q

Slope of regression line

A

Only valid to calculate regression line when doing simple linear regression.

r * (SD of y / SD of x)

12
Q

Intercept of regression line

A

Only valid to calculate regression line when doing simple linear regression

avg. of y - slope * avg. of x

13
Q

estimate of y/fitted value

A

slope * x + intercept (or, in standard units: r * x)

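The slope, intercept, and fitted-value formulas from the last three cards can be sketched together (the data here is made up so that y = 2x + 1 exactly):

```python
import numpy as np

def standard_units(a):
    return (a - a.mean()) / a.std()

def correlation(x, y):
    return np.mean(standard_units(x) * standard_units(y))

def slope(x, y):
    # r * (SD of y / SD of x)
    return correlation(x, y) * y.std() / x.std()

def intercept(x, y):
    # avg. of y - slope * avg. of x
    return y.mean() - slope(x, y) * x.mean()

def fitted_value(x, y, new_x):
    # estimate of y: slope * x + intercept
    return slope(x, y) * new_x + intercept(x, y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 2x + 1
```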
14
Q

RMSE (Root Mean Squared Error)

A
  1. square each error (actual y - predicted y)
  2. take avg. (MSE)
  3. take square root (so units match orig. values)
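The three steps as a function (the actual and predicted values are made up):

```python
import numpy as np

def rmse(actual, predicted):
    errors = actual - predicted       # each error: actual y - predicted y
    mse = np.mean(errors ** 2)        # square, then average
    return np.sqrt(mse)               # root, so units match the original values

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])  # made-up predictions
```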
15
Q

minimize

A

The minimize function returns an array consisting of the slope & intercept that minimize the RMSE/MSE. We use this to calculate the regression line when we have a non-linear relationship or more than 1 x-variable.

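A sketch using scipy.optimize.minimize in place of the datascience library's minimize wrapper (the data is made up, roughly y = 2x + 1):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.0, 8.9, 11.1])  # made-up data, roughly y = 2x + 1

def mse(params):
    """Mean squared error of the line described by params = [slope, intercept]."""
    a, b = params
    return np.mean((y - (a * x + b)) ** 2)

# The search returns the [slope, intercept] pair that minimizes the MSE
result = minimize(mse, x0=[0.0, 0.0])  # start the search at slope 0, intercept 0
best_slope, best_intercept = result.x
```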
16
Q

Residuals vs. Errors

A

Errors: true y (for the population) − predicted y. We cannot calculate these, since the true y is unknown.

Residuals: actual y (from the data) − predicted y (from the regression line).

17
Q

Residuals plot

A

Plots the residuals against the predictor variable. The residual plot of a good regression shows no pattern; if there is a pattern, there may be a non-linear relationship between the variables. This is because residuals and the predictor are uncorrelated.

Avg. of the residuals is always 0; the regression line is balanced (overestimates and underestimates cancel out).

18
Q

Heteroscedasticity

A

Occurs when your regression predictions are less reliable for certain x-values. The spread/variance of residual changes as x increases (your errors are bigger or smaller depending on where you are on the x-axis).

19
Q

Signal line and noise

A

We assume there is a perfect straight-line relationship (the signal line) underneath all the randomness (noise).

20
Q

How do we know if our scatterplot/correlation from 1 random sample is representative of the true population?

A
  1. Bootstrap the scatter plot: resample its rows at random with replacement, as many draws as the original sample size; repeat many times, computing the slope each time
  2. Collect all the slopes and draw an empirical histogram
  3. Construct a 95% confidence interval
21
Q

How do we know if the slope of the true linear relation is 0?

A

If the interval contains 0, we cannot reject the null hypothesis that the slope of the linear relation is 0.

22
Q

Steps for Bootstrapping a Regression Confidence Interval

A
  1. resample table using sample(). This creates a new bootstrap sample of the data
  2. extract the x and y columns
  3. compute the correlation between x and y
  4. compute slope and intercept
  5. use regression line to predict y for a given x
  6. repeat above steps 10,000 times, storing predictions
  7. if using 95% CI, take 2.5th and 97.5th percentiles
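The steps above can be sketched in NumPy, resampling row indices in place of a table's sample() method (the data is simulated with a true slope of 2, and 1,000 repetitions stand in for 10,000 to keep it fast):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)  # simulated data, true slope 2

def slope(x, y):
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    r = np.mean(x_su * y_su)
    return r * y.std() / x.std()

boot_slopes = []
for _ in range(1_000):  # step 6 says 10,000; fewer here for speed
    rows = rng.integers(0, len(x), size=len(x))  # resample rows with replacement
    boot_slopes.append(slope(x[rows], y[rows]))

# step 7: middle 95% of the bootstrap distribution
lower, upper = np.percentile(boot_slopes, [2.5, 97.5])
```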
23
Q

Total Variation Difference (TVD)

A

Use when you’re comparing two categorical distributions, like bar charts of proportions (e.g., gender, color, vote choice). It measures how far apart the two distributions are — 0 means identical, 1 means completely different. TVD is used with permutation tests when the variable is categorical.

  1. Calculate the proportion of each category in group 1 and group 2
  2. Subtract those proportions for each category
  3. Take the absolute value of each difference (so it’s all positive)
  4. Add them up
  5. Multiply by ½
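The five steps reduce to one line once each group's category proportions are in an array (the proportions here are made up):

```python
import numpy as np

def tvd(dist1, dist2):
    """Total variation distance between two categorical distributions,
    each given as an array of category proportions summing to 1."""
    return np.sum(np.abs(dist1 - dist2)) / 2

group1 = np.array([0.5, 0.3, 0.2])  # made-up category proportions
group2 = np.array([0.4, 0.4, 0.2])
```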
24
Q

P-Value Formula

A

p-value for a right-tailed test: P(test statistic ≥ observed)

p-value for a left-tailed test: P(test statistic ≤ observed)

p-value for a two-sided test: P(|test statistic| ≥ |observed|)
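With an array of simulated test statistics under the null, each formula is one comparison (both the simulation and the observed value of 1.8 are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
simulated = rng.normal(0, 1, size=10_000)  # test statistics under the null
observed = 1.8                             # made-up observed statistic

p_right = np.mean(simulated >= observed)
p_left = np.mean(simulated <= observed)
p_two_sided = np.mean(np.abs(simulated) >= abs(observed))
```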

25
Q

Overlaid Histograms

A

Use when: You want to compare distributions of a numerical variable across two or more categories (e.g., Prior Goals for wins vs. losses)

26
Q

Scatterplot

A

Use when: You're examining the relationship between two numerical variables

27
Q

Line Graph

A

Use when: You're tracking a numerical variable over time

28
Q

Bar Chart

A

Use when: You're showing the frequency or proportion of categorical values
29
Q

Steps of the bootstrap method for generating another random sample that resembles the population

A

  1. treat the original sample as if it were the population
  2. draw from the sample, at random with replacement, as many times as the original sample size
30
Q

Left-tail percentile when bootstrapping a confidence interval

A

(100 − confidence level) / 2, e.g. the 2.5th percentile of the bootstrap statistics for a 95% CI

31
Q

Right-tail percentile when bootstrapping a confidence interval

A

100 − (100 − confidence level) / 2, e.g. the 97.5th percentile of the bootstrap statistics for a 95% CI
32
Q

kNN regression

A

  1. Measure the distance between a new data point and every point in the training set, using the same attributes (Euclidean distance)
  2. Select the k closest (nearest) neighbors
  3. Average the target values (e.g., prices) of those k neighbors to make the prediction
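A minimal sketch of kNN regression (the one-attribute training set and its prices are made up):

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(train_X, train_y, new_point, k):
    """Average the target values of the k nearest training points."""
    dists = np.array([euclidean(row, new_point) for row in train_X])
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

# Made-up training set: one attribute (say, square footage) -> price
train_X = np.array([[1.0], [2.0], [3.0], [10.0]])
train_y = np.array([100.0, 110.0, 120.0, 400.0])
```

For a new point at 2.1, the two nearest neighbors are 2.0 and 3.0, so the prediction is the average of their prices, 115.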
33
Q

Euclidean distance

A

The straight-line distance between two points: the square root of the sum of the squared differences between their coordinates, √((a₁ − b₁)² + (a₂ − b₂)² + …).