Data Science Final Flashcards
(33 cards)
SD of the sample mean
The SD of the sample mean measures how much, on avg, a sample mean is expected to vary from the true pop mean, i.e. how accurate our estimate of the mean is for a given sample size.
SD of sample mean = population SD / square root of sample size
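A minimal numeric sketch of this formula in Python (the values are hypothetical):

import numpy as np

pop_sd = 15                          # hypothetical population SD
sample_size = 100                    # hypothetical sample size
se = pop_sd / np.sqrt(sample_size)   # SD of the sample mean = 1.5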
Which formula calculates the minimum sample size needed and/or the width of your confidence interval?
Width of CI = 2 × z × (population SD / √(sample size)). Solving for the sample size gives the minimum needed: sample size ≥ (2 × z × population SD / desired width)².
What is z?
z is the number of SDs away from the mean needed for your confidence level (e.g., z ≈ 1.96 for 95% confidence).
An example of how confidence intervals and SD’s connect
Since 95% of all sample means fall within 2 SEs of the true mean, we can flip it around and say:
“There’s a 95% chance the true mean is within 2 SEs of my sample mean.”
What is Chebyshev’s inequality?
States that at least 1 − 1/z² of the data lies within z SDs of the mean, no matter the shape of the distribution. Used for worst-case bounds: it’s more conservative than the normal curve and not used for precise confidence intervals.
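A quick numeric check in Python, using deliberately non-normal (skewed) data:

import numpy as np

data = np.random.exponential(size=10_000)       # skewed, non-normal data
mean, sd = np.mean(data), np.std(data)
z = 2
within = np.mean(np.abs(data - mean) < z * sd)  # observed fraction within 2 SDs
bound = 1 - 1 / z**2                            # Chebyshev guarantees at least 0.75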
What is the relationship between confidence interval and p-value?
If the null value is outside the confidence interval, the p-value is smaller than the cutoff, so you reject the null.
But this only holds if the confidence level matches the p-value cutoff. If not, you can’t make a conclusion just from the interval.
Matching confidence levels and p-value cutoffs
90% CI ↔ 10% cutoff (α = 0.10)
95% CI ↔ 5% cutoff (α = 0.05)
99% CI ↔ 1% cutoff (α = 0.01)
Confidence Interval equation
Confidence Interval = Sample Mean ± (Number of SDs × SD of the Sample Mean)
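A minimal sketch of the equation in Python, assuming a hypothetical sample and a 95% confidence level:

import numpy as np

sample = np.random.normal(50, 10, 400)        # hypothetical sample
sample_mean = np.mean(sample)
se = np.std(sample) / np.sqrt(len(sample))    # SD of the sample mean
z = 1.96                                      # number of SDs for 95% confidence
ci = (sample_mean - z * se, sample_mean + z * se)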
What is the probability that ‘at least one’ of x occurs?
P(at least one) = 1 − P(none)
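Worked example: P(at least one six in 4 die rolls) = 1 − P(no sixes) = 1 − (5/6)⁴ ≈ 0.52.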
Correlation coefficient
r. Only measures linear association. To compute it (see the sketch below):
- convert 2 variables to standard units
- multiply
- take the mean
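A minimal sketch of those steps in Python (toy data):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])                       # toy data
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))  # convert, multiply, take the mean -> 0.9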
Slope of regression line
Only valid for simple linear regression (a linear relationship with one x-variable).
slope = r * (SD of y / SD of x)
Intercept of regression line
Only valid for simple linear regression (a linear relationship with one x-variable).
intercept = avg. of y - slope * avg. of x
Estimate of y / fitted value
slope * x + intercept. (In standard units, the slope is r and the intercept is 0, so the estimate is simply r × x.)
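The last three cards' formulas, sketched together in Python (same toy data as above):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])                # toy data
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))

slope = r * np.std(y) / np.std(x)            # r * (SD of y / SD of x) -> 0.9
intercept = np.mean(y) - slope * np.mean(x)  # avg. of y - slope * avg. of x -> 1.3
fitted = slope * x + intercept               # estimates of y in original units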
RMSE (Root Mean Squared Error)
- square each error (actual y - predicted y)
- take avg. (MSE)
- take square root (so units match orig. values)
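Those steps in Python (using the toy data and fitted values from the sketch above):

import numpy as np

actual = np.array([2, 3, 5, 4, 6])                # toy y values
predicted = np.array([2.2, 3.1, 4.0, 4.9, 5.8])   # fitted values from the regression line
errors = actual - predicted
mse = np.mean(errors ** 2)                        # square each error, take the average
rmse = np.sqrt(mse)                               # square root, so units match the original values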
minimize
The minimize function takes a function that computes the RMSE/MSE from a slope and intercept, and returns an array consisting of the slope & intercept that minimize it. We use this to calculate the regression line when we have a non-linear relationship or more than 1 x-variable.
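A minimal sketch, assuming the Data 8 datascience library's minimize and the same toy data:

import numpy as np
from datascience import minimize

x = np.array([1, 2, 3, 4, 5])   # toy data
y = np.array([2, 3, 5, 4, 6])

def rmse(slope, intercept):
    predicted = slope * x + intercept
    return np.sqrt(np.mean((y - predicted) ** 2))

best = minimize(rmse)   # array([best_slope, best_intercept]) -> approx. [0.9, 1.3]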
Residuals vs. Errors
Errors: true y (for the pop) - predicted y. We cannot calculate these since the true y is unknown.
Residuals: actual y (from data) - predicted y (from regression line).
Residuals plot
Plots the residuals against the predictor variable. The residual plot of a good regression shows no pattern - if there is a pattern, there may be a non-linear relationship between variables. This is because residuals and the predictor are uncorrelated.
Avg. of residuals is always 0; the regression line is balanced (overestimates and underestimates cancel out).
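A quick numeric check of both facts in Python (toy data, least-squares line as in the earlier sketch):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

residuals = y - (slope * x + intercept)
np.mean(residuals)                                       # ~0: the line is balanced
np.mean(standard_units(x) * standard_units(residuals))   # ~0: residuals uncorrelated with x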
Heteroscedasticity
Occurs when your regression predictions are less reliable for certain x-values. The spread/variance of the residuals changes as x increases (your errors are bigger or smaller depending on where you are on the x-axis).
Signal line and noise
We assume there is a perfect straight-line relationship (the signal line) underneath all the randomness (noise).
How do we know if our scatterplot/correlation from 1 random sample is representative of the true population?
- Resample the scatter plot's rows with replacement, each resample the same size as the original sample; repeat many times (e.g., 10,000)
- Collect all slopes and draw an empirical histogram
- Construct a 95% confidence interval
How do we know if the slope of the true linear relation is 0?
If the interval contains 0, we cannot reject the null hypothesis that the slope of the true linear relation is 0. If the interval does not contain 0, we reject the null and conclude the true slope is not 0.
Steps for Bootstrapping a Regression Confidence Interval
- resample the table using sample(), which draws rows with replacement (the default) at the original sample size. This creates a new bootstrap sample of the data
- extract the x and y columns
- compute the correlation between x and y
- compute slope and intercept
- use regression line to predict y for a given x
- repeat above steps 10,000 times, storing predictions
- if using 95% CI, take 2.5th and 97.5th percentiles
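A minimal sketch of these steps, assuming the datascience library's Table with hypothetical columns 'x' and 'y':

import numpy as np
from datascience import Table

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

# Hypothetical original sample
tbl = Table().with_columns(
    'x', np.arange(50),
    'y', 2 * np.arange(50) + np.random.normal(0, 5, 50))

def predict_at(t, x_new):
    x, y = t.column('x'), t.column('y')
    r = np.mean(standard_units(x) * standard_units(y))
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope * x_new + intercept

# sample() resamples the rows with replacement (a bootstrap sample)
predictions = np.array([predict_at(tbl.sample(), 30) for _ in range(10_000)])
ci = np.percentile(predictions, [2.5, 97.5])   # middle 95% of the predictions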
Total Variation Difference (TVD)
Use when you’re comparing two categorical distributions, like bar charts of proportions (e.g., gender, color, vote choice). It measures how far apart the two distributions are — 0 means identical, 1 means completely different. TVD is used with permutation tests when the variable is categorical.
- Calculate the proportion of each category in group 1 and group 2
- Subtract those proportions for each category
- Take the absolute value of each difference (so it’s all positive)
- Add them up
- Multiply by ½
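These steps in Python (hypothetical proportions over the same three categories):

import numpy as np

group1 = np.array([0.5, 0.3, 0.2])            # hypothetical category proportions
group2 = np.array([0.4, 0.4, 0.2])
tvd = 0.5 * np.sum(np.abs(group1 - group2))   # subtract, absolute value, add, halve -> 0.1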
P-Value Formula
p-value for right-tailed test: P(test statistic ≥ observed)
p-value for left-tailed test: P(test statistic ≤ observed)
p-value for two-sided test: P(|test statistic| ≥ |observed|)
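In a simulation-based test these become counts over the simulated statistics; a minimal sketch with hypothetical values:

import numpy as np

simulated = np.random.normal(0, 1, 10_000)   # hypothetical simulated test statistics
observed = 1.8                               # hypothetical observed statistic

p_right = np.count_nonzero(simulated >= observed) / len(simulated)
p_left = np.count_nonzero(simulated <= observed) / len(simulated)
p_two = np.count_nonzero(np.abs(simulated) >= abs(observed)) / len(simulated)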