Data Science Final Flashcards
(33 cards)
SD of the sample mean
The SD of the sample mean measures how much, on avg, a sample mean is expected to vary from the true pop mean, i.e. how accurate our estimate of the mean is for a given sample size.
SD of sample mean = population SD / square root of sample size
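A minimal numeric sketch of this formula in Python (the values are hypothetical):

import numpy as np

pop_sd = 15                          # hypothetical population SD
sample_size = 100                    # hypothetical sample size
se = pop_sd / np.sqrt(sample_size)   # SD of the sample mean = 1.5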
Which formula calculates the minimum sample size needed and/or the width of your confidence interval?
Width of CI = 2 × z × (population SD / √(sample size)). Solving for the sample size gives the minimum needed: sample size ≥ (2 × z × population SD / desired width)².
What is z?
z is the number of SDs away from the mean needed for your confidence level (e.g., z ≈ 1.96 for 95% confidence).
An example of how confidence intervals and SD’s connect
Since 95% of all sample means fall within 2 SEs of the true mean, we can flip it around and say:
“There’s a 95% chance the true mean is within 2 SEs of my sample mean.”
What is Chebyshev’s inequality?
States that at least 1 − 1/z² of the data lies within z SDs of the mean, no matter the shape of the distribution. Used for worst-case bounds: it’s more conservative than the normal curve and not used for precise confidence intervals.
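A quick numeric check in Python, using deliberately non-normal (skewed) data:

import numpy as np

data = np.random.exponential(size=10_000)       # skewed, non-normal data
mean, sd = np.mean(data), np.std(data)
z = 2
within = np.mean(np.abs(data - mean) < z * sd)  # observed fraction within 2 SDs
bound = 1 - 1 / z**2                            # Chebyshev guarantees at least 0.75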
What is the relationship between confidence interval and p-value?
If the null value is outside the confidence interval, the p-value is smaller than the cutoff, so you reject the null.
But this only holds if the confidence level matches the p-value cutoff. If not, you can’t make a conclusion just from the interval.
Matching confidence levels and p-value cutoffs
90% CI ↔ 10% cutoff (α = 0.10)
95% CI ↔ 5% cutoff (α = 0.05)
99% CI ↔ 1% cutoff (α = 0.01)
Confidence Interval equation
Confidence Interval = Sample Mean ± (Number of SDs × SD of the Sample Mean)
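A minimal sketch of the equation in Python, assuming a hypothetical sample and a 95% confidence level:

import numpy as np

sample = np.random.normal(50, 10, 400)        # hypothetical sample
sample_mean = np.mean(sample)
se = np.std(sample) / np.sqrt(len(sample))    # SD of the sample mean
z = 1.96                                      # number of SDs for 95% confidence
ci = (sample_mean - z * se, sample_mean + z * se)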
What is the probability that ‘at least one’ of x occurs?
P(at least one) = 1 − P(none)
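Worked example: P(at least one six in 4 die rolls) = 1 − P(no sixes) = 1 − (5/6)⁴ ≈ 0.52.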
Correlation coefficient
r. Only measures linear association. To compute it (see the sketch below):
- convert 2 variables to standard units
- multiply
- take the mean
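A minimal sketch of those steps in Python (toy data):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])                       # toy data
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))  # convert, multiply, take the mean -> 0.9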
Slope of regression line
Only valid for simple linear regression (a linear relationship with one x-variable).
slope = r * (SD of y / SD of x)
Intercept of regression line
Only valid for simple linear regression (a linear relationship with one x-variable).
intercept = avg. of y - slope * avg. of x
Estimate of y / fitted value
slope * x + intercept. (In standard units, the slope is r and the intercept is 0, so the estimate is simply r × x.)
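The last three cards' formulas, sketched together in Python (same toy data as above):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])                # toy data
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))

slope = r * np.std(y) / np.std(x)            # r * (SD of y / SD of x) -> 0.9
intercept = np.mean(y) - slope * np.mean(x)  # avg. of y - slope * avg. of x -> 1.3
fitted = slope * x + intercept               # estimates of y in original units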
RMSE (Root Mean Squared Error)
- square each error (actual y - predicted y)
- take avg. (MSE)
- take square root (so units match orig. values)
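Those steps in Python (using the toy data and fitted values from the sketch above):

import numpy as np

actual = np.array([2, 3, 5, 4, 6])                # toy y values
predicted = np.array([2.2, 3.1, 4.0, 4.9, 5.8])   # fitted values from the regression line
errors = actual - predicted
mse = np.mean(errors ** 2)                        # square each error, take the average
rmse = np.sqrt(mse)                               # square root, so units match the original values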
minimize
The minimize function takes a function that computes the RMSE/MSE from a slope and intercept, and returns an array consisting of the slope & intercept that minimize it. We use this to calculate the regression line when we have a non-linear relationship or more than 1 x-variable.
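A minimal sketch, assuming the Data 8 datascience library's minimize and the same toy data:

import numpy as np
from datascience import minimize

x = np.array([1, 2, 3, 4, 5])   # toy data
y = np.array([2, 3, 5, 4, 6])

def rmse(slope, intercept):
    predicted = slope * x + intercept
    return np.sqrt(np.mean((y - predicted) ** 2))

best = minimize(rmse)   # array([best_slope, best_intercept]) -> approx. [0.9, 1.3]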
Residuals vs. Errors
Errors: true y (for the pop) - predicted y. We cannot calculate these since the true y is unknown.
Residuals: actual y (from data) - predicted y (from regression line).
Residuals plot
Plots the residuals against the predictor variable. The residual plot of a good regression shows no pattern - if there is a pattern, there may be a non-linear relationship between variables. This is because residuals and the predictor are uncorrelated.
Avg. of residuals is always 0; the regression line is balanced (overestimates and underestimates cancel out).
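A quick numeric check of both facts in Python (toy data, least-squares line as in the earlier sketch):

import numpy as np

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 4, 6])
r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

residuals = y - (slope * x + intercept)
np.mean(residuals)                                       # ~0: the line is balanced
np.mean(standard_units(x) * standard_units(residuals))   # ~0: residuals uncorrelated with x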
Heteroscedasticity
Occurs when your regression predictions are less reliable for certain x-values. The spread/variance of the residuals changes as x increases (your errors are bigger or smaller depending on where you are on the x-axis).
Signal line and noise
We assume there is a perfect straight-line relationship (the signal line) underneath all the randomness (noise).
How do we know if our scatterplot/correlation from 1 random sample is representative of the true population?
- Resample the scatter plot's rows with replacement, each resample the same size as the original sample; repeat many times (e.g., 10,000)
- Collect all slopes and draw an empirical histogram
- Construct a 95% confidence interval
How do we know if the slope of the true linear relation is 0?
If the interval contains 0, we cannot reject the null hypothesis that the slope of the true linear relation is 0. If the interval does not contain 0, we reject the null and conclude the true slope is not 0.
Steps for Bootstrapping a Regression Confidence Interval
- resample the table using sample(), which draws rows with replacement (the default) at the original sample size. This creates a new bootstrap sample of the data
- extract the x and y columns
- compute the correlation between x and y
- compute slope and intercept
- use regression line to predict y for a given x
- repeat above steps 10,000 times, storing predictions
- if using 95% CI, take 2.5th and 97.5th percentiles
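A minimal sketch of these steps, assuming the datascience library's Table with hypothetical columns 'x' and 'y':

import numpy as np
from datascience import Table

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

# Hypothetical original sample
tbl = Table().with_columns(
    'x', np.arange(50),
    'y', 2 * np.arange(50) + np.random.normal(0, 5, 50))

def predict_at(t, x_new):
    x, y = t.column('x'), t.column('y')
    r = np.mean(standard_units(x) * standard_units(y))
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope * x_new + intercept

# sample() resamples the rows with replacement (a bootstrap sample)
predictions = np.array([predict_at(tbl.sample(), 30) for _ in range(10_000)])
ci = np.percentile(predictions, [2.5, 97.5])   # middle 95% of the predictions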
Total Variation Difference (TVD)
Use when you’re comparing two categorical distributions, like bar charts of proportions (e.g., gender, color, vote choice). It measures how far apart the two distributions are — 0 means identical, 1 means completely different. TVD is used with permutation tests when the variable is categorical.
- Calculate the proportion of each category in group 1 and group 2
- Subtract those proportions for each category
- Take the absolute value of each difference (so it’s all positive)
- Add them up
- Multiply by ½
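These steps in Python (hypothetical proportions over the same three categories):

import numpy as np

group1 = np.array([0.5, 0.3, 0.2])            # hypothetical category proportions
group2 = np.array([0.4, 0.4, 0.2])
tvd = 0.5 * np.sum(np.abs(group1 - group2))   # subtract, absolute value, add, halve -> 0.1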
P-Value Formula
p-value for right-tailed test: P(test statistic ≥ observed)
p-value for left-tailed test: P(test statistic ≤ observed)
p-value for two-sided test: P(|test statistic| ≥ |observed|)
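In a simulation-based test these become counts over the simulated statistics; a minimal sketch with hypothetical values:

import numpy as np

simulated = np.random.normal(0, 1, 10_000)   # hypothetical simulated test statistics
observed = 1.8                               # hypothetical observed statistic

p_right = np.count_nonzero(simulated >= observed) / len(simulated)
p_left = np.count_nonzero(simulated <= observed) / len(simulated)
p_two = np.count_nonzero(np.abs(simulated) >= abs(observed)) / len(simulated)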