Data Science Statistics Flashcards
Learned at WGU (181 cards)
Q: What does the Central Limit Theorem state?
A: Regardless of the population, the distribution of the sample means will approximate a normal distribution as the sample size increases.
Take repeated samples, calculate their means → plot those means → the result approaches a bell curve.
Python for Central Limit Theorem
import numpy as np
import matplotlib.pyplot as plt
# 1,000 means of samples of size 50 drawn from a skewed (exponential) distribution
samples = [np.mean(np.random.exponential(size=50)) for _ in range(1000)]
plt.hist(samples, bins=30, edgecolor="black")  # the means form a near-normal bell curve
plt.title("Sample Means - CLT in Action")
plt.show()
Q: What is the Probability Density Function (PDF)?
A: The Probability Density Function gives the relative likelihood of the variable taking a value at an exact point, expressed per unit of x (a density, not a probability).
Python for Probability Density Function (PDF)
from scipy.stats import norm
x = np.linspace(-4, 4, 200)  # points to evaluate (np, plt imported earlier)
pdf = norm.pdf(x, loc=0, scale=1)  # standard normal density at each x
plt.plot(x, pdf)
Q: What does the CDF tell us?
A: The Cumulative Distribution Function gives you the total probability that a variable is less than or equal to a certain value.
CDF(x=0) tells you “What’s the probability that X is less than or equal to 0?”
Area under the PDF curve up to value x.
Python for Cumulative Distribution Function (CDF)
from scipy.stats import norm
norm.cdf(1.96, loc=0, scale=1) # ≈ 0.975
loc is the mean of the normal distribution
scale is the standard deviation
Q: What does the Inverse CDF (or PPF) do?
A: Use the inverse CDF to determine the value of the variable associated with a specific probability. x = InvCDF(P)
Python for Inverse CDF
from scipy.stats import norm
norm.ppf(0.975, loc=0, scale=1) # ≈ 1.96
Q: What is a confidence interval?
A: A range of values within which we expect a population parameter to fall with a certain level of confidence.
Python for Confidence Interval
import numpy as np
import scipy.stats as stats

mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the mean
# 95% CI for the population mean: df = n - 1 (degrees of freedom),
# loc=mean centers the interval, scale=sem sets its width
ci = stats.t.interval(0.95, df=len(sample)-1, loc=mean, scale=sem)
Q: What does the p-value measure?
A: The probability of observing your data (or more extreme) if the null hypothesis is true.
By Hand:
Use z or t-tables based on the test statistic.
Python for P-Value
from scipy.stats import ttest_1samp
ttest_1samp(sample, popmean=52) # Returns t-statistic and p-value
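The same p-value can also be computed "by hand" from the t statistic using the t-distribution's CDF (a sketch; sample and popmean=52 are carried over from the example above):
from scipy.stats import t, ttest_1samp
t_stat, p = ttest_1samp(sample, popmean=52)
df = len(sample) - 1
p_by_hand = 2 * (1 - t.cdf(abs(t_stat), df=df))  # two-tailed: double the upper-tail area
# p_by_hand matches the p-value returned by ttest_1samp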
Q: What’s the difference between a one-tailed and two-tailed test?
A: One-Tailed: Tests for an effect in one direction (e.g., greater than).
Two-Tailed: Tests for an effect in either direction.
Why Two-Tailed is Harder:
Because the alpha level (e.g., 0.05) is split between two tails (0.025 each), making it harder to reject the null.
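A minimal sketch of why, using the standard normal CDF (the test statistic z = 1.8 is made up for illustration):
from scipy.stats import norm
z = 1.8  # hypothetical test statistic
p_one_tailed = 1 - norm.cdf(z)             # upper tail only, ≈ 0.036
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # both tails, ≈ 0.072
# At alpha = 0.05: the one-tailed test rejects the null, the two-tailed test does not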
Q: When is the t-distribution used instead of the normal distribution?
A: When the sample size is small (roughly n < 30)
When population standard deviation is unknown
Python for t-distribution
from scipy.stats import t
# Get the critical t-value
t.ppf(0.975, df=29) # For 95% confidence with df = 29
Q: What does it mean to partition your data?
A: It means splitting your dataset into training data (to teach your model) and test data (to check how well it learned).
Q: What does train_test_split() do in Python?
A: It randomly divides your dataset into training and testing sets so you can build and evaluate your model.
from sklearn.model_selection import train_test_split
# Default split: 75% of rows go to training, 25% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
Q: Why should you validate your partitioned data?
A: To make sure both training and testing sets have a similar distribution (e.g., same class balance) — no surprises!
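A sketch of the check (assumes y_train and y_test are pandas Series from train_test_split):
# Class proportions should look similar in both sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
# To guarantee matching class balance, pass stratify=y to the split:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)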
Q: What is data imbalance?
A: When one class (like “Yes”) appears way less often than another (like “No”). This can confuse your model into always guessing the majority class.
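A quick way to spot it (assumes y is a pandas Series of class labels):
print(y.value_counts(normalize=True))
# e.g., No: 0.95, Yes: 0.05 → always guessing "No" scores 95% accuracy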
Q: How can you balance imbalanced data?
A: Use techniques like the following (see the sketch after this list):
Oversampling (make more of the rare class)
Undersampling (remove from the common class)
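A minimal sketch of both with sklearn.utils.resample (assumes a DataFrame df with a binary "label" column; imblearn's RandomOverSampler/RandomUnderSampler are common alternatives):
from sklearn.utils import resample
import pandas as pd

majority = df[df["label"] == "No"]
minority = df[df["label"] == "Yes"]

# Oversampling: draw from the rare class with replacement until sizes match
minority_up = resample(minority, replace=True, n_samples=len(majority))
balanced_over = pd.concat([majority, minority_up])

# Undersampling: randomly drop rows from the common class to match the rare class
majority_down = resample(majority, replace=False, n_samples=len(minority))
balanced_under = pd.concat([majority_down, minority])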
Q: What is a baseline model?
A: A simple model that always guesses the most common class. You use it to see if your actual model is better than “just guessing.”
Q: How do you create a baseline model in Python?
A: Use DummyClassifier to always guess the most frequent class.
from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy="most_frequent")
Q: What’s the goal of a baseline model?
A: To set a performance floor — if your real model doesn’t beat the baseline, it’s not useful.
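Putting the floor into practice (a sketch; assumes the train/test split from earlier and a fitted model named real_model, which is hypothetical):
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:", real_model.score(X_test, y_test))
# If the real model can't beat the baseline, it isn't adding value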
Q: What is pd.crosstab() in Python?
A: pd.crosstab() is a pandas function that creates a cross-tabulation table (also called a contingency table). It shows the frequency distribution of two (or more) categorical variables.
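A minimal example (the data is made up):
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "purchased": ["Yes", "No", "Yes", "Yes", "No"],
})
print(pd.crosstab(df["gender"], df["purchased"]))
# purchased  No  Yes
# gender
# F           1    2
# M           1    1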