4. Foundations of Statistical Inference Flashcards
(23 cards)
def statistical inference
the process of using data from a sample to make estimates or decisions about a population (the full group you care about).
Imagine you want to know what all voters in a country think about a policy. Instead of asking all 50 million people, you survey just 1,500. Statistical inference helps you draw conclusions from those 1,500 responses, and estimate what the whole population thinks.
def population; sample; population parameter; sample statistics
-pop= the full set of people or things you’re interested in (e.g., all voters).
-sample= smaller group drawn from the population (e.g., 1,500 surveyed voters).
-pop parameter= a true but unknown value (e.g., % of all voters who support a policy).
-sample statistic= an estimate based on the sample (e.g., % of surveyed voters who support it).
def random sampling
A random sample means every member of the population had an equal chance of being selected.
This minimizes bias and makes your estimates more accurate.
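A minimal sketch of drawing a simple random sample with NumPy (the population and sample sizes are illustrative, not from the card):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 100,000 voter IDs (illustrative size).
population = np.arange(100_000)

# Simple random sample: every member has an equal chance of selection,
# drawn without replacement.
sample = rng.choice(population, size=1_500, replace=False)
print(sample[:10])
```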
what is the Central Limit Theorem (CLT)
States that the distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population’s distribution.
Makes the normal curve a key tool for inference.
This is why surveys with 1,000–1,500 people can be reliable, as long as the sample is random (a relatively small sample can be enough)
=> No matter what the population distribution looks like, the distribution of the sample mean will approach a normal distribution as the sample size increases (usually 𝑛≥30 is “large enough”).
This applies to:
Sample means
Proportions
Differences in means or proportions
Why it matters:
It allows us to use the normal distribution for inference (using data from a sample to make educated guesses or conclusions about a larger population) even when the population is not normally distributed.
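A quick simulation of the CLT (the sample size and the exponential "population" are illustrative choices, not from the card): sample means from a clearly non-normal distribution still cluster in a roughly normal way around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50          # size of each sample (illustrative)
reps = 10_000   # number of repeated samples

# Skewed "population": an exponential distribution with mean 2.0.
sample_means = np.array([rng.exponential(scale=2.0, size=n).mean()
                         for _ in range(reps)])

# CLT: the means centre on the population mean (2.0) and their spread
# is close to sigma / sqrt(n) = 2.0 / sqrt(50) ≈ 0.28.
print(sample_means.mean())
print(sample_means.std(ddof=1))
```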
def random sampling error and standard error
*Random sampling error: the natural difference between the sample statistic and the actual population parameter.
*Standard error (SE): the average size of that error across many samples.
-Smaller SE → more precise estimate.
-SE gets smaller as the sample size increases.
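A small sketch of how SE = σ / √n shrinks with sample size (σ = 10 is an assumed value):

```python
import numpy as np

sigma = 10.0                      # assumed population SD
for n in (25, 100, 400, 1_600):
    se = sigma / np.sqrt(n)       # standard error of the sample mean
    print(f"n={n:>5}  SE={se:.2f}")
# SE halves each time the sample size quadruples.
```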
what is the CI
Confidance interval: abt precision and certainty (95%-> see seminar CI (2))
gives a range in which we expect the true population parameter to lie.
For example, if 60% of a sample supports a policy, the 95% CI might be [57%, 63%].
This means we are 95% confident the true percentage of supporters in the whole population is between 57% and 63%.
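A hedged simulation of what "95% confident" means, assuming a true support rate of 60% and samples of n = 1,000 (the sample size is an assumption, not given in the card): roughly 95% of such intervals contain the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n, reps = 0.60, 1_000, 5_000      # illustrative values

covered = 0
for _ in range(reps):
    p_hat = rng.binomial(n, p_true) / n            # sample proportion
    se = np.sqrt(p_hat * (1 - p_hat) / n)          # standard error
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    covered += (low <= p_true <= high)

print(covered / reps)   # ~0.95: about 95% of the intervals catch the truth
```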
def sampling distribution
the theoretical distribution of a statistic (like a mean) across all possible random samples from a population.
It shows us how much variation we can expect just by chance, and it helps define the standard error
-> EX: farm with 1,000 eggs. You’re trying to estimate the average weight of all your eggs.
Step 1: Take a sample
You randomly pick 5 eggs, weigh them, and calculate the average:
Sample 1: 58g, 62g, 59g, 61g, 60g → Sample mean = 60g
Then you repeat this again and again (say, 1,000 times)…
Step 2: Collect all the sample means
You now have 1,000 sample means: 60g, 59g, 61g, 58.5g, 60.2g, 59.8g, …
You plot those 1,000 means on a graph. It forms a sampling distribution of the mean — a new curve showing how sample means vary.
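A sketch of the egg example (the weights are simulated around 60 g, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical farm: 1,000 egg weights centred near 60 g.
eggs = rng.normal(loc=60, scale=3, size=1_000)

# Repeatedly draw 5 eggs and record each sample mean.
means = np.array([rng.choice(eggs, size=5, replace=False).mean()
                  for _ in range(1_000)])

print(means.mean())        # close to the true average weight
print(means.std(ddof=1))   # spread of the sampling distribution (≈ SE)
```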
def distribution
describes how often different values of a variable occur. It can be:
Discrete (e.g., number of children: 0, 1, 2, …)
Continuous (e.g., height, weight, test scores)
Each distribution has:
A shape (e.g., bell-shaped, skewed)
A center (mean, median, or mode)
A spread (variance or standard deviation)
Graph: Distributions are often visualized with histograms or smooth curves (like the bell curve).
what is a normal distribution
a specific, widely used continuous distribution:
Bell-shaped and symmetric around the mean
Defined by two parameters:
-Mean (μ): the center
-Standard deviation (σ): how spread out the data are
🔢 The 68-95-99.7 Rule (Empirical Rule):
-About 68% of the values lie within ±1σ of the mean
-95% within ±2σ
-99.7% within ±3σ
🔁 Mode, Median, Mean
In a perfect normal distribution:
Mean = Median = Mode
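The 68-95-99.7 rule can be checked directly with the standard normal CDF (a quick sketch using scipy):

```python
from scipy.stats import norm

# Probability of falling within ±1, ±2, ±3 standard deviations of the mean.
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"±{k}σ: {prob:.3f}")
# ±1σ: 0.683, ±2σ: 0.954, ±3σ: 0.997
```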
what is standardization
to compare across different scales (ex: Math test scores out of 100 with Reaction times in seconds and Heights in centimeters), we use standardization:
A Z-score is a way to standardize a value — it tells you:
❝How far is this value from the average, measured in standard deviations?❞
Formula:
Z = (X − μ) / σ
Where:
X = your value (e.g., a score)
μ (mu) = the mean (average)
σ (sigma) = the standard deviation (how spread out the data are)
ex: The average test score is 70; The standard deviation is 10; A student scored 85
Then the Z-score is:
Z = (85 − 70) / 10 = 1.5
🔍 This means the student scored 1.5 standard deviations above the average.
The standard normal distribution has:
Mean = 0; SD = 1
This lets you apply probabilities and percentiles universally using the Z-table.
-> EX: 2 tests, one in math and one in history: your Math Z-score = 2 → you scored 2 standard deviations above the mean.
Your History Z-score = 1 → relative to each class, you did better in math, even though the raw scores are on different scales.
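A minimal helper reproducing the card's test-score example:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Example from the card: mean 70, SD 10, student scored 85.
print(z_score(85, 70, 10))   # 1.5 -> 1.5 SDs above average
```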
margin of error def
the range above and below your sample estimate where you expect the true population value to lie.
-> ME = Critical Value × SE; with:
*Critical Value depends on the confidence level (e.g., 95%, 99%) and comes from a probability distribution (z-distribution for large samples, t-distribution for small ones).
*SE is the standard error, which measures how much your sample statistic (e.g., the sample mean) would vary from sample to sample.
we have to trade off cost and precision when designing samples
Larger samples give more precise estimates (smaller SE → smaller ME).
But larger samples are more expensive or time-consuming to collect.
Hence, there’s a trade-off between the cost of collecting more data and the precision of the estimate.
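A sketch of the trade-off for a proportion, using the worst-case p = 0.5 (an assumption that gives the widest margin of error):

```python
import numpy as np

p = 0.5                                # worst-case proportion (widest ME)
for n in (100, 400, 1_000, 4_000):
    se = np.sqrt(p * (1 - p) / n)      # SE of a sample proportion
    me = 1.96 * se                     # 95% margin of error
    print(f"n={n:>5}  ME={me:.3f}")
# Quadrupling the sample only halves the margin of error:
# extra precision gets more and more expensive.
```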
what is the student’s t-distribution?
When the sample size is small (typically under 30, so the normal approximation is less reliable), we use the Student’s t-distribution instead of the normal distribution:
*The t-distribution is wider (more uncertainty) and depends on degrees of freedom (sample size - 1).
*The smaller the sample, the fatter the tails of the distribution, leading to larger critical values → wider confidence intervals.
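Comparing critical values with scipy shows how the t-distribution widens intervals for small samples:

```python
from scipy.stats import norm, t

print(f"z: {norm.ppf(0.975):.3f}")      # 1.960 for a 95% interval

for n in (5, 10, 30, 100):
    t_crit = t.ppf(0.975, df=n - 1)     # degrees of freedom = n - 1
    print(f"n={n:>3}  t: {t_crit:.3f}")
# Smaller n -> fatter tails -> larger critical value -> wider CI.
```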
what are population size; population mean; and pop standard deviation
-pop size: represented by N and assumed to be very large (because it is usually unknown)
-pop mean: represented by μ (the Greek letter mu) and often unknown
-pop standard deviation: represented by σ (the Greek letter sigma); measures variation in a population characteristic
what is the Z-score
-Converts raw scores into standard units
-Z score tells you how many standard deviations a value is from the mean in a normal distribution.
🔹 Example: In a population with mean μ = 100 and standard deviation σ = 15, if someone scores 130, then:
Z = (130 - 100)/15 = 2.0
This means the score is 2 standard deviations above the mean.
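A standard normal Z-score can also be turned into a percentile (a quick sketch using scipy's CDF):

```python
from scipy.stats import norm

z = (130 - 100) / 15      # the card's example -> 2.0
print(z)
print(norm.cdf(z))        # ~0.977: the score beats about 97.7% of the population
```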
what is the x̄ (“x bar”)
-The sample mean
-It’s the average value from your sample and is used to estimate the population mean (μ).
🔹 Example: If 5 people have heights of 160, 165, 170, 175, and 180 cm, then:
x̄ = (160 + 165 + 170 + 175 + 180) / 5 = 170 cm
what is SE
standard error
Measures how much the sample mean (x̄) differs from the true mean (μ) just by chance. SE tells us the likely “error” in using x̄ to estimate μ.
🔹 Example: If σ = 10 and n = 100, then SE = σ / √n = 10 / √100 = 1
what is s; n
-s = the sample standard deviation
-n = the sample size
what is the 95% confidence interval
A range likely to contain the true population mean 95% of the time. Calculated using: x̄ ± 1.96 × SE (if using Z).
🔹 Example: If x̄ = 50 and SE = 2, then 95% CI = 50 ± 1.96(2) = 50 ± 3.92 → [46.08, 53.92]
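The card's numbers, worked in a couple of lines (values taken from the example above):

```python
x_bar, se = 50, 2                           # values from the example
low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
print(f"95% CI: [{low:.2f}, {high:.2f}]")   # [46.08, 53.92]
```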
what is P
The proportion of the sample with a certain value (e.g., the % of surveyed voters who support a policy)
what is q; standard error of a sample proportion
The complement of p (1 – p). Used when estimating the standard error of proportions.
🔹 Example: If p = 0.2, then q = 1 – 0.2 = 0.8
How much p might vary from the real population proportion. Formula: SE = √[pq / n]
🔹 Example: If p = 0.2, q = 0.8, and n = 1000:
SE = √[(0.2)(0.8)/1000] = √[0.16/1000] ≈ 0.0126
This SE is then used in CI: p ± 1.96 × SE → 0.2 ± 0.025 → [0.175, 0.225]
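The same arithmetic as a short sketch (p and n taken from the card's example):

```python
import numpy as np

p, n = 0.2, 1_000                 # values from the example
q = 1 - p
se = np.sqrt(p * q / n)           # standard error of the proportion
me = 1.96 * se                    # 95% margin of error
print(f"SE={se:.4f}  CI=[{p - me:.3f}, {p + me:.3f}]")   # ≈ [0.175, 0.225]
```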
what is the difference btwn dispersion and distribution
- Distribution: The overall pattern of how data values are spread across the range of values.
*Focuses on: The shape, center, and spread of the data.
*Examples of distribution types:
-Normal distribution (bell-shaped)
-Skewed distribution (left/right)
-Uniform distribution (evenly spread)
-Bimodal distribution (two peaks)
- Dispersion: The degree to which data values vary from each other and from the average (mean/median).
*Focuses on: How spread out the values are within the distribution.
*Common measures of dispersion:
-Range (max - min)
-Variance
-Standard deviation
-Interquartile range (IQR)
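The common dispersion measures, computed on the height sample used earlier in this deck (reused here purely as an illustration):

```python
import numpy as np

data = np.array([160, 165, 170, 175, 180])   # heights in cm

print(data.max() - data.min())                            # range
print(data.var(ddof=1))                                   # sample variance
print(data.std(ddof=1))                                   # sample standard deviation
print(np.percentile(data, 75) - np.percentile(data, 25))  # interquartile range (IQR)
```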
what are the 2 types of distribution?
*Uniform distribution (used in random sampling; every value in the range is equally likely; described by its mean, the center, plus an SD)
*Bernoulli distribution (a single yes/no trial, e.g., a coin toss: 50%-50% heads or tails)
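A short sketch drawing from both (the sample sizes and the fair-coin p = 0.5 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Uniform: every value in the interval is equally likely.
u = rng.uniform(low=0, high=1, size=10_000)
print(u.mean())                 # ~0.5, the centre of the interval

# Bernoulli: a single yes/no trial, e.g. a fair coin (p = 0.5).
coin = rng.binomial(n=1, p=0.5, size=10_000)
print(coin.mean())              # ~0.5, the share of "heads"
```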