4. Foundations of Statistical Inference Flashcards
(22 cards)
def statistical inference (= drawing conclusions from data)
the process of using data from a sample to make estimates or decisions about a population (the full group you care about).
Imagine you want to know what all voters in a country think about a policy. Instead of asking all 50 million people, you survey just 1,500. Statistical inference helps you draw conclusions from those 1,500 responses, and estimate what the whole population thinks.
def population; sample; population parameter; sample statistics
-pop= the full set of people or things you’re interested in (e.g., all voters).
-sample= smaller group drawn from the population (e.g., 1,500 surveyed voters).
-pop parameter= a true but unknown value (e.g., % of all voters who support a policy).
- sample statistics= an estimate based on the sample (e.g., % of surveyed voters who support it).
def random sampling
A random sample means every member of the population had an equal chance of being selected.
This minimizes bias and makes your estimates more accurate.
what is the Central Limit Theorem (CLT)
When you take repeated random samples from a population and calculate their means, the distribution of those means (the sampling distribution of the mean) tends toward a normal distribution as the sample size increases — regardless of the shape of the population distribution.
2 cases:
*If the population is already normally distributed:
Then even small samples (e.g., less than 30) will produce sample means that are normally distributed.
*If the population is not normally distributed: Then you usually need a sample size of 30 or more for the CLT to apply and for the sample means to approximate a normal distribution.
It allows us to use the normal distribution for inference even when the population is not normally distributed.
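The CLT can be seen in a minimal simulation (population and sample sizes here are arbitrary choices, not from the card): even when the population is clearly skewed, the means of repeated samples of size 30 cluster around the population mean.

```python
import random
import statistics

random.seed(42)

# A skewed population: exponential distribution (clearly NOT normal).
population = [random.expovariate(1.0) for _ in range(100_000)]

# Draw many random samples of size 30 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

# By the CLT, the sample means cluster near the population mean,
# even though the individual values are skewed.
print(round(statistics.mean(population), 2))
print(round(statistics.mean(sample_means), 2))
```

Plotting `sample_means` as a histogram would show the roughly bell-shaped curve the card describes.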
what does the sampling distribution of the mean mean?
When you take repeated random samples from a population and calculate their means, it is the distribution of those means.
def random sampling error and standard error
*Random sampling error: the natural difference between the sample statistic (an estimate of the population parameter) and the actual population parameter.
*Standard error (SE): the typical size of that error across many samples — the standard deviation of the sampling distribution.
-Smaller SE → more precise estimate.
-SE gets smaller as the sample size increases.
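The two SE properties above follow from the formula SE = σ/√n; a quick sketch (σ = 15 is a hypothetical value):

```python
import math

sigma = 15.0  # hypothetical population standard deviation

# Standard error of the mean: SE = sigma / sqrt(n).
for n in (25, 100, 400):
    se = sigma / math.sqrt(n)
    print(f"n = {n:3d}  ->  SE = {se:.2f}")
# n =  25  ->  SE = 3.00
# n = 100  ->  SE = 1.50
# n = 400  ->  SE = 0.75
```

Note the pattern: quadrupling the sample size only halves the SE.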
what is the CI
Confidence interval: about precision and certainty (95% → see seminar CI (2))
gives a range in which we expect the true population parameter to lie.
For example, if 60% of a sample supports a policy, the 95% CI might be [57%, 63%].
This means we are 95% confident the true percentage of supporters in the whole population is between 57% and 63%.
another ex: A 95% confidence interval means that in the long run, 95% of such intervals will contain the true parameter
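The [57%, 63%] interval from the example can be reproduced with the standard formula for a proportion's CI (n = 1,000 is an assumed sample size, not stated in the card):

```python
import math

p_hat = 0.60   # sample proportion supporting the policy
n = 1000       # hypothetical sample size
z = 1.96       # critical value for a 95% confidence level

# SE of a proportion, then margin = z * SE.
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * se
print(f"95% CI: [{p_hat - margin:.1%}, {p_hat + margin:.1%}]")
# roughly [57%, 63%]
```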
def sampling distribution
the theoretical distribution of a statistic (like a mean) across all possible random samples from a population.
It shows us how much variation we can expect just by chance, and it helps define the standard error
-> EX of sampling distribution (here with the mean): farm with 1,000 eggs. You’re trying to estimate the average weight of all your eggs.
Step 1: Take a sample
You randomly pick 5 eggs, weigh them, and calculate the average:
Sample 1: 58g, 62g, 59g, 61g, 60g → Sample mean = 60g
Then you do it again and again…
Step 2: Collect all the sample means
You now have 1,000 sample means: 60g, 59g, 61g, 58.5g, 60.2g, 59.8g, …
You plot those 1,000 means on a graph. It forms a sampling distribution of the mean — a new curve showing how sample means vary.
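The egg example can be simulated directly (the weights are generated, so the exact numbers differ from the card, but the shape of the result is the same):

```python
import random
import statistics

random.seed(7)

# Hypothetical farm: 1,000 eggs weighing about 60 g on average.
eggs = [random.gauss(60, 2) for _ in range(1000)]

# Step 1 & 2: repeatedly sample 5 eggs and record each sample mean.
sample_means = [statistics.mean(random.sample(eggs, 5)) for _ in range(1000)]

# The means center on the true average, and their spread (the
# empirical SE, roughly sigma / sqrt(5)) is much narrower than
# the spread of individual egg weights.
print(round(statistics.mean(sample_means), 1))
print(round(statistics.stdev(sample_means), 2))
```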
def distribution
describes how often different values of a variable occur. It can be:
Discrete (e.g., number of children: 0, 1, 2, …)
Continuous (e.g., height, weight, test scores)
Each distribution has:
A shape (e.g., bell-shaped, skewed)
A center (mean, median, or mode)
A spread (variance or standard deviation)
Graph: Distributions are often visualized with histograms or smooth curves (like the bell curve).
difference btwn normal and population distribution
ex: Let’s say you’re studying public support for joining NATO in a region:
The population distribution is the real distribution of support across all people in that region. Maybe it’s skewed if one age group dominates.
You take a sample and assume that the sampling distribution of the mean (according to the Central Limit Theorem) is normally distributed, which lets you calculate a confidence interval.
what is a normal distribution
specific, continuous distribution:
Bell-shaped and symmetric around the mean
Defined by two parameters:
-Mean (μ): the center
-Standard deviation (σ): how spread out the data are
🔢 The 68-95-99.7 Rule (Empirical Rule):
-About 68% of the values lie within + or - 1σ of the mean
-95% within ±2σ
-99.7% within ±3σ
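The 68-95-99.7 rule can be checked empirically by simulating a standard normal distribution:

```python
import random

random.seed(0)
# 100,000 draws from a normal distribution with mean 0, SD 1.
values = [random.gauss(0, 1) for _ in range(100_000)]

shares = {}
for k in (1, 2, 3):
    shares[k] = sum(-k <= v <= k for v in values) / len(values)
    print(f"within ±{k} SD: {shares[k]:.1%}")
# Close to 68%, 95%, and 99.7% respectively.
```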
🔁 Mode, Median, Mean
In a perfect normal distribution: Mean = Median = Mode
ex: If you’re studying public support for sanctions in different EU countries and survey 1,000 people in each country, the distribution of responses (like support level from 1 to 10) might form a normal distribution. You could then say: “In most countries, support centers around 6, with few people giving very low (1–2) or very high (9–10) scores.”
what is standardization
to compare across different scales (ex: Math test scores out of 100 with Reaction times in seconds and Heights in centimeters), we use standardization:
A Z-score is a way to standardize a value — it tells you:
❝How far is this value from the average, measured in standard deviations?❞
Formula:
Z = (X − μ) / σ
Where:
X = your value (e.g., a score)
μ (mu) = the mean (average)
σ (sigma) = the standard deviation (how spread out the data are)
ex: The average test score is 70; The standard deviation is 10; A student scored 85
Then the Z-score is:
𝑍= (85−70)/10= 1.5
🔍 This means the student scored 1.5 standard deviations above the average.
The standard normal distribution has:
Mean = 0; SD = 1
This lets you apply probabilities and percentiles universally using the Z-table.
-> EX: 2 tests, one math and one history: Your Math Z-score = 2 → you scored 2 standard deviations above the mean
Your History Z-score = 1 → only 1 standard deviation above the mean, so the Math result is the more exceptional one even though the raw scales differ.
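The Z-score formula is simple enough to sketch directly, using the test-score numbers from the card:

```python
def z_score(x, mu, sigma):
    """Standardize a value: distance from the mean in SD units."""
    return (x - mu) / sigma

# Example from the card: mean 70, SD 10, a student scored 85.
print(z_score(85, 70, 10))  # 1.5

# Comparing across scales: two Z-scores are directly comparable
# even when the raw tests use different scales.
math_z = z_score(90, 70, 10)     # hypothetical raw score -> Z = 2.0
history_z = z_score(80, 70, 10)  # hypothetical raw score -> Z = 1.0
print(math_z > history_z)  # True
```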
margin of error def
the range above and below your sample estimate where you expect the true population value to lie.
-> ME = Critical Value × SE; with:
*Critical Value depends on the confidence level (e.g., 95%, 99%) and comes from a probability distribution (z-distribution for large samples, t-distribution for small ones).
*SE is the standard error, which measures how much your sample statistic (e.g., the sample mean) would vary from sample to sample.
we have to trade off cost and precision while designing samples
Larger samples give more precise estimates (smaller SE → smaller ME).
But larger samples are more expensive or time-consuming to collect.
Hence, there’s a trade-off between the cost of collecting more data and the precision of the estimate.
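The trade-off is visible in the ME formula (p = 0.5 is the worst case, chosen here to maximize the SE):

```python
import math

p, q = 0.5, 0.5   # worst-case proportion (maximizes the SE)
z = 1.96          # critical value for a 95% confidence level

# Margin of error = critical value x standard error.
for n in (100, 400, 1600, 6400):
    me = z * math.sqrt(p * q / n)
    print(f"n = {n:5d}  ->  ME = ±{me:.1%}")
# Halving the margin of error requires quadrupling the sample size,
# which is why precision gets expensive quickly.
```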
what is the student’s t-distribution?
When the sample size is small (typically under 30), we use the Student’s t-distribution instead of the normal distribution:
*The t-distribution is wider (more uncertainty) and depends on degrees of freedom (sample size - 1).
*The smaller the sample, the fatter the tails of the distribution, leading to larger critical values → wider confidence intervals.
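A sketch of how the fatter tails widen the interval, using well-known two-tailed 95% critical values from a standard t-table (hardcoded constants, not computed; s = 10 is a hypothetical sample SD):

```python
import math

# Two-tailed 95% critical values from a standard t-table (df = n - 1).
t_crit = {5: 2.571, 10: 2.228, 29: 2.045}
z_crit = 1.960  # normal-distribution critical value, for comparison

s = 10.0  # hypothetical sample standard deviation
for df in sorted(t_crit):
    n = df + 1
    margin_t = t_crit[df] * s / math.sqrt(n)
    margin_z = z_crit * s / math.sqrt(n)
    print(f"n = {n:2d}: t-margin ±{margin_t:.2f}  vs  z-margin ±{margin_z:.2f}")
# The smaller the sample, the larger the t critical value,
# and the wider the confidence interval.
```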
what are population size; population mean; x̄ (“x bar”); and pop standard deviation
-pop size: represented by N; assumed to be very large (because most of the time unknown)
-pop mean: represented by μ (Greek letter mu); often unknown
- x̄ (“x bar”): the sample mean
-pop standard deviation: represented by σ (Greek letter sigma); measures variation in a population characteristic
what is s; n; P; q
-s: sample standard deviation
-n: sample size
-p: proportion of the sample with a certain value
-q: 1 − p (the proportion of the sample without that value)
standard error of a sample proportion
How much p might vary from the real population proportion
Formula: SE = √[pq / n]
🔹 Example: If p = 0.2, q = 0.8, and n = 1000:
SE = √[(0.2)(0.8)/1000] = √[0.16/1000] ≈ 0.0126
This SE is then used in CI: p ± 1.96 × SE → 0.2 ± 0.025 → [0.175, 0.225]
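The numbers in this example check out in a few lines of Python:

```python
import math

p, n = 0.2, 1000
q = 1 - p  # 0.8

# Standard error of a sample proportion: SE = sqrt(pq / n).
se = math.sqrt(p * q / n)
print(round(se, 4))  # 0.0126

# 95% CI: p ± 1.96 × SE
margin = 1.96 * se
print(f"[{p - margin:.3f}, {p + margin:.3f}]")  # [0.175, 0.225]
```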
what is the difference btwn dispersion and distribution and deviation
- Distribution: The overall pattern of how data values are spread across the range of values.
*Focuses on: The shape, center, and spread of the data.
*Examples of distribution types:
-Normal distribution (bell-shaped)
-Uniform distribution (evenly spread)
-Bernoulli distribution (success-failure)
- Dispersion: The degree to which data values vary from each other and from the average (mean/median).
*Focuses on: How spread out the values are within the distribution.
*Common measures of dispersion:
-Range (max - min)
-Variance
-Standard deviation
-Interquartile range (IQR)
- Deviation: the difference between an individual data point and the mean (average) of the dataset.
what are the 2 types of distribution?
*uniform distribution (random sampling; mean is the center + sd)
*Bernoulli distribution (e.g., a coin flip: 50%-50% heads or tails)
what is population shape?
the distribution of values in the entire population. It can be:
Normal: bell-shaped and symmetric
Skewed: values are stretched more to one side (right or left)
Uniform: all values are equally likely
Bimodal or Multimodal: has two or more peaks
ex: normal: women’s height in a country (almost everyone around the mean, and few much taller or shorter)
Knowing the shape is useful for choosing appropriate statistical methods.
random sampling ≠ random assignment
- Random Sampling — WHO gets into the study
🔹 Definition:
A method of selecting individuals at random from a larger population so that each person has an equal chance of being included in the sample.
🔹 Purpose:
To ensure the sample represents the population → increases external validity (generalizability).
🔹 Example (IR context):
You want to study public opinion on international aid. You randomly select 1,000 citizens from across 50 countries to participate in a survey.
🎲 2. Random Assignment — WHO gets what treatment
🔹 Definition:
Once you have your sample, you randomly assign participants to different groups (e.g., treatment vs. control).
🔹 Purpose:
To ensure that differences between groups are due to the treatment, not pre-existing differences → increases internal validity (causal inference).
🔹 Example (IR context):
From your sample of 1,000, you randomly assign half to receive a news article about aid success, and the other half to receive a neutral article. Then you compare how opinions differ between groups.
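The two steps map onto two different operations in code (the population is just hypothetical citizen IDs):

```python
import random

random.seed(1)

# Hypothetical population of 10,000 citizens, identified by ID.
population = list(range(10_000))

# Random SAMPLING: WHO gets into the study (external validity).
sample = random.sample(population, 1000)

# Random ASSIGNMENT: WHO gets which treatment, within the sample
# (internal validity). Shuffle, then split in half.
random.shuffle(sample)
treatment, control = sample[:500], sample[500:]

print(len(treatment), len(control))  # 500 500
```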