1. Descriptive Statistics Flashcards
(28 cards)
variable def
An empirical measurement of a characteristic (something we want to observe or measure).
+ Every variable has a name and at least two values.
to find it:
-Name = what are we measuring? (e.g., “Country”, “Population”)
-Values = the actual data we observe for different cases (e.g., “Germany” or “83.2 million”)
def unit of analysis
who or what you are analyzing.
It’s the entity you want to describe using variables.
Examples:
-comparing countries → unit of analysis = country
-analyzing people → unit of analysis = individual
def levels of measurement
The ways we can categorize or quantify variables. They matter because different statistical methods apply depending on the level.
what are the levels of measurement in statistics?
nominal-level variables; ordinal-level variables; interval-level variables; ratio-level variables
explain Nominal-level variables
📌 Least precise
What it tells us: Differences between units only (categories)
Examples: Country name, religion, political party, gender
Numeric codes: Used only as labels, not quantities (e.g., 1 = Germany, 2 = France)
Key properties:
-Categories only
-Mutually exclusive (each unit fits into one category)
-Exhaustive (all possible options are listed)
✅ NO order or ranking
ordinal-level variables
📌 More precise than nominal
What it tells us: Relative ranking between units of analysis
Examples: Level of democracy (low, medium, high), education level, military power rank
Numeric codes: Reflect order (e.g., 1 = low, 2 = medium, 3 = high)
Key properties:
-Categories with rank order
-But we don’t know the exact distance between them
✅ Ranking possible, but intervals are unknown
interval-level variables
📌 Even more precise
What it tells us: Exact differences between values
Examples: Temperature in Celsius, years (e.g., 1945, 2001), IQ scores
Key properties:
-Numeric values with equal spacing between them
-Can calculate differences (e.g., 2020 − 2000 = 20 years)
-❌ No true zero (0 doesn’t mean absence)
✅ Addition/subtraction are valid, but not multiplication/division (ratios are not meaningful without a true zero)
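A small sketch of why ratios fail on an interval scale. The two temperatures are made-up illustration values; the point is only that the ratio changes when the zero point of the scale moves, while the difference does not.

```python
# Hypothetical temperatures in Celsius (interval level: 0 °C is not "no temperature").
cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = warm_c - cold_c  # 10 degrees

# The ratio depends on where the scale puts its zero, so it is not meaningful.
ratio_celsius = warm_c / cold_c                       # 2.0
ratio_kelvin = (warm_c + 273.15) / (cold_c + 273.15)  # about 1.035

print(difference, ratio_celsius, round(ratio_kelvin, 3))
```

The same two temperatures look "twice as warm" in Celsius but only about 3.5% warmer in Kelvin, which is why "2× as hot" is not a valid statement for Celsius values.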
ratio-level variable
📌 Most precise
What it tells us: Exact values, with a true zero
Examples: Population size, GDP, number of conflicts, age
Key properties:
-All features of interval-level
-True zero: 0 = none of the variable
-Can do all arithmetic operations (e.g., Country A has 2× the GDP of Country B)
✅ Discrete (whole numbers) or continuous (decimals)
Def descriptive statistics
They help us summarize and understand datasets using two kinds of measures: measures of central tendency and measures of dispersion (or spread)
measures of central tendency
tools used to find the center or typical value in a dataset
There are three main measures:
*mode (the value that appears most often in a dataset; e.g., if the dataset is blue, blue, red, the mode is blue because it appears most often)
*median (the middle value when the data is sorted in order; e.g., dataset: 1, 3, 5, 8, 10
Median = 5 (2 values below, 2 values above), but if the dataset is 1, 3, 5, 8, the median is the average of the two middle values: (3+5)/2 = 4)
*mean (add up all values, then divide by the number of values)
!!!A dataset can have more than one mode (this is called bimodal or multimodal).
If all values occur the same number of times, there’s no mode!!!
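The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch using the datasets from this card; the variable names are just illustrative.

```python
from statistics import mean, median, multimode

colors = ["blue", "blue", "red"]
odd = [1, 3, 5, 8, 10]
even = [1, 3, 5, 8]

mode_colors = multimode(colors)        # list of the most frequent value(s)
median_odd = median(odd)               # middle value of a sorted, odd-length list
median_even = median(even)             # average of the two middle values
mean_odd = mean(odd)                   # sum of values / number of values
two_modes = multimode([1, 1, 2, 2, 3])  # a bimodal dataset

print(mode_colors, median_odd, median_even, mean_odd, two_modes)
```

Note that `multimode` returns every value when all values tie, whereas this card's convention is to say such a dataset has no mode.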
what is the frequency distribution of nominal variables
(e.g., 30 married people out of 100)
A table that shows how often each category appears:
*Raw frequency = actual number of cases (e.g., 30 people are married)
*Total frequency = total number of all cases (ex 100)
*Proportion = raw frequency / total frequency (adds up to 1) (ex 30/100)
*Percentage = proportion × 100 (adds up to 100%) (ex 30/100 * 100=30%)
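A frequency table like this can be sketched with `collections.Counter`. Only the "30 married out of 100" figure comes from the card; the other categories and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical sample of 100 respondents (only the 30 "married" is from the card).
statuses = ["married"] * 30 + ["single"] * 50 + ["divorced"] * 20

counts = Counter(statuses)     # raw frequency per category
total = sum(counts.values())   # total frequency

table = {}
for category, raw in counts.items():
    proportion = raw / total
    percentage = proportion * 100
    table[category] = (raw, proportion, percentage)
    print(f"{category:10s} {raw:4d} {proportion:6.2f} {percentage:6.1f}%")
```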
what does a bar chart of a nominal variable look like?
X-axis = categories (labels/values)
Y-axis = frequencies or percentages
Height of each bar = how many cases fall into each category
what are the 3 types of dispersion?
(Variation in responses)
*Greatest dispersion = all categories equally common
-> every category has the same number of cases; there is no dominant category
*Least dispersion = all cases in one category
-> everyone picked the same category, so variation is at its minimum
*Variation ratio = proportion of cases not in the mode
Useful to gauge how spread out responses are
-> this is for categorical variables; for numerical variables, you use the standard deviation to see how spread out the data is
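The variation ratio can be sketched in a few lines. The function name `variation_ratio` and the example data are hypothetical; the formula is the one on this card (proportion of cases not in the mode).

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that are NOT in the modal category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)

equal_spread = ["a", "b", "c", "d"] * 25  # four equally common categories
one_category = ["a"] * 100                # everyone in the same category

vr_equal = variation_ratio(equal_spread)  # greatest dispersion for 4 categories
vr_single = variation_ratio(one_category)  # least dispersion
print(vr_equal, vr_single)
```

With four equally common categories the ratio is 0.75, the highest it can be for four categories; with a single category it is 0.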
what are the key measures of ordinal variables?
what are the 2 possible measures of dispersion?
what are the heuristics (mental shortcuts for solving problems quickly)?
Key measures:
*Median = 50th percentile (middle value)
*Mode = most frequent value
*Cumulative percentage = % at or below a given value
*Percentile = value below which a certain % of data falls
Dispersion:
-Range = distance between lowest and highest categories
-Interquartile Range (IQR) = middle 50% spread
Heuristics:
High dispersion = responses spread across categories
If mode and median are far apart, dispersion is likely high
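As an illustration of cumulative percentages and the median category, here is a sketch on invented ordinal data (democracy level for 10 hypothetical countries). It assumes the median category is the first one whose cumulative percentage reaches 50%, which is a common shortcut and not the only convention.

```python
from collections import Counter

# Hypothetical ordinal data; the category order must be given explicitly.
order = ["low", "medium", "high"]
levels = ["low"] * 2 + ["medium"] * 5 + ["high"] * 3

counts = Counter(levels)
total = len(levels)

# Cumulative percentage = % of cases at or below each category.
cumulative = {}
running = 0
for category in order:
    running += counts[category]
    cumulative[category] = 100 * running / total

# Median category: first category whose cumulative percentage reaches 50%.
median_category = next(c for c in order if cumulative[c] >= 50)
mode_category = counts.most_common(1)[0][0]

print(cumulative, median_category, mode_category)
```

Here mode and median fall in the same category ("medium"), which by the heuristic above suggests dispersion is not especially high.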
what are the measures of central tendency for interval variables?
what is skewness (asymmetry)?
what are the measures of dispersion?
Measures of Central Tendency:
Mean = average
Median = middle value
Mode = most frequent value
Skewness:
Positive skew (right): mean > median
Negative skew (left): mean < median
Dispersion:
Range = max - min
IQR = 75th percentile - 25th percentile
(Middle half of the data)
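The mean-versus-median rule for skew can be sketched as follows, using the news-watching responses that appear in the standard deviation card. Treat the comparison as a rule of thumb, not a formal skewness statistic.

```python
from statistics import mean, median

hours = [2, 4, 4, 6, 14]  # news-watching example from the standard deviation card

m, med = mean(hours), median(hours)

# Rule of thumb from this card: compare mean and median.
if m > med:
    skew = "positive (right)"
elif m < med:
    skew = "negative (left)"
else:
    skew = "roughly symmetric"

value_range = max(hours) - min(hours)  # range = max - min
print(m, med, skew, value_range)
```

The mean (6) sits above the median (4) because the single large value 14 pulls it to the right.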
what is standard deviation?
Measures how much individual values deviate from the mean
It tells you how “spread out” the data is
Five Steps to Calculate Standard Deviation (sample):
*Step 1: Find each value’s deviation from the mean (value − mean)
*Step 2: Square each deviation (value − mean)²
*Step 3: Sum all the squared deviations
∑(value − mean)²
*Step 4: Find the variance:
Variance = sum / (n − 1)
(we divide by n − 1 for sample data to correct bias)
*Step 5: Take the square root of the variance
SD = √variance
EXAMPLE: Let’s say we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]
s1: Find the Mean (average) mean= (2+4+4+6+14)/5=6
AND Find Deviations from the Mean:
2 − 6 = −4
4 − 6 = −2
4 − 6 = −2
6 − 6 = 0
14 − 6 = 8
-> deviations from the mean (value − mean) = −4; −2; −2; 0; 8
s2: square each deviation: (−4)² = 16; (−2)² = 4; (−2)² = 4; 0² = 0; 8² = 64
s3: sum of the squared deviations: 16 + 4 + 4 + 0 + 64 = 88
s4: find the variance: sum / (n − 1), so 88 / (5 − 1) = 22
s5: SD = √22 ≈ 4.69
-> the SD tells us roughly how far a typical data point lies from the mean (here, responses typically differ from the mean of 6 hours by about 4.69 hours).
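The five steps can be reproduced in Python and cross-checked against `statistics.stdev`, which also divides by n − 1. Variable names are illustrative only.

```python
from math import sqrt
from statistics import stdev

hours = [2, 4, 4, 6, 14]

mean_hours = sum(hours) / len(hours)           # 6.0
deviations = [x - mean_hours for x in hours]   # step 1: value - mean
squared = [d ** 2 for d in deviations]         # step 2: square each deviation
sum_sq = sum(squared)                          # step 3: sum of squared deviations
variance = sum_sq / (len(hours) - 1)           # step 4: divide by n - 1 (sample)
sd = sqrt(variance)                            # step 5: square root

print(deviations, sum_sq, variance, round(sd, 2))
print(round(stdev(hours), 2))  # library result for comparison
```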
what is a Boxplot (Box-and-Whiskers Plot)
explained step by step in the next cards
A compact summary of the distribution of a variable
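-> to create a boxplot (same data: [2, 4, 4, 6, 14]), we need:
*Q1 (25th percentile) = 4
Q2 (Median) = 4
Q3 (75th percentile) = 6
(here each half includes the median: [2, 4, 4] and [4, 6, 14]; the median-of-halves method in the quartile cards leaves the median out and gives Q1 = 3 and Q3 = 10 for the same data)
*IQR = Q3 − Q1 = 6 − 4 = 2
*Now calculate the fences, which limit how far the whiskers can reach:
Lower fence = Q1 − 1.5 × IQR = 4 − 3 = 1
Upper fence = Q3 + 1.5 × IQR = 6 + 3 = 9
The adjacent values (whisker ends) are the most extreme data points inside the fences: 2 and 6.
👉 Since 14 is above 9, it is an outlier.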
*Boxplot Summary:
Min (non-outlier): 2
Q1: 4
Q2 (Median): 4
Q3: 6
Max (non-outlier): 6
Outlier: 14
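The numbers in this summary can be reproduced with `statistics.quantiles`. The sketch assumes `method="inclusive"`, which happens to give Q1 = 4 and Q3 = 6 for this dataset; other quartile conventions give different values.

```python
from statistics import quantiles

hours = sorted([2, 4, 4, 6, 14])

# Assumption: the "inclusive" method matches the quartiles used on this card.
q1, q2, q3 = quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which points count as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits.
inside = [x for x in hours if lower_limit <= x <= upper_limit]
outliers = [x for x in hours if x < lower_limit or x > upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr, lower_limit, upper_limit)
print(whisker_low, whisker_high, outliers)
```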
what are quartiles
they help describe how a data set is divided into four equal parts when sorted in order. They are especially useful in identifying the spread of the data and spotting outliers.
*Q1 (First Quartile / 25th percentile)
It marks the point below which 25% of the data falls.
*Q2 (Second Quartile / Median / 50th percentile)
This is the middle value of your data.
It splits the dataset in half — 50% of the values are below it, 50% above.
*Q3 (Third Quartile / 75th percentile)
This is the median of the upper half of your data.
It marks the point below which 75% of the data falls.
=> In the example, we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]
(here the median itself is left out of both halves)
*Q1: Lower half of the data (before the median) = [2, 4]
The median of [2, 4] is 3, so Q1 = 3
*Q2: Since there are 5 values, the middle one is the 3rd: Q2 = 4
*Q3: Upper half of the data (after the median) = [6, 14]
The median of [6, 14] is 10, so Q3 = 10
what is IQR
Interquartile Range
IQR = Q3 − Q1
→ This gives us the range of the middle 50% of the data
In our example:
IQR = 10 − 3 = 7
It tells you how spread out the central portion of the data is.
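how can we find outliers with quartiles?
We use the IQR rule:
Lower Bound = Q1 − 1.5 × IQR = 3 − 1.5 × 7 = −7.5
Upper Bound = Q3 + 1.5 × IQR = 10 + 1.5 × 7 = 20.5
Any value outside this range is an outlier.
With these quartiles, 14 lies inside the range and is not flagged; with the boxplot card's Q1 = 4 and Q3 = 6, the bounds are 1 and 9 and 14 is an outlier. The result depends on the quartile method.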
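A sketch of the IQR rule with Q1 = 3 and Q3 = 10. It assumes `statistics.quantiles` with `method="exclusive"`, which gives the same quartiles as the median-of-halves calculation for this particular dataset; the two are not guaranteed to agree on other data.

```python
from statistics import quantiles

hours = [2, 4, 4, 6, 14]

# Assumption: "exclusive" reproduces Q1 = 3 and Q3 = 10 for this dataset.
q1, q2, q3 = quantiles(hours, n=4, method="exclusive")
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in hours if x < lower_bound or x > upper_bound]

print(q1, q2, q3, iqr, lower_bound, upper_bound, outliers)
```

With these bounds no response is flagged, which shows how much the choice of quartile method matters in a very small sample.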
Histogram vs Bar Chart (data type; bars touch?; purpose)
Histogram vs Bar Chart:
Data Type: Interval / numerical vs Nominal / categorical
X-axis: Continuous scale (numbers) vs Discrete categories (labels)
Bars Touch?: Yes (shows continuity) vs No (categories are separate)
Purpose: Show distribution of values vs Compare category frequencies
what is the Kernel Density Plot
A smooth curve that shows the estimated distribution of a variable (smoothed version of a histogram)
-> Helps visualize where values are concentrated over a range
📍Why use it?
It removes the “bin” problem in histograms
Better at showing the shape of the distribution, especially for large data sets
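To make the idea concrete, here is a minimal Gaussian kernel density estimate in plain Python. The function name `gaussian_kde` and the bandwidth of 1.5 are assumptions chosen for illustration; note that the bandwidth is itself a smoothing choice, so the curve avoids bin edges but still depends on one tuning parameter.

```python
from math import exp, pi, sqrt

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(data)

    def density(x):
        # Each data point contributes a small bell curve centred on itself.
        total = sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / (n * bandwidth * sqrt(2 * pi))

    return density

hours = [2, 4, 4, 6, 14]
density = gaussian_kde(hours, bandwidth=1.5)  # bandwidth picked arbitrarily

# Crude text plot: longer bars where values are concentrated.
for x in range(0, 17, 2):
    print(f"{x:2d} {'#' * round(density(x) * 100)}")
```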
what is the “bin” problem of histograms
A histogram groups data into bins (intervals). Each bin covers a range of values, and the height of the bar shows how many values fall into that range.
-> problem:
*Binning is arbitrary: you choose how wide the bins are (e.g., 0–5, 6–10… or 0–10, 11–20…).
*Different bin widths can change how the data looks. They can:
-Hide important patterns (if bins are too wide)
-Exaggerate randomness or noise (if bins are too narrow)
-Make a skewed distribution look symmetric, or vice versa
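The effect of bin width can be seen by counting the same data with two different widths. The ages below are invented, and the `histogram` helper is a hypothetical sketch for integer data, not a library function.

```python
from collections import Counter

def histogram(data, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    return Counter((x // bin_width) * bin_width for x in data)

# Hypothetical ages of 12 respondents.
ages = [18, 19, 21, 22, 29, 30, 31, 38, 41, 42, 58, 59]

for width in (5, 20):
    counts = histogram(ages, width)
    print(f"bin width {width}:")
    for start in sorted(counts):
        print(f"  {start:2d}-{start + width - 1:2d} {'#' * counts[start]}")
```

With narrow bins the picture looks flat and noisy; with wide bins the same data appears as one dominant block in the middle.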
how do outliers affect the mean and the median?
Median: Resistant to outliers (not affected much by extreme values)
Mean: Sensitive to outliers (one very high or very low value can pull it up or down)
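A quick check using the news-watching responses from the earlier cards, with and without the extreme value 14. The variable names are illustrative.

```python
from statistics import mean, median

hours = [2, 4, 4, 6, 14]
without_extreme = [2, 4, 4, 6]  # same responses with the value 14 dropped

mean_all, median_all = mean(hours), median(hours)
mean_trimmed, median_trimmed = mean(without_extreme), median(without_extreme)

print("with 14:   ", mean_all, median_all)
print("without 14:", mean_trimmed, median_trimmed)
```

Dropping the single extreme value moves the mean from 6 to 4, while the median stays at 4.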