1. Descriptive Statistics Flashcards
(33 cards)
variable def
An empirical measurement of a characteristic (something we want to observe or measure).
+ Every variable has a name and at least two values.
to find it:
-Name = what are we measuring? (e.g., “Country”, “Population”)
-Values = the actual data we observe for different cases (e.g., “Germany” or “83.2 million”)
def unit of analysis
who or what you are analyzing.
It’s the entity you want to describe using variables.
Examples:
-comparing countries → unit of analysis = country
-analyzing people → unit of analysis = individual
def levels of measurement
The ways we can categorize or quantify variables. They matter because different statistical methods apply depending on the level.
what are the levels of Measurement in Statistics
nominal-level variables; ordinal-level variables; interval-level variables; ratio-level variables
explain Nominal-level variables
📌 Least precise
What it tells us: Differences between units only (categories)
Examples: Country name, religion, political party, gender
Numeric codes: Used only as labels, not quantities (e.g., 1 = Germany, 2 = France)
Key properties:
-Categories only
-Mutually exclusive (each unit fits into one category)
-Exhaustive (all possible options are listed)
✅ NO order or ranking
ordinal-level variables
📌 More precise than nominal
What it tells us: Relative ranking between units of analysis
Examples: Level of democracy (low, medium, high), education level, military power rank
Numeric codes: Reflect order (e.g., 1 = low, 2 = medium, 3 = high)
Key properties:
-Categories with rank order
-But we don’t know the exact distance between them
✅ Ranking possible, but intervals are unknown
interval-level variables
📌 Even more precise
What it tells us: Exact differences between values
Examples: Temperature in Celsius, years (e.g., 1945, 2001), IQ scores
Key properties:
-Numeric values with equal spacing between them
-Can calculate differences (e.g., 2020 − 2000 = 20 years)
-❌ No true zero (0 doesn’t mean absence)
✅ Use of addition/subtraction valid, but not multiplication/division
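A small sketch of why ratios are not meaningful at the interval level. The two temperatures are made-up values, and the Kelvin conversion is brought in only for comparison (it is not part of the cards):

```python
# Hypothetical temperatures in Celsius, chosen only for illustration.
celsius_a, celsius_b = 10.0, 20.0

# Differences are meaningful at the interval level.
difference = celsius_b - celsius_a          # 10 degrees

# The ratio depends on where zero sits, so it changes with the scale.
ratio_celsius = celsius_b / celsius_a       # 2.0, but 20°C is not "twice as hot"

# Same temperatures on the Kelvin scale, which does have a true zero.
kelvin_a, kelvin_b = celsius_a + 273.15, celsius_b + 273.15
ratio_kelvin = kelvin_b / kelvin_a          # about 1.035
difference_kelvin = kelvin_b - kelvin_a     # still 10 degrees

print(difference, ratio_celsius, round(ratio_kelvin, 3))
```

The difference survives the change of scale, the ratio does not.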
ratio-level variable
📌 Most precise
What it tells us: Exact values, with a true zero
Examples: Population size, GDP, number of conflicts, age
Key properties:
-All features of interval-level
-True zero: 0 = none of the variable
-Can do all arithmetic operations (e.g., Country A has 2× the GDP of Country B)
✅ Discrete (whole numbers) or continuous (decimals)
Def descriptive statistics
help us summarize and understand datasets by describing two things: measures of central tendency and Measures of Dispersion (or Spread)
measures of central tendency
tools used to find the center or typical value in a dataset
There are three main measures:
*mode (the value that appears most often in a dataset; e.g., for the dataset blue, blue, red, the mode is blue because it appears most often)
*median (the middle value when the data is sorted in order; e.g., for the dataset 1, 3, 5, 8, 10,
Median = 5 (2 values below, 2 values above); with an even number of values, such as 1, 3, 5, 8, it is the average of the two middle values: (3+5)/2 = 4)
*mean (Add up all values, then divide by the number of values)
!!!A dataset can have more than one mode (this is called bimodal or multimodal).
If all values occur the same number of times, there’s no mode!!!
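The three measures can be checked with Python's standard `statistics` module. This is only a sketch using the datasets from the card; the bimodal list is a made-up example:

```python
import statistics

odd = [1, 3, 5, 8, 10]        # odd number of values
even = [1, 3, 5, 8]           # even number of values
colors = ["blue", "blue", "red"]

median_odd = statistics.median(odd)     # middle value: 5
median_even = statistics.median(even)   # average of 3 and 5: 4.0
mean_odd = statistics.mean(odd)         # (1+3+5+8+10)/5 = 5.4

# multimode returns every value tied for most frequent.
modes = statistics.multimode(colors)             # ['blue']
bimodal = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2] (hypothetical data)

# When every value occurs once, multimode returns all of them;
# the card treats that case as "no mode".
no_repeat = statistics.multimode([1, 2, 3])

print(median_odd, median_even, mean_odd, modes, bimodal, no_repeat)
```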
what is the frequency distribution of nominal variables
(ex: 30 married ppl out of 100)
A table that shows how often each category appears:
*Raw frequency = actual number of cases (e.g., 30 people are married)
*Total frequency = total number of all cases (ex 100)
*Proportion = raw frequency / total frequency (adds up to 1) (ex 30/100)
*Percentage = proportion × 100 (adds up to 100%) (ex 30/100 * 100=30%)
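A possible way to build such a table in Python. The 30 married cases out of 100 come from the card; the split of the remaining 70 cases is an assumption made only so the table has more than one row:

```python
from collections import Counter

# 30 married is from the card; "single" and "divorced" counts are made up.
statuses = ["married"] * 30 + ["single"] * 50 + ["divorced"] * 20

raw = Counter(statuses)        # raw frequency per category
total = sum(raw.values())      # total frequency (100)

# For each category: (raw frequency, proportion, percentage)
table = {cat: (n, n / total, 100 * n / total) for cat, n in raw.items()}

for cat, (n, prop, pct) in table.items():
    print(f"{cat:10s} {n:4d} {prop:6.2f} {pct:6.1f}%")
```

The proportions add up to 1 and the percentages to 100, as stated on the card.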
def bar chart
only for nominal-level variables (or ordinal if you want just to compare)
X-axis = categories (labels/values)
Y-axis = frequencies or percentages
Height of each bar = how many cases fall into each category
what are the 3 types of dispersion?
(Variation in responses)
*Greatest dispersion = all categories equally common
-> every category has the same number of cases, so no category dominates
*Least dispersion = all cases in one category
-> everyone picked the same category so minimum variation
*Variation ratio = proportion of cases not in the mode
Useful to gauge how spread out responses are
-> this is for categorical variables; for numerical variables, use the standard deviation to see how spread out the data is
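The variation ratio can be sketched as a short function. The function name and the three sample datasets are illustrative assumptions, not part of the cards:

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that are NOT in the modal category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)

# Least dispersion: everyone in one category.
least = variation_ratio(["a"] * 10)

# One clearly dominant category (90 of 100 cases).
dominant = variation_ratio(["a"] * 90 + ["b"] * 10)

# Greatest dispersion: three categories, all equally common.
greatest = variation_ratio(["a", "b", "c"] * 5)

print(least, round(dominant, 2), round(greatest, 2))
```

With k equally common categories the ratio reaches its maximum of 1 − 1/k.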
what are the key measures of ordinal variables?
what are the 2 possible measures of dispersion?
what are the heuristics (mental shortcuts for solving problems in a quick way)?
*Median = 50th percentile (middle value)
*Mode = most frequent value
*Cumulative percentage = % at or below a given value
*Percentile = value below which a certain % of data falls
-Range = distance between lowest and highest categories
-Interquartile Range (IQR) = middle 50% spread
High dispersion = responses spread across categories
If mode and median are far apart, dispersion is likely high
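A sketch of how cumulative percentages lead to the median category of an ordinal variable. The counts for low, medium and high are invented for illustration:

```python
# Categories listed in rank order; the counts are hypothetical.
levels = ["low", "medium", "high"]
counts = {"low": 20, "medium": 50, "high": 30}

total = sum(counts.values())

# Cumulative percentage = % of cases at or below each category.
cumulative = {}
running = 0
for level in levels:
    running += counts[level]
    cumulative[level] = 100 * running / total

# Median = first category whose cumulative percentage reaches 50.
median_level = next(l for l in levels if cumulative[l] >= 50)

# Mode = category with the most cases.
mode_level = max(counts, key=counts.get)

print(cumulative, median_level, mode_level)
```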
what are the central tendencies of interval variables?
what are the types of skewness (asymmetry)?
what is the dispersion?
Measures of Central Tendency:
Mean = average
Median = middle value
Mode = most frequent value
Skewness:
Positive skew (right): mean > median
Negative skew (left): mean < median
Dispersion:
Range = max - min
IQR = 75th percentile - 25th percentile
(Middle half of the data)
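A quick check of the mean-versus-median rule, using the news-hours data [2, 4, 4, 6, 14] from the standard deviation card. Comparing mean and median is only a rule of thumb for the direction of skew, not a full skewness statistic:

```python
import statistics

hours = [2, 4, 4, 6, 14]

mean = statistics.mean(hours)          # 6
median = statistics.median(hours)      # 4
data_range = max(hours) - min(hours)   # 14 - 2 = 12

# Direction of skew from the rule on the card.
if mean > median:
    skew = "positive (right)"
elif mean < median:
    skew = "negative (left)"
else:
    skew = "none indicated"

print(mean, median, data_range, skew)
```

The single large value (14) pulls the mean above the median, which is what a right skew looks like.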
what Measures of Dispersion for which variable level?
Nominal (only categories so none, maybe variation ratio)
ordinal (range or IQR)
interval (range, IQR, sd, variance)
when is dispersion low, depending on the level of the variable?
*Nominal: low dispersion when one mode is clearly dominant (if most of the data fall into a single category, e.g., 90% identify with one religion, there is little variation among cases: almost everyone is in the same group)
*Ordinal: low dispersion when responses are clustered around the median rank (if most responses are near the same rank, e.g., mostly “satisfied” or “neutral”, the range of ranks is small and the data is concentrated)
*Interval: low dispersion when values are clustered around the mean (when values don’t stray far from the mean, the standard deviation is small: the data points are close together)
what about skew, depending on the level?
nominal: not applicable (categories have no order)
ordinal: can be + or -
interval: can be + or - AND quantifiable
what is standard deviation?
Measures how much individual values deviate from the mean
It tells you how “spread out” the data is
Five Steps to Calculate Standard Deviation (sample):
*Step 1: Find each value’s deviation from the mean (value − mean)
*Step 2: Square each deviation (value − mean)²
*Step 3: Sum all the squared deviations
∑(value − mean)²
*Step 4: Find the variance:
Variance = sum / (n − 1)
(we divide by n − 1 for sample data to correct bias)
*Step 5: Take the square root of the variance
SD = √variance
EXAMPLE: Let’s say we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]
s1: Find the Mean (average) mean= (2+4+4+6+14)/5=6
AND Find Deviations from the Mean:
2 − 6 = −4
4 − 6 = −2
4 − 6 = −2
6 − 6 = 0
14 − 6 = 8
-> deviations from the mean (value − mean) = −4; −2; −2; 0; 8
s2: square each deviation: (−4)²=16; (−2)²=4; (−2)²=4; 0²=0; 8²=64
s3: sum of the squared deviations: 16+4+4+0+64=88
s4: find the variance: sum/(n-1) so 88/ (5-1)= 22
s5: SD= √22 =4.69
-> so it tells us roughly how far a typical data point lies from the mean (here, responses typically fall about 4.69 hours above or below the mean of 6 hours).
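The five steps can be followed line by line in Python. This is a sketch using the card's data; the final comparison with `statistics.stdev` (the standard library's sample standard deviation) is only a cross-check:

```python
import math
import statistics

hours = [2, 4, 4, 6, 14]

mean = sum(hours) / len(hours)                 # 30 / 5 = 6.0

# Step 1: deviation of each value from the mean
deviations = [x - mean for x in hours]         # -4, -2, -2, 0, 8

# Step 2: square each deviation
squared = [d ** 2 for d in deviations]         # 16, 4, 4, 0, 64

# Step 3: sum the squared deviations
sum_sq = sum(squared)                          # 88.0

# Step 4: sample variance, dividing by n - 1
variance = sum_sq / (len(hours) - 1)           # 88 / 4 = 22.0

# Step 5: square root of the variance
sd = math.sqrt(variance)                       # about 4.69

# Cross-check against the standard library.
check = statistics.stdev(hours)

print(round(sd, 2), round(check, 2))
```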
what is a Boxplot (Box-and-Whiskers Plot)
explained step by step in the next cards
A compact summary of the distribution of a variable
-> to create a boxplot, we need (example data: [2, 4, 4, 6, 14]; here the median is counted in both halves when finding Q1 and Q3):
*Q1 (25th percentile) = 4
Q2 (Median) = 4
Q3 (75th percentile) = 6
*IQR = Q3 − Q1 = 6 − 4 = 2
*Now calculate the fences (the limits for the whiskers; each whisker ends at the most extreme data value inside its fence):
Lower fence = Q1 − 1.5 × IQR = 4 − 3 = 1
Upper fence = Q3 + 1.5 × IQR = 6 + 3 = 9
👉 Since 14 is above 9, it is an outlier.
*Boxplot Summary:
Min (non-outlier): 2
Q1: 4
Q2 (Median): 4
Q3: 6
Max (non-outlier): 6
Outlier: 14
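The boxplot summary can be reproduced in a few lines. This sketch assumes the "inclusive" quartile method of `statistics.quantiles`, which gives Q1 = 4 and Q3 = 6 for this dataset; other quartile methods give different values:

```python
import statistics

hours = [2, 4, 4, 6, 14]

# Quartiles with the inclusive method (matches Q1 = 4, Q2 = 4, Q3 = 6).
q1, q2, q3 = statistics.quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1                                   # 2.0

# Limits beyond which a value counts as an outlier.
lower_limit = q1 - 1.5 * iqr                    # 1.0
upper_limit = q3 + 1.5 * iqr                    # 9.0

outliers = [x for x in hours if x < lower_limit or x > upper_limit]
inside = [x for x in hours if lower_limit <= x <= upper_limit]

# The whiskers end at the most extreme non-outlier values.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr, whisker_low, whisker_high, outliers)
```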
what are quartiles
they help describe how a data set is divided into four equal parts when sorted in order. They are especially useful in identifying the spread of the data and spotting outliers.
*Q1 (First Quartile / 25th percentile)
It marks the point below which 25% of the data falls.
*Q2 (Second Quartile / Median / 50th percentile)
This is the middle value of your data.
It splits the dataset in half — 50% of the values are below it, 50% above.
*Q3 (Third Quartile / 75th percentile)
This is the median of the upper half of your data.
It marks the point below which 75% of the data falls.
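=> In the example where we surveyed 5 people and asked how many hours of political news they watch per week, the responses were: [2, 4, 4, 6, 14]
*Q1: Lower half of the data (before the median, median itself excluded) = [2, 4]
The median of [2, 4] is 3
*Q2: Since there are 5 values, the middle one is the 3rd: Q2=4
*Q3: Upper half of the data (after the median, median itself excluded) = [6, 14]
The median of [6, 14] is 10
(Counting the median in both halves instead gives Q1 = 4 and Q3 = 6, the values used in the boxplot card; both methods are in common use.)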
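The median-of-halves method used on this card can be sketched as a small function. The function name is hypothetical; for this dataset, the "exclusive" method of `statistics.quantiles` returns the same quartiles:

```python
import statistics

def quartiles_median_of_halves(values):
    """Q1, Q2, Q3, leaving the median out of both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when the number of values is odd.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

hours = [2, 4, 4, 6, 14]
q1, q2, q3 = quartiles_median_of_halves(hours)   # 3.0, 4, 10.0

# Standard-library quartiles with the exclusive method, for comparison.
exclusive = statistics.quantiles(hours, n=4, method="exclusive")

print(q1, q2, q3, exclusive)
```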
what is IQR
Interquartile Range
IQR = Q3 − Q1
→ This gives us the range of the middle 50% of the data
In our example:
IQR = 10 − 3 = 7
It tells you how spread out the central portion of the data is.
how can we find outliers with quartiles?
We use the IQR rule:
Lower Bound = Q1 − 1.5 × IQR = 3 − 1.5×7 = −7.5
Upper Bound = Q3 + 1.5 × IQR = 10 + 10.5 = 20.5
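Any value outside this range is an outlier. Here, all five values of [2, 4, 4, 6, 14] lie between −7.5 and 20.5, so none is flagged with these quartiles.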
Histogram vs Bar Chart (data type; bars touch?; purpose)
Histogram vs Bar Chart:
Data Type: numerical vs categorical
X-axis: Continuous scale (numbers) vs Discrete categories (labels)
Bars Touch?: Yes (shows continuity) vs No (categories are separate)
Purpose: Show distribution of values vs Compare category frequencies