1. Descriptive Statistics Flashcards

(28 cards)

1
Q

Definition of a variable

A

An empirical measurement of a characteristic (something we want to observe or measure).
Every variable has a name and at least two values.

To identify it:
-Name = what are we measuring? (e.g., “Country”, “Population”)
-Values = the actual data we observe for different cases (e.g., “Germany” or “83.2 million”)

2
Q

Definition of unit of analysis

A

Who or what you are analyzing.
It’s the entity you want to describe using variables.

Examples:
-comparing countries → unit of analysis = country
-analyzing people → unit of analysis = individual

3
Q

Definition of levels of measurement

A

The ways we can categorize or quantify variables. They matter because different statistical methods apply depending on the level.

4
Q

What are the levels of measurement in statistics?

A

Nominal-level, ordinal-level, interval-level, and ratio-level variables

5
Q

explain Nominal-level variables

A

📌 Least precise
What it tells us: Differences between units only (categories)
Examples: Country name, religion, political party, gender
Numeric codes: Used only as labels, not quantities (e.g., 1 = Germany, 2 = France)
Key properties:
-Categories only
-Mutually exclusive (each unit fits into one category)
-Exhaustive (all possible options are listed)
✅ NO order or ranking

6
Q

ordinal-level variables

A

📌 More precise than nominal
What it tells us: Relative ranking between units of analysis
Examples: Level of democracy (low, medium, high), education level, military power rank
Numeric codes: Reflect order (e.g., 1 = low, 2 = medium, 3 = high)
Key properties:
-Categories with rank order
-But we don’t know the exact distance between them
✅ Ranking possible, but intervals are unknown

7
Q

interval-level variables

A

📌 Even more precise
What it tells us: Exact differences between values
Examples: Temperature in Celsius, years (e.g., 1945, 2001), IQ scores
Key properties:
-Numeric values with equal spacing between them
-Can calculate differences (e.g., 2020 − 2000 = 20 years)
-❌ No true zero (0 doesn’t mean absence)
✅ Addition and subtraction are valid, but ratios are not meaningful (20 °C is not "twice as hot" as 10 °C)

8
Q

ratio-level variable

A

📌 Most precise
What it tells us: Exact values, with a true zero
Examples: Population size, GDP, number of conflicts, age
Key properties:
-All features of interval-level
-True zero: 0 = none of the variable
-Can do all arithmetic operations (e.g., Country A has 2× the GDP of Country B)
✅ Discrete (whole numbers) or continuous (decimals)

9
Q

Definition of descriptive statistics

A

Descriptive statistics help us summarize and understand datasets by describing two things: measures of central tendency and measures of dispersion (or spread).

10
Q

measures of central tendency

A

Tools used to find the center or typical value in a dataset.
There are three main measures:
*Mode: the value that appears most often in a dataset (e.g., dataset: blue, blue, red → the mode is blue because it appears most often)
*Median: the middle value when the data are sorted (e.g., dataset: 1, 3, 5, 8, 10 → median = 5, with 2 values below and 2 values above; for an even number of values such as 1, 3, 5, 8, the median is the mean of the two middle values: (3 + 5)/2 = 4)
*Mean: add up all values, then divide by the number of values

Note: a dataset can have more than one mode (this is called bimodal or multimodal).
If all values occur the same number of times, there is no mode.
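
The three measures above can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the small datasets from this card; the five-value list passed to `multimode` is a made-up example of a bimodal dataset.

```python
import statistics

colors = ["blue", "blue", "red"]   # nominal data from the card
odd = [1, 3, 5, 8, 10]             # odd number of values
even = [1, 3, 5, 8]                # even number of values

mode_value = statistics.mode(colors)     # most frequent value
median_odd = statistics.median(odd)      # middle value of the sorted data
median_even = statistics.median(even)    # mean of the two middle values
mean_odd = statistics.mean(odd)          # sum of values / number of values

# multimode returns every value tied for most frequent (hypothetical data).
# If all values are equally frequent it returns all of them, which this
# card's convention reads as "no mode".
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mode_value, median_odd, median_even, mean_odd, modes)
```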

11
Q

What is the frequency distribution of a nominal variable?

A

A table that shows how often each category appears (e.g., 30 married people out of 100):
*Raw frequency = actual number of cases (e.g., 30 people are married)
*Total frequency = total number of all cases (e.g., 100)
*Proportion = raw frequency / total frequency; proportions add up to 1 (e.g., 30/100 = 0.30)
*Percentage = proportion × 100; percentages add up to 100% (e.g., 30/100 × 100 = 30%)
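
As a rough illustration, a frequency table like this can be built with `collections.Counter`. Only the "30 married out of 100" figure comes from the card; the other categories and their counts are invented to fill the sample.

```python
from collections import Counter

# Hypothetical sample of 100 cases; only the 30 married cases come from the card
statuses = ["married"] * 30 + ["single"] * 50 + ["divorced"] * 20

raw = Counter(statuses)            # raw frequency per category
total = sum(raw.values())          # total frequency

proportions = {cat: n / total for cat, n in raw.items()}
percentages = {cat: 100 * n / total for cat, n in raw.items()}

for cat in raw:
    print(cat, raw[cat], proportions[cat], percentages[cat])
```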

12
Q

What does a bar chart of a nominal variable show?

A

X-axis = categories (labels/values)

Y-axis = frequencies or percentages

Height of each bar = how many cases fall into each category

13
Q

What are the 3 ways to describe the dispersion of a categorical variable?

A

(Dispersion = variation in responses)
*Greatest dispersion = all categories equally common
-> every category has the same number of cases; there is no dominant category
*Least dispersion = all cases in one category
-> everyone picked the same category, so variation is minimal
*Variation ratio = proportion of cases not in the mode

Useful to gauge how spread out responses are.
-> This applies to categorical variables; for numerical variables, use the standard deviation to see how spread out the data is.
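
The variation ratio can be sketched in a few lines. The function name `variation_ratio` and both datasets are illustrative assumptions chosen to show the two extremes described on this card.

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that are NOT in the modal category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)

all_same = ["A"] * 6                   # least dispersion: one category only
evenly_spread = ["A", "B", "C"] * 2    # greatest dispersion for 3 categories

vr_min = variation_ratio(all_same)         # 0.0
vr_max = variation_ratio(evenly_spread)    # 1 - 2/6, about 0.67

print(vr_min, vr_max)
```

With k equally common categories the ratio reaches its maximum of 1 − 1/k, so the upper limit depends on how many categories the variable has.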

14
Q

What are the key measures for ordinal variables?
What are the 2 possible measures of dispersion?
What are the heuristics (mental shortcuts for solving problems quickly)?

A

Key measures:
*Median = 50th percentile (middle value)
*Mode = most frequent value
*Cumulative percentage = % at or below a given value
*Percentile = value below which a certain % of data falls

Dispersion:
-Range = distance between lowest and highest categories
-Interquartile Range (IQR) = middle 50% spread

Heuristics:
High dispersion = responses spread across categories
If mode and median are far apart, dispersion is likely high
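
A small sketch of cumulative percentages for an ordinal variable follows. The three ordered levels echo the "low, medium, high" coding used for ordinal variables, and the response counts are hypothetical. The median category is taken as the first level whose cumulative percentage reaches 50%.

```python
from collections import Counter

levels = ["low", "medium", "high"]    # ordered categories
# Hypothetical responses for 10 cases
responses = ["low"] * 2 + ["medium"] * 5 + ["high"] * 3

counts = Counter(responses)
total = len(responses)

# Cumulative percentage: share of cases at or below each level
cumulative = {}
running = 0
for level in levels:
    running += counts[level]
    cumulative[level] = 100 * running / total

# Median category: first level at which at least 50% of cases are reached
median_level = next(level for level in levels if cumulative[level] >= 50)
mode_level = counts.most_common(1)[0][0]

print(cumulative, median_level, mode_level)
```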

15
Q

What are the measures of central tendency for interval variables?
What are the types of skewness (asymmetry)?
What are the measures of dispersion?

A

Measures of Central Tendency:
Mean = average
Median = middle value
Mode = most frequent value

Skewness:
Positive skew (right): mean > median
Negative skew (left): mean < median

Dispersion:
Range = max - min
IQR = 75th percentile - 25th percentile
(Middle half of the data)
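
The mean-versus-median comparison and the range can be sketched as below. The helper `skew_direction` is a hypothetical name and only applies this card's rule of thumb; it is not a formal skewness coefficient. The first dataset is the news-hours example used elsewhere in this deck and the second is invented.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"     # tail stretches to the right
    if mean < median:
        return "negative"     # tail stretches to the left
    return "symmetric"

right_tailed = [2, 4, 4, 6, 14]     # mean 6, median 4
left_tailed = [1, 9, 10, 10, 10]    # hypothetical: mean 8, median 10

value_range = max(right_tailed) - min(right_tailed)   # 14 - 2 = 12

print(skew_direction(right_tailed), skew_direction(left_tailed), value_range)
```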

16
Q

what is standard deviation?

A

Measures how much individual values deviate from the mean
It tells you how “spread out” the data is

Five Steps to Calculate Standard Deviation (sample):
*Step 1: Find each value’s deviation from the mean (value − mean)
*Step 2: Square each deviation (value − mean)²
*Step 3: Sum all the squared deviations
∑(value − mean)²
*Step 4: Find the variance:
Variance = sum / (n − 1)
(we divide by n − 1 for sample data to correct bias)
*Step 5: Take the square root of the variance
SD = √variance

EXAMPLE: Let’s say we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]

s1: Find the mean: mean = (2 + 4 + 4 + 6 + 14)/5 = 6
Then find each value's deviation from the mean:
2 − 6 = −4
4 − 6 = −2
4 − 6 = −2
6 − 6 = 0
14 − 6 = 8
-> deviations from the mean (value − mean) = −4; −2; −2; 0; 8
s2: square each deviation: (−4)² = 16; (−2)² = 4; (−2)² = 4; 0² = 0; 8² = 64
s3: sum of the squared deviations: 16 + 4 + 4 + 0 + 64 = 88
s4: find the variance: sum/(n − 1), so 88/(5 − 1) = 22
s5: SD = √22 ≈ 4.69
-> The SD tells us roughly how far the data points typically lie from the mean (here, about 4.69 hours above or below the mean of 6 hours).
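
The five steps can be reproduced in plain Python and cross-checked against `statistics.stdev`, which also divides by n − 1. This is a sketch on the card's own example data; the variable names are illustrative.

```python
import math
import statistics

hours = [2, 4, 4, 6, 14]

mean = sum(hours) / len(hours)               # 6.0
deviations = [x - mean for x in hours]       # step 1: value - mean
squared = [d ** 2 for d in deviations]       # step 2: square each deviation
sum_sq = sum(squared)                        # step 3: 88.0
variance = sum_sq / (len(hours) - 1)         # step 4: sample variance, 22.0
sd = math.sqrt(variance)                     # step 5: about 4.69

# The standard library's sample standard deviation should agree
library_sd = statistics.stdev(hours)

print(mean, deviations, sum_sq, variance, round(sd, 2))
```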

17
Q

What is a boxplot (box-and-whiskers plot)?
(explained step by step in the next cards)

A

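A compact summary of the distribution of a variable.
-> To create a boxplot of the example dataset [2, 4, 4, 6, 14], we need the quartiles (computed here with the median included in both halves: lower half [2, 4, 4], upper half [4, 6, 14]; other conventions can give different values):
*Q1 (25th percentile) = 4
Q2 (Median) = 4
Q3 (75th percentile) = 6

*IQR = Q3 − Q1 = 6 − 4 = 2

*Now calculate the fences that limit the whiskers:
Lower fence = Q1 − 1.5 × IQR = 4 − 3 = 1
Upper fence = Q3 + 1.5 × IQR = 6 + 3 = 9
The whiskers extend to the adjacent values, i.e. the most extreme data values inside the fences (2 and 6).
👉 Since 14 is above 9, it is an outlier.

*Boxplot Summary:
Min (non-outlier): 2
Q1: 4
Q2 (Median): 4
Q3: 6
Max (non-outlier): 6
Outlier: 14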
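
The boxplot summary above can be reproduced with `statistics.quantiles` (Python 3.8+). Using `method="inclusive"` is an assumption made here because it yields Q1 = 4 and Q3 = 6 for this dataset, matching the card; the default method gives different quartiles for such a small sample.

```python
import statistics

hours = sorted([2, 4, 4, 6, 14])

# "inclusive" is chosen to match the quartiles on this card (assumption)
q1, q2, q3 = statistics.quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in hours if lower_fence <= x <= upper_fence]
outliers = [x for x in hours if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme values that are still inside the fences
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr, lower_fence, upper_fence, whisker_low, whisker_high, outliers)
```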

18
Q

what are quartiles

A

Quartiles divide a sorted dataset into four equal parts. They are especially useful for describing the spread of the data and spotting outliers.

*Q1 (First Quartile / 25th percentile)
It marks the point below which 25% of the data falls.
*Q2 (Second Quartile / Median / 50th percentile)
This is the middle value of your data.
It splits the dataset in half — 50% of the values are below it, 50% above.
*Q3 (Third Quartile / 75th percentile)
This is the median of the upper half of your data.
It marks the point below which 75% of the data falls.

=> Example: we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]

*Q1: Lower half of the data (before the median) = [2, 4]
The median of [2, 4] is 3
*Q2: Since there are 5 values, the middle one is the 3rd: Q2 = 4
*Q3: Upper half of the data (after the median) = [6, 14]
The median of [6, 14] is 10
(This convention leaves the median out of both halves. Other conventions exist and can give different quartiles for small datasets.)
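
A minimal sketch of the "median of each half" procedure used in this example. The function name `quartiles_median_of_halves` is hypothetical, and leaving the middle value out of both halves when the count is odd is the convention this card follows, not the only one in use.

```python
import statistics

def quartiles_median_of_halves(values):
    """Q1, Q2, Q3 as medians of the lower half, all data, and upper half.

    For an odd number of values the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

q1, q2, q3 = quartiles_median_of_halves([2, 4, 4, 6, 14])
print(q1, q2, q3)
```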

19
Q

what is IQR

A

Interquartile Range
IQR = Q3 − Q1
→ This gives us the range of the middle 50% of the data
In our example:
IQR = 10 − 3 = 7
It tells you how spread out the central portion of the data is.

20
Q

How can we find outliers with quartiles?

A

We use the IQR rule:
Lower Bound = Q1 − 1.5 × IQR = 3 − 10.5 = −7.5
Upper Bound = Q3 + 1.5 × IQR = 10 + 10.5 = 20.5

Any value outside this range is an outlier. With these bounds, every value in [2, 4, 4, 6, 14] lies between −7.5 and 20.5, so none is flagged.
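
The IQR rule can be sketched as two small helpers. The names `iqr_bounds` and `find_outliers` are illustrative, and Q1 = 3 and Q3 = 10 are the quartiles used on this card for the dataset [2, 4, 4, 6, 14].

```python
def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper bounds of the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values falling outside the IQR bounds."""
    low, high = iqr_bounds(q1, q3)
    return [x for x in values if x < low or x > high]

low, high = iqr_bounds(3, 10)
outliers = find_outliers([2, 4, 4, 6, 14], 3, 10)

print(low, high, outliers)
```

With these particular quartiles the bounds are wide enough that 14 is not flagged. A different quartile convention gives a narrower IQR and can change that result.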

21
Q

Histogram vs Bar Chart (data type; bars touch?; purpose)

A

Histogram vs Bar Chart:
Data Type: Interval / numerical vs Nominal / categorical
X-axis: Continuous scale (numbers) vs Discrete categories (labels)
Bars Touch?: Yes (shows continuity) vs No (categories are separate)
Purpose: Show distribution of values vs Compare category frequencies

22
Q

What is a kernel density plot?

A

A smooth curve that shows the estimated distribution of a variable (smoothed version of a histogram)
-> Helps visualize where values are concentrated over a range

📍Why use it?
It removes the “bin” problem in histograms
Better at showing the shape of the distribution, especially for large data sets
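
To make the "smoothed histogram" idea concrete, here is a rough sketch of a Gaussian kernel density estimate in plain Python. The function name `gaussian_kde` and the bandwidth of 1.5 are assumptions for illustration; the bandwidth is a smoothing choice the analyst still has to make, much like a bin width.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(data)
    norm = math.sqrt(2 * math.pi)

    def density(x):
        total = 0.0
        for xi in data:
            z = (x - xi) / bandwidth
            total += math.exp(-0.5 * z * z) / norm   # Gaussian kernel
        return total / (n * bandwidth)

    return density

hours = [2, 4, 4, 6, 14]
density = gaussian_kde(hours, bandwidth=1.5)   # bandwidth chosen arbitrarily

# Density is higher where values cluster (around 4) than in the gap (around 10)
print(density(4) > density(10))
```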

23
Q

What is the “bin” problem of histograms?

A

A histogram groups data into bins (intervals). Each bin covers a range of values, and the height of the bar shows how many values fall into that range.
-> problem:
*Binning is arbitrary: you choose how wide the bins are (e.g., 0–5, 6–10… or 0–10, 11–20…).
*Different bin widths can change how the data looks. They can:
-Hide important patterns (if bins are too wide)
-Exaggerate randomness or noise (if bins are too narrow)
-Make a skewed distribution look symmetric, or vice versa
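
The effect of bin width can be seen by counting the same values with two different widths. The helper `bin_counts` and the data are hypothetical; the values are chosen so that they form visible clusters.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per bin [start, start + width), keyed by bin start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data with clusters around 1-4, 11-14 and 21-22
values = [1, 2, 3, 4, 11, 12, 13, 14, 21, 22]

medium = bin_counts(values, 5)    # {0: 4, 10: 4, 20: 2}
wide = bin_counts(values, 20)     # {0: 8, 20: 2}

print(medium)
print(wide)
```

With a width of 5 the empty bins between clusters stay visible (no values fall in 5–9 or 15–19), while a width of 20 merges the first two clusters into a single bar.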

24
Q

How do the mean and median behave with outliers?

A

Median: resistant to outliers (not affected much by extreme values)

Mean: sensitive to outliers; one very high or very low value can pull it up or down
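
A quick check of this behaviour, using the news-hours example from earlier cards with and without its largest value (14). The variable names are illustrative.

```python
import statistics

without_outlier = [2, 4, 4, 6]
with_outlier = without_outlier + [14]

mean_before = statistics.mean(without_outlier)       # 4
mean_after = statistics.mean(with_outlier)           # 6
median_before = statistics.median(without_outlier)   # 4.0
median_after = statistics.median(with_outlier)       # 4

print(mean_before, mean_after, median_before, median_after)
```

Adding the single value 14 moves the mean from 4 to 6, while the median stays at 4.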

25
Q

What is skewness?

A

Skewness tells us whether a distribution is symmetrical or lopsided.
Skew types:
*Symmetrical
1. Shape: balanced
2. Relationship: mean ≈ median
3. Example: test scores in a fair class
*Positive skew (right skew)
1. Shape: tail stretches to the right
2. Relationship: mean > median
3. Example: income (a few billionaires pull the mean up)
*Negative skew (left skew)
1. Shape: tail stretches to the left
2. Relationship: mean < median
3. Example: age at retirement (most people retire at 60–65, but some much earlier)

26
Q

Definition of summary statistics

A

Summary statistics are numbers that give you a quick overview of a set of data. They summarize important features like the average value, how spread out the data is, and what the overall shape looks like.

27
Q

Mean, mode, or median: which one for which type of variable?

A

-Nominal variables: mode (because we can only count how often a category appears)
-Ordinal variables: median and sometimes mode (because you can rank the data, but the distance between levels isn't always equal, so the mean isn't reliable)
-Interval/ratio variables: mean (and also standard deviation and median)

28
Q

Why use a histogram, a density plot, a boxplot, a bar chart, a line chart, or a cross-tabulation?

A

*Histogram: shows how a continuous variable is distributed by grouping data into bins.
*Density plot: smooth curve to visualize the probability distribution of a continuous variable.
*Boxplot: summarizes the spread and identifies outliers in a numeric variable.
*Bar chart: compares frequencies or values across categories.
*Line chart: shows how a variable changes across time or an ordered scale.
*Cross-tabulation: shows how cases are distributed across categories of a dependent variable (DV) for different values of an independent variable (IV).
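
As a rough illustration of a cross-tabulation, the sketch below counts cases per (IV, DV) pair and converts the counts into percentages within each IV category. The variables ("region" as IV, "vote" as DV) and all cases are invented for the example.

```python
from collections import Counter

# Hypothetical cases as (IV, DV) pairs: region and vote
cases = (
    [("urban", "yes")] * 3 + [("urban", "no")] * 1 +
    [("rural", "yes")] * 1 + [("rural", "no")] * 3
)

table = Counter(cases)                        # count per (IV, DV) cell
iv_totals = Counter(iv for iv, _ in cases)    # number of cases per IV value

# Percentage of each DV category within each IV value
pct_within_iv = {
    (iv, dv): 100 * n / iv_totals[iv] for (iv, dv), n in table.items()
}

for cell, pct in sorted(pct_within_iv.items()):
    print(cell, pct)
```

Comparing percentages within each IV value (75% "yes" among urban cases versus 25% among rural cases in this made-up sample) is what lets the table show how the DV differs across values of the IV.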