1. Descriptive Statistics Flashcards

(33 cards)

1
Q

variable def

A

An empirical measurement of a characteristic (something we want to observe or measure).
+ Every variable has a name and at least two values.

to find it:
-Name = what are we measuring? (e.g., “Country”, “Population”)
-Values = the actual data we observe for different cases (e.g., “Germany” or “83.2 million”)

2
Q

def unit of analysis

A

who or what you are analyzing.
It’s the entity you want to describe using variables.

Examples:
-comparing countries → unit of analysis = country
-analyzing people → unit of analysis = individual

3
Q

def levels of measurement

A

The ways we can categorize or quantify variables. They matter because different statistical methods apply depending on the level.

4
Q

what are the levels of Measurement in Statistics

A

nominal-level variables; ordinal-level variables; interval-level variables; ratio-level variables

5
Q

explain Nominal-level variables

A

📌 Least precise
What it tells us: Differences between units only (categories)
Examples: Country name, religion, political party, gender
Numeric codes: Used only as labels, not quantities (e.g., 1 = Germany, 2 = France)
Key properties:
-Categories only
-Mutually exclusive (each unit fits into one category)
-Exhaustive (all possible options are listed)
✅ NO order or ranking

6
Q

ordinal-level variables

A

📌 More precise than nominal
What it tells us: Relative ranking between units of analysis
Examples: Level of democracy (low, medium, high), education level, military power rank
Numeric codes: Reflect order (e.g., 1 = low, 2 = medium, 3 = high)
Key properties:
-Categories with rank order
-But we don’t know the exact distance between them
✅ Ranking possible, but intervals are unknown

7
Q

interval-level variables

A

📌 Even more precise
What it tells us: Exact differences between values
Examples: Temperature in Celsius, years (e.g., 1945, 2001), IQ scores
Key properties:
-Numeric values with equal spacing between them
-Can calculate differences (e.g., 2020 − 2000 = 20 years)
-❌ No true zero (0 doesn’t mean absence)
✅ Addition/subtraction is valid, but not multiplication/division (e.g., 20°C is not “twice as hot” as 10°C)

8
Q

ratio-level variable

A

📌 Most precise
What it tells us: Exact values, with a true zero
Examples: Population size, GDP, number of conflicts, age
Key properties:
-All features of interval-level
-True zero: 0 = none of the variable
-Can do all arithmetic operations (e.g., Country A has 2× the GDP of Country B)
✅ Discrete (whole numbers) or continuous (decimals)

9
Q

Def descriptive statistics

A

Help us summarize and understand datasets by describing two things: measures of central tendency and measures of dispersion (or spread)

10
Q

measures of central tendency

A

tools used to find the center or typical value in a dataset
There are three main measures:
*mode (the value that appears most often in a dataset; e.g., for the dataset blue, blue, red, the mode is blue because it appears most often)
*median (the middle value when the data is sorted in order; e.g., for the dataset 1, 3, 5, 8, 10 the median is 5 (2 values below, 2 values above), but for the dataset 1, 3, 5, 8 it is (3+5)/2 = 4)
*mean (add up all values, then divide by the number of values)

!!!A dataset can have more than one mode (this is called bimodal or multimodal).
If all values occur the same number of times, there’s no mode!!!
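
A minimal sketch of the three measures, using Python's standard `statistics` module and the small datasets from this card. The variable names and the bimodal list are only illustrative.

```python
from statistics import mean, median, multimode

odd = [1, 3, 5, 8, 10]
even = [1, 3, 5, 8]
colors = ["blue", "blue", "red"]

median_odd = median(odd)     # middle value of the sorted data
median_even = median(even)   # average of the two middle values: (3 + 5) / 2
mean_odd = mean(odd)         # (1 + 3 + 5 + 8 + 10) / 5
modes = multimode(colors)    # list holding the most frequent value(s)

# An illustrative bimodal dataset: 1 and 2 both appear twice.
two_modes = multimode([1, 1, 2, 2, 3])
```

`multimode` returns a list, so a bimodal dataset simply gives a list with two entries.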

11
Q

what is the frequency distribution of nominal variables

A

(e.g., 30 married people out of 100)
A table that shows how often each category appears:
*Raw frequency = actual number of cases (e.g., 30 people are married)
*Total frequency = total number of all cases (e.g., 100)
*Proportion = raw frequency / total frequency (proportions add up to 1) (e.g., 30/100 = 0.3)
*Percentage = proportion × 100 (percentages add up to 100%) (e.g., 30/100 × 100 = 30%)
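
A small sketch of such a table in Python. Only the "30 married out of 100" figure comes from the card; the other two categories and their counts are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: only the 30 married cases come from the card.
statuses = ["married"] * 30 + ["single"] * 50 + ["divorced"] * 20

raw = Counter(statuses)                      # raw frequency per category
total = sum(raw.values())                    # total frequency
proportions = {cat: n / total for cat, n in raw.items()}
percentages = {cat: 100 * n / total for cat, n in raw.items()}

for cat in raw:
    print(cat, raw[cat], proportions[cat], f"{percentages[cat]:.0f}%")
```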

12
Q

def bar chart

A

Used for nominal-level variables (or ordinal ones, if you just want to compare categories)
X-axis = categories (labels/values)
Y-axis = frequencies or percentages
Height of each bar = how many cases fall into each category

13
Q

what are the 3 types of dispersion?

A

(Variation in responses)
*Greatest dispersion = all categories equally common
-> every category has the same number of cases; there is no dominant category
*Least dispersion = all cases in one category
-> everyone picked the same category, so variation is minimal
*Variation ratio = proportion of cases not in the mode

Useful to gauge how spread out responses are
-> this is for categorical variables; for numerical variables, you use the standard deviation to see how spread out the data is
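
The variation ratio can be sketched in a few lines. The function name `variation_ratio` and the sample answers are assumptions for illustration, not part of the card.

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that are NOT in the modal (most common) category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)

# Least dispersion: everyone picked the same category.
all_same = variation_ratio(["yes"] * 10)
# One dominant category: 9 of 10 cases are in the mode.
one_dominant = variation_ratio(["a"] * 9 + ["b"])
# Greatest dispersion for four categories: all equally common.
evenly_spread = variation_ratio(["a", "b", "c", "d"])
```

With four equally common categories the ratio is 0.75, the largest value four categories allow; with everyone in one category it is 0.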

14
Q

what are the key measures of ordinal variables?
what are the 2 possible measures of dispersion?
what are the heuristics (mental shortcuts for solving problems in a quick way)?

A

*Median = 50th percentile (middle value)
*Mode = most frequent value
*Cumulative percentage = % at or below a given value
*Percentile = value below which a certain % of data falls

-Range = distance between lowest and highest categories
-Interquartile Range (IQR) = middle 50% spread

High dispersion = responses spread across categories
If mode and median are far apart, dispersion is likely high
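
A rough sketch of how cumulative percentages lead to the median category of an ordinal variable. The categories and counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical ordinal variable, categories listed from lowest to highest.
categories = ["low", "medium", "high"]
counts = [20, 50, 30]

total = sum(counts)
percentages = [100 * c / total for c in counts]
cumulative = list(accumulate(percentages))   # % at or below each category

# The median (50th percentile) is the first category whose
# cumulative percentage reaches 50%.
median_category = next(
    cat for cat, cum in zip(categories, cumulative) if cum >= 50
)
```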

15
Q

what are the measures of central tendency for interval variables?
what are the types of skewness (asymmetry)?
what are the measures of dispersion?

A

Measures of Central Tendency:
Mean = average
Median = middle value
Mode = most frequent value

Skewness:
Positive skew (right): mean > median
Negative skew (left): mean < median

Dispersion:
Range = max - min
IQR = 75th percentile - 25th percentile
(Middle half of the data)
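
The mean-versus-median rule of thumb for skew can be sketched as follows. The helper name `skew_direction` and the sample lists are illustrative assumptions; this is a quick heuristic, not a formal skewness coefficient.

```python
from statistics import mean, median

def skew_direction(values):
    """Compare mean and median, as in the rule of thumb on this card."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive (right) skew"
    if m < med:
        return "negative (left) skew"
    return "no skew indicated"

def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)

right = skew_direction([2, 4, 4, 6, 14])     # mean 6 > median 4
left = skew_direction([1, 11, 13, 13, 14])   # mean 10.4 < median 13
spread = value_range([2, 4, 4, 6, 14])       # 14 - 2
```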

16
Q

what Measures of Dispersion for which variable level?

A

nominal (only categories, so usually none; possibly the variation ratio)
ordinal (range or IQR)
interval (range, IQR, standard deviation, variance)

17
Q

when is dispersion low or high, depending on the level of the variable?

A

*Nominal: low dispersion when one mode is clearly dominant (if most of the data fall into a single category, e.g., 90% identify with one religion, there is little variation among cases: almost everyone is in the same group)

*Ordinal: low dispersion when responses are clustered around the median rank (if most responses are near the same rank, e.g., mostly “satisfied” or “neutral”, the range of ranks is small and the data is concentrated)

*Interval: low dispersion when values are clustered around the mean (when values don’t stray far from the mean, the typical distance from it (the standard deviation) is small, so the data points are close together)

18
Q

what about skew, depending on the level?

A

nominal: not applicable (categories have no order)
ordinal: can be + or -
interval: can be + or - AND quantifiable

19
Q

what is standard deviation?

A

Measures how much individual values deviate from the mean
It tells you how “spread out” the data is

Five Steps to Calculate Standard Deviation (sample):
*Step 1: Find each value’s deviation from the mean (value − mean)
*Step 2: Square each deviation (value − mean)²
*Step 3: Sum all the squared deviations
∑(value − mean)²
*Step 4: Find the variance:
Variance = sum / (n − 1)
(we divide by n − 1 for sample data to correct bias)
*Step 5: Take the square root of the variance
SD = √variance

EXAMPLE: Let’s say we surveyed 5 people and asked how many hours of political news they watch per week. Their responses were: [2, 4, 4, 6, 14]

s1: Find the Mean (average) mean= (2+4+4+6+14)/5=6
AND Find Deviations from the Mean:
2 − 6 = −4
4 − 6 = −2
4 − 6 = −2
6 − 6 = 0
14 − 6 = 8
-> deviations from the mean (value − mean) = −4; −2; −2; 0; 8
s2: square each deviation: (−4)²=16; (−2)²=4; (−2)²=4; 0²=0; 8²=64
s3: sum of the squared deviations: 16+4+4+0+64=88
s4: find the variance: sum/(n−1), so 88/(5−1) = 22
s5: SD = √22 ≈ 4.69
-> this tells us roughly how far, on average, the data points lie from the mean (viewing time typically differs from the mean of 6 hours by about 4.69 hours).
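
The five steps can be followed line by line in Python. This is just a sketch with illustrative variable names; the data are the survey responses from the example.

```python
import math

hours = [2, 4, 4, 6, 14]
n = len(hours)
mean = sum(hours) / n                        # 30 / 5

deviations = [x - mean for x in hours]       # step 1: value - mean
squared = [d ** 2 for d in deviations]       # step 2: square each deviation
total = sum(squared)                         # step 3: sum of squares
variance = total / (n - 1)                   # step 4: sample variance
sd = math.sqrt(variance)                     # step 5: square root

print(mean, total, variance, round(sd, 2))
```

The standard library's `statistics.stdev(hours)` computes the same sample standard deviation in one call.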

20
Q

what is a Boxplot (Box-and-Whiskers Plot)
explained step by step in the next cards

A

A compact summary of the distribution of a variable
-> to create a boxplot, we need:
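*Q1 (25th percentile) = 4
Q2 (Median) = 4
Q3 (75th percentile) = 6
(these values are for the data [2, 4, 4, 6, 14]; quartiles depend on the convention used, and these are consistent with counting the median in both halves, while the quartiles card leaves the median out and gets Q1 = 3, Q3 = 10)

*IQR = Q3 − Q1 = 6 − 4 = 2

*Now calculate the fences (the limits for the whiskers):
Lower fence = Q1 − 1.5 × IQR = 4 − 3 = 1
Upper fence = Q3 + 1.5 × IQR = 6 + 3 = 9
The whiskers end at the adjacent values: the lowest and highest data points inside the fences (here 2 and 6).
👉 Since 14 is above 9, it is an outlier.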

*Boxplot Summary:
Min (non-outlier): 2
Q1: 4
Q2 (Median): 4
Q3: 6
Max (non-outlier): 6
Outlier: 14

21
Q

what are quartiles

A

they help describe how a data set is divided into four equal parts when sorted in order. They are especially useful in identifying the spread of the data and spotting outliers.

*Q1 (First Quartile / 25th percentile)
It marks the point below which 25% of the data falls.
*Q2 (Second Quartile / Median / 50th percentile)
This is the middle value of your data.
It splits the dataset in half — 50% of the values are below it, 50% above.
*Q3 (Third Quartile / 75th percentile)
This is the median of the upper half of your data.
It marks the point below which 75% of the data falls.

=> in the example where we surveyed 5 people about how many hours of political news they watch per week, the responses were: [2, 4, 4, 6, 14]

*Q1:Lower half of the data (before the median) = [2, 4]
The median of [2, 4] is 3
*Q2: Since there are 5 values, the middle one is the 3rd: Q2=4
*Q3: Upper half of the data (after the median) = [6, 14]
The median of [6, 14] is 10
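
Quartiles depend on the convention used. As a sketch, Python's `statistics.quantiles` offers two methods; for this particular dataset the "exclusive" method happens to reproduce the values above (3, 4, 10), while the "inclusive" method gives the values used on the boxplot card (4, 4, 6). This agreement is specific to this example and is not a general equivalence with the "median of each half" rule.

```python
from statistics import quantiles

hours = [2, 4, 4, 6, 14]

# n=4 asks for the three cut points that split the data into quarters.
q_exclusive = quantiles(hours, n=4, method="exclusive")
q_inclusive = quantiles(hours, n=4, method="inclusive")

print(q_exclusive)   # Q1, Q2, Q3 under the exclusive method
print(q_inclusive)   # Q1, Q2, Q3 under the inclusive method
```

When comparing results across textbooks or software, it helps to check which convention is in use before concluding that a value is wrong.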

22
Q

what is IQR

A

Interquartile Range
IQR = Q3 − Q1
→ This gives us the range of the middle 50% of the data
In our example:
IQR = 10 − 3 = 7
It tells you how spread out the central portion of the data is.

23
Q

how can we find outliers (extreme values) with quartiles?

A

We use the IQR rule:
Lower Bound = Q1 − 1.5 × IQR = 3 − 1.5×7 = −7.5
Upper Bound = Q3 + 1.5 × IQR = 10 + 10.5 = 20.5

Any value outside this range is an outlier.
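
A sketch of the IQR rule as a small function. The name `iqr_outliers` and its parameters are hypothetical; the quartiles are passed in by hand so that both sets of quartiles used in these cards can be tried.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return (lower bound, upper bound, outliers) under the k × IQR rule."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

hours = [2, 4, 4, 6, 14]

# Quartiles from this card: Q1 = 3, Q3 = 10.
with_q3_10 = iqr_outliers(hours, q1=3, q3=10)
# Quartiles from the boxplot card: Q1 = 4, Q3 = 6.
with_q3_6 = iqr_outliers(hours, q1=4, q3=6)
```

With Q1 = 3 and Q3 = 10 the bounds are −7.5 and 20.5, so 14 is not flagged; with Q1 = 4 and Q3 = 6 the bounds are 1 and 9, so 14 is flagged. The verdict depends on the quartile convention.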

24
Q

Histogram vs Bar Chart (data type; bars touch?; purpose)

A

Histogram vs Bar Chart:
Data Type: numerical vs categorical
X-axis: Continuous scale (numbers) vs Discrete categories (labels)
Bars Touch?: Yes (shows continuity) vs No (categories are separate)
Purpose: Show distribution of values vs Compare category frequencies

25
Q

what is the Kernel Density Plot

A

A smooth curve that shows the estimated distribution of a variable (a smoothed version of a histogram)
-> Helps visualize where values are concentrated over a range
📍Why use it?
-It removes the “bin” problem in histograms
-Better at showing the shape of the distribution, especially for large data sets

26
Q

what is the "bin" problem of histograms

A

A histogram groups data into bins (intervals). Each bin covers a range of values, and the height of the bar shows how many values fall into that range.
-> problem:
*Binning is arbitrary: you choose how wide the bins are (e.g., 0–5, 6–10… or 0–10, 11–20…).
*Different bin widths can change how the data looks. They can:
-Hide important patterns (if bins are too wide)
-Exaggerate randomness or noise (if bins are too narrow)
-Make a skewed distribution look symmetric, or vice versa

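
To see the effect of bin width, here is a minimal sketch that counts the same values with two different widths. The helper `bin_counts` and the data are made up for illustration.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per bin [start, start + width), keyed by bin start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data with a cluster near 2 and another near 11.
values = [1, 2, 2, 3, 9, 10, 11, 12, 19]

narrow = bin_counts(values, 2)    # many small bins: looks noisy
wide = bin_counts(values, 10)     # two big bins: detail is hidden
```

With width 10 the two bins hold 5 and 4 values and the distribution looks almost flat; with width 2 the same data shows separate small clusters and gaps.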
27
Q

how do the mean and median behave with outliers

A

Median: Resistant to outliers (not affected much by extreme values)
Mean: Sensitive to outliers (one very high or very low value can pull it up or down)

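
A quick check of this with the news-watching data from the standard deviation card, once with and once without the extreme value 14 (a sketch; variable names are illustrative).

```python
from statistics import mean, median

with_outlier = [2, 4, 4, 6, 14]
without_outlier = [2, 4, 4, 6]

mean_with = mean(with_outlier)           # pulled up by the value 14
mean_without = mean(without_outlier)
median_with = median(with_outlier)       # stays at the middle value
median_without = median(without_outlier)
```

Dropping the single value 14 moves the mean from 6 to 4, while the median stays at 4.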
28
Q

what is skewness

A

Skewness tells us whether a distribution is symmetrical or lopsided
Skew type:
*Symmetrical
1. shape: Balanced
2. relationship: Mean ≈ Median
3. ex: Test scores in a fair class
*Positive skew (right skew)
1. shape: Tail stretches to the right
2. relationship: Mean > Median
3. ex: Income (a few billionaires pull the mean up)
*Negative skew (left skew)
1. shape: Tail stretches to the left
2. relationship: Mean < Median
3. ex: Age at retirement (most people retire at 60–65, but some much earlier)

29
Q

summary statistics def

A

Summary statistics are numbers that give you a quick overview of a set of data. They summarize important features like the average value, how spread out the data is, and what the overall shape looks like.

30
Q

mean/mode/median for which variable

A

-nominal variables: mode (because we can only count how often a category appears)
-ordinal variables: median and sometimes mode (because you can rank the data, but the difference between levels isn’t always equal, so the mean isn't reliable)
-interval/ratio variables: mean (and also standard deviation and median)

31
Q

why use a histogram; a density plot; a boxplot; a bar chart; a cross-tabulation

A

Histogram: Shows how a continuous variable is distributed by grouping data into bins.
Density Plot: Smooth curve to visualize the probability distribution of a continuous variable.
Boxplot: Summarizes the spread and identifies outliers in a numeric variable.
Bar Chart: Compares frequencies or values across categories.
Cross-tabulation: Shows how cases are distributed across categories of a dependent variable (DV) for different values of an independent variable (IV).

32
Q

def population variance

A

A measure of how spread out the values in an entire population are around the population mean. It is the average of the squared deviations of the individual data points from the mean, and it is denoted by the symbol σ² (sigma squared).

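
As a sketch, population and sample variance differ only in the divisor (n versus n − 1). The example below treats the five survey responses from the standard deviation card as if they were a whole population, which is an assumption made purely for illustration.

```python
from statistics import pvariance, variance

hours = [2, 4, 4, 6, 14]
n = len(hours)
mu = sum(hours) / n

# Population variance by hand: average of the squared deviations.
sigma_squared = sum((x - mu) ** 2 for x in hours) / n

# Library equivalents: pvariance divides by n, variance by n - 1.
population_var = pvariance(hours)
sample_var = variance(hours)
```

Here the sum of squared deviations is 88, so the population variance is 88/5 = 17.6 and the sample variance is 88/4 = 22.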
33
Q

levels of measurement: which tools for which level of variable

A

1/ nominal variables:
-frequency distribution
-bar chart
-dispersion (possible but not common for nominal variables: variation ratio)
2/ ordinal variables:
-cumulative percentage (running total of percentages across the categories of a variable, adding up from the lowest to the highest category)
ex:
response level | frequency | % of respondents | cumulative %
Very Low Trust | 50 | 10% | 10%
Low Trust | 100 | 20% | 30% (10+20)
...
-percentile
-dispersion: range and IQR
-bar chart
-bimodal distribution: a frequency distribution having two different values that are heavily populated with cases
-unimodal distribution: a distribution with only one peak and one clear mode
-mode or median
3/ interval variables:
-mode, median
-skewness
-measures of dispersion: IQR and range (maximum actual value minus the minimum actual value)
-box plot
-density plot
-histogram
4/