Lecture 17 ARM Flashcards
Visualising Anthropological Data - 4/7 (22 cards)
Inter-Quartile Range (IQR)
Definition: The range of the middle half of all the observations
Less influence of outliers - more robust
More information about the shape
Calculation: Q3 - Q1 (75th percentile - 25th percentile)
Gets rid of the outer values
Eg important when studying inequality
Eg exam question: What is interquartile range used for?
Answer: To describe the spread of the middle half of the data while setting aside extreme values such as outliers when studying a community.
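A minimal sketch of the Q3 - Q1 calculation, using Python's standard statistics module. The household sizes are made up for illustration, and the "inclusive" quartile method is an assumption: other quartile conventions give slightly different values.

```python
import statistics

# Hypothetical household sizes; the 15 is a deliberate outlier.
household_sizes = [2, 3, 3, 4, 4, 5, 5, 6, 15]

# quantiles(n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(household_sizes, n=4, method="inclusive")

iqr = q3 - q1                                            # spread of the middle half
data_range = max(household_sizes) - min(household_sizes)  # spread of everything

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Range:", data_range)
```

The single large household stretches the range but leaves the IQR untouched, which is what "more robust" means here.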
Variance 1 (yap from kevin)
To understand how varied the data is - only used for interval/ratio - useful for normally distributed data.
Gives an indication of how far, on average, the values are from the central tendency (mean, usually).
Why do we need to know variance?
Provides a measure of variability that takes into account the distance of each value from the mean. By squaring the deviations, variance gives more weight to larger differences - making it sensitive to outliers.
Often used in intermediate steps to get standard deviation.
Standard deviation
The square root of the variance.
Measures the typical distance of each data point from the mean - but expressed in the same units as the original data. This makes it a more interpretable measure of variability. It provides a clear indication of how much the values deviate from the mean.
Differs from IQR - standard deviation IS influenced by outliers. Weakness or strength depending on whether the outliers are measurement errors or meaningful data.
Large SD: data is widely scattered
Small SD: data clustered tightly around the mean.
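A worked sketch of the steps from variance to standard deviation. The ages are invented for illustration and are treated as a whole population, so the sum of squared deviations is divided by n.

```python
import math

# Hypothetical ages at first marriage, treated as a whole population.
ages = [18, 20, 22, 24, 26]

mean = sum(ages) / len(ages)
squared_devs = [(x - mean) ** 2 for x in ages]   # squaring weights big gaps more
variance = sum(squared_devs) / len(ages)         # in squared years
sd = math.sqrt(variance)                         # back in years

# Same data with one extreme value swapped in, to show outlier sensitivity.
ages_out = [18, 20, 22, 24, 46]
mean_out = sum(ages_out) / len(ages_out)
variance_out = sum((x - mean_out) ** 2 for x in ages_out) / len(ages_out)
sd_out = math.sqrt(variance_out)

print("variance:", variance, "sd:", round(sd, 2))
print("with outlier - variance:", variance_out, "sd:", round(sd_out, 2))
```

One changed value moves the variance from 8 to 104, which is why it matters whether an outlier is a measurement error or meaningful data.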
Variance and Standard deviation for anthropology
Used in various subfields
Eg linguistic anthro - measure variability of linguistic structures
- Quantify and compare variability in data - reveals more patterns, nuances etc
Variance 2 (slide)
Definition: Refers to the spread or dispersion of data values in a set - how much they differ from each other
Calculation: Average of squared differences from the mean
If all the data points are almost the same - variation is low. If they vary widely, variation is high
In everyday terms: The degree of difference among observations (also called VARIABILITY or SPREAD)
Anthro - finding differences and spread provides context that average / centrality measures may overlook or hide.
Why variation matters in anthropology
- Reveals complexity - shows diversity within groups, showing that populations are not homogeneous
- Cultural and individual differences - anthros study cultural complexity, eg variation in practices, beliefs
- Avoids misleading averages - focusing only on averages can hide important differences - eg inequalities or subgroup patterns - hence it provides context
Range
Definition: The simplest measure of variation - the difference between the highest and lowest values in a dataset. Calculation: max value - min value
Sensitive to outliers!
Insight: Gives a quick sense of span, but does not tell us about distribution in between - can be affected by extreme values (outliers)
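A small sketch of why the range says nothing about the distribution in between. Both datasets are invented: they share the same min and max, but the values in the middle are arranged very differently.

```python
import statistics

# Two hypothetical datasets with the same lowest and highest values.
clustered = [1, 5, 5, 5, 5, 5, 9]
spread_out = [1, 2, 4, 5, 6, 8, 9]

range_clustered = max(clustered) - min(clustered)
range_spread = max(spread_out) - min(spread_out)

# Population standard deviation, used here only to show the difference.
sd_clustered = statistics.pstdev(clustered)
sd_spread = statistics.pstdev(spread_out)

print("ranges:", range_clustered, range_spread)
print("SDs:", round(sd_clustered, 2), round(sd_spread, 2))
```

The ranges are identical, yet the standard deviations differ, so the range alone would hide how tightly the first dataset clusters around its centre.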
Variance and sd in the sample versus in the population
s = sample standard deviation
s^2 = sample variance
σ = population sd
σ^2 = population variance
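A sketch of the practical difference between the two sets of symbols, using Python's statistics module. The card itself does not state the formulas; the usual convention, assumed here, is that the population variance divides by n and the sample variance divides by n - 1. The heights are made up.

```python
import statistics

# Hypothetical heights in cm.
heights_cm = [150, 160, 170, 180]

# Population versions: divide the sum of squared deviations by n.
pop_var = statistics.pvariance(heights_cm)    # sigma^2
pop_sd = statistics.pstdev(heights_cm)        # sigma

# Sample versions: divide by n - 1 instead.
sample_var = statistics.variance(heights_cm)  # s^2
sample_sd = statistics.stdev(heights_cm)      # s

print("population:", pop_var, round(pop_sd, 2))
print("sample:", round(sample_var, 2), round(sample_sd, 2))
```

The sample values come out slightly larger, and the gap shrinks as n grows.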
Understanding Outliers
Definition: An unusual data point that lies far outside the range of the majority of the data - much higher or lower than the rest of the observations
Impact: Can skew averages and distort analysis - but can also be an important signal - maybe an error or a meaningful exception
Why visualize data?
- Reveal patterns: Display distribution of data quickly (skewed, symmetric), peaks (common values, eg unimodal, bimodal, multimodal), clusters and outliers
- Complement numbers: Averages and SD give summary - charts show detail - eg bimodality or skews
- Accessibility: Many people can grasp a chart at a glance
- Anthropological insight: Helps spot cultural or biological patterns (diversity, anomalies) that merit further inquiry (interdisciplinary audiences)
- Keep it clear: simple, direct, effective
Histogram
Definition: A histogram plots a numeric variable's distribution as a series of bars (no gaps between the bars)
x-axis: data value ranges (bins, eg 0- … any number)
y-axis: frequency (count of observations in each bar/bin)
Purpose: reveal SHAPE of data / distribution
- where is the data concentrated
- spread: are values tightly clustered or widely dispersed?
- skewness: longer tail on left or right? multiple peaks? outliers?
Anthropological uses
- Eg distribution of household sizes in villages…
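A rough sketch of what a histogram does under the hood: sort values into bins and count them. The household sizes and the bin width of 2 are assumptions chosen for illustration, and the bars are printed as text rather than drawn.

```python
from collections import Counter

# Hypothetical household sizes from one village.
household_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 11]
bin_width = 2  # assumed; a different width changes the apparent shape

# Each value goes into the bin that starts at the multiple of bin_width below it.
counts = Counter((size // bin_width) * bin_width for size in household_sizes)

# x-axis: bin ranges, y-axis: how many observations fall in each bin.
for start in range(0, max(household_sizes) + 1, bin_width):
    label = f"{start:>2}-{start + bin_width - 1:<2}"
    print(f"{label} | {'#' * counts[start]}")
```

The printout shows a peak around 4-5 people, an empty bin at 8-9, and one isolated household at 10-11, so concentration, spread and a possible outlier are all visible at once.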
Central tendency and variation in histogram
Central values show up as the tallest bar region (mode) - variability is indicated by the width or spread of the bars (range of bins with counts).
Examples of histograms
- Symmetric / unimodal
- Skew left
- Skew right
- Uniform
- Bimodal
- Multimodal
Common pitfall histograms
- Misreading bin ranges - each bar covers A RANGE - not a single value! (that would be a bar chart)
- Comparing histograms with different bin widths without caution (can distort perception of shape)
- Confusing histogram (numeric bins) with bar chart (categorical) - histograms are for continuous data
Boxplots
Also known as the box-and-whisker plot - shows median, quartiles and outliers
Box: Spans the IQR (aka Q3 - Q1), the middle 50% of the data
Median: Line inside the box (Q2, the 50th percentile)
Whiskers: Extend from the box to the furthest data points within 1.5 x IQR of the box edges (for normally distributed data that is about 2.7 SD from the mean)
Outliers: Points beyond the whiskers
See illustration
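A sketch of the numbers behind a boxplot, computed by hand with the 1.5 x IQR rule described on this card. The incomes are invented, and the "inclusive" quartile method is an assumption: plotting tools may use a slightly different quartile convention.

```python
import statistics

# Hypothetical monthly incomes; 60 is a deliberate extreme value.
incomes = [12, 15, 16, 18, 19, 20, 22, 24, 25, 60]

q1, median, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = [x for x in incomes if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as a separate outlier point.
outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print("box:", q1, "to", q3, "median:", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```

Here the upper whisker ends at 25, not at the maximum of 60, which is plotted on its own as an outlier.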
Purpose boxplot
Summarise distribution SHAPE and VARIABILITY in a compact form - great for comparing multiple groups side by side
- see median differences at a glance
- see IQR (box height)
- check symmetry vs skew (is median centered in box - are whisker lengths equal?)
- identify outliers - plotted as dots
Boxplots in anthropology
Compare distribution across categories
- Eg nutritional status by region, income by ethnic group etc
- conveys which groups tend to be higher or more variable
Common pitfalls boxplots
1) Confusing median with mean (mean is NOT shown)
2) Assuming whiskers = min and max values - the whiskers reach at most 1.5 x IQR from the box - outliers are plotted separately
3) Ignoring outlier points - they can be important
4) Comparing boxplot heights without considering sample size - also small n can lead to misleading boxplots.
Scatterplot
Definition: Plot of individual data points on two axes (x vs y) - each point is one observation with a value for variable x and variable y (eg country and income)
Purpose: Reveal ASSOCIATION/CORRELATION between two quantitative variables
- Direction: positive correlation (upward trend), negative correlation (downward trend), or no correlation (no clear trend)
- Form: Linear, curved, clustered, or outliers influencing patterns
- Strength: How tightly do points cluster around a line or trend - tight = strong, scattered = weak
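A sketch of how direction and strength can be put into a single number. The card does not name a specific measure; Pearson's r, computed by hand below, is one common choice and is an assumption here, as are the made-up data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, size gives strength."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

# Hypothetical observations, one per household.
years_schooling = [2, 4, 6, 8, 10]
income = [10, 14, 15, 21, 25]     # tends to rise with schooling
children = [6, 5, 4, 2, 1]        # tends to fall with schooling

r_income = pearson_r(years_schooling, income)
r_children = pearson_r(years_schooling, children)

print("schooling vs income:", round(r_income, 2))
print("schooling vs children:", round(r_children, 2))
```

A value near +1 corresponds to a tight upward trend, near -1 to a tight downward trend, and near 0 to no clear linear trend.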
Anthropological use of scatterplots
- Examine a potential relationship - see if higher X tends to go with higher or lower Y
- Identify subgroups or anomalies
Central tendency and variation in scatterplots
- No single center shown - focus on co-variation instead of a variable’s mean
- Variation: the spread of points around any trend indicates how consistent or strongly correlated the relationship is. Wide scatter = high variability and thus weak correlation
Common pitfalls scatterplots
- Correlation vs causation trap
- Overplotting in large datasets - too many points can obscure patterns
- Ignoring non-linear patterns
- Letting outliers distort perception
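A sketch of the non-linear pitfall, assuming a linear correlation measure (Pearson's r, written out by hand) is used as the summary. The U-shaped data are invented: y depends entirely on x, yet the linear correlation is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, which only captures straight-line trends."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

# Hypothetical U-shaped relationship: y is exactly x squared.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print("r for the U-shaped data:", r_curve)
```

The number alone would suggest no relationship, while a scatterplot of the same points shows a clear curve, which is why the plot should be looked at before trusting the summary.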