Chapter 3: Numerical descriptions of data Flashcards
mode
-most frequently appearing value, or most common frequency class
- “humps in distribution”
- need not be all the same height or count
-mostly used to recognize bi-modal or multi-modal distributions
multi-modal
-often a tip-off that different types of individuals are in the data set
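A minimal sketch of finding modes, assuming Python 3.8+ and a made-up list of heights. `multimode` only reports values tied for the highest count, so humps of unequal height still have to be spotted on a histogram.

```python
from statistics import multimode

# Hypothetical data: two clusters, as might come from two types of individuals.
heights = [150, 152, 152, 152, 160, 171, 171, 171, 175]

# multimode returns every value tied for the highest count,
# in the order each was first encountered.
modes = multimode(heights)
print(modes)  # [152, 171]
```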
mean or average
-useful when the data is roughly symmetrical and without many outliers
-can be misleading on very skewed data
median
- the halfway point: roughly half of the values are smaller than this value and half are larger
-a measure of center that is more resistant to skewness and outliers than the mean
-frequently used for distributions like income or house cost
-the value that lies in the middle of the data when the data set is ordered
median with odd number of entries
-median is the middle data entry
median with even number of entries
-median is the mean of the two middle data entries
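The odd and even cases can be checked with a short sketch. The two lists are invented for illustration; `statistics.median` sorts the data itself.

```python
from statistics import median

odd_entries = [7, 1, 5, 3, 9]        # sorted: 1 3 5 7 9
even_entries = [7, 1, 5, 3, 9, 11]   # sorted: 1 3 5 7 9 11

# Odd count: the single middle entry.
print(median(odd_entries))   # 5

# Even count: the mean of the two middle entries, (5 + 7) / 2.
print(median(even_entries))  # 6.0
```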
advantage of using the mean
-the mean is a reliable measure because it takes into account every entry of a data set
disadvantage of using the mean
-greatly affected by outliers
-if your data is skewed, the mean is not the best measure of center; the median is better because it is not affected by outliers
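A small sketch of how one outlier moves the mean but barely moves the median. The income figures are hypothetical.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]    # hypothetical incomes, in $1000s
with_outlier = incomes + [500]    # add one very large income

# Without the outlier, mean and median agree (both 35).
print(mean(incomes), median(incomes))

# With the outlier, the mean jumps to 112.5 but the median is only 36.5.
print(mean(with_outlier), median(with_outlier))
```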
Range
- the difference between the maximum and minimum entries; easier to find after sorting the data
-sensitive to outliers
-most common
Interquartile Range (IQR)
-range of middle half
-less sensitive to outliers
-the difference between the third and first quartiles
-IQR = Q3 - Q1
-Low Fence = Q1 - 1.5 × IQR
-High Fence = Q3 + 1.5 × IQR
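A sketch of the IQR and fence calculations on invented data, with the range for comparison. Values outside the fences are commonly flagged as potential outliers. Quartile conventions differ between textbooks and software; this uses the default "exclusive" method of `statistics.quantiles`, which may not match a particular course's rule.

```python
from statistics import quantiles

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]   # hypothetical values, already sorted

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Values beyond either fence are candidates for outliers.
outliers = [x for x in data if x < low_fence or x > high_fence]

print(max(data) - min(data))   # range: 39, inflated by the value 40
print(q1, q3, iqr)             # 3.5 11.0 7.5
print(low_fence, high_fence)   # -7.75 22.25
print(outliers)                # [40]
```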
standard deviation
-appropriate for symmetric distributions where the mean is a good measure of center
-tells you “on average” how much EACH data value differs (varies) from the mean
-The greater the STANDARD DEVIATION the greater the SPREAD of data
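A sketch comparing two invented data sets with the same mean but different spread. The helper `population_sd` is a hypothetical name; it divides by n (population form), whereas the sample form divides by n - 1 (`statistics.stdev`).

```python
from statistics import mean, pstdev

# Hypothetical data sets: both have mean 5, but different spread.
tight = [4, 5, 5, 5, 5, 5, 5, 6]
wide = [2, 4, 4, 4, 5, 5, 7, 9]

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    mu = mean(values)
    return (sum((x - mu) ** 2 for x in values) / len(values)) ** 0.5

print(population_sd(tight))  # 0.5
print(population_sd(wide))   # 2.0

# The standard library's pstdev computes the same population quantity.
print(pstdev(wide))
```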
canonical examples of variation
- the weights of ping pong balls: (b/c of standards and high quality manufacturing, very little spread)
- the weights of apples in a grocery store: some but no extreme variability
- your little brother's rock collection: a lot of variation
standard score (z-score)
-represents the number of standard deviations a given value x falls from the mean μ
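In symbols, the standard score is z = (x - μ) / σ, where σ is the standard deviation. A minimal sketch follows; the function name `z_score` and the example numbers are illustrative assumptions.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical example: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```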
percentile
-ranks by 100ths
decile
-ranks by 10ths
quartile
-ranks by 4ths
quintile
-ranks by 5ths
percentile
-the kth percentile is the data value that has k% of the data at or below that value
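The "at or below" definition can be sketched by counting. The helper `percent_at_or_below` and the scores are hypothetical; statistical software often interpolates and can give slightly different answers.

```python
def percent_at_or_below(data, value):
    """Percentage of entries in data that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)

scores = [55, 60, 62, 70, 71, 75, 80, 84, 90, 95]   # hypothetical test scores

# 8 of the 10 scores are at or below 84, so 84 sits at the 80th percentile.
print(percent_at_or_below(scores, 84))  # 80.0
```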
Box and whisker plot
-exploratory data analysis tool
-highlight important features of a data set
- requires: minimum entry, first quartile, median, third quartile, maximum entry
-used to gauge the symmetry and spread of a distribution
-side-by-side plots make it convenient to compare two or more distributions
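The five values a box and whisker plot requires can be collected with a short sketch. `five_number_summary` is a hypothetical helper, the data is invented, and the quartiles use the default method of `statistics.quantiles`, which may differ from a textbook's convention.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

data = [2, 4, 5, 7, 8, 9, 12, 15, 18, 21, 30]   # hypothetical values

summary = five_number_summary(data)
print(summary)  # (2, 5.0, 9.0, 18.0, 30)
```

Here the upper whisker (18 to 30) is longer than the lower one (2 to 5), which hints at a right skew.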
mean = median
“normal symmetric distribution”
-no outliers, or outliers on both ends that balance one another out
mean < median
(skewed left: small-value outlier)
(think: skewed Left then the mean is Less than median)
mean > median
(skewed right: large-value outlier)
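The comparison rule above can be sketched as a rough check. `skew_hint` is a hypothetical helper and the lists are invented; it is only a rule of thumb, since equal mean and median do not by themselves prove a distribution is symmetric.

```python
from statistics import mean, median

def skew_hint(values):
    """Compare mean and median to suggest a direction of skew."""
    m, med = mean(values), median(values)
    if m < med:
        return "left"    # small values pull the mean below the median
    if m > med:
        return "right"   # large values pull the mean above the median
    return "no skew indicated"

print(skew_hint([1, 2, 2, 3, 3, 3, 4, 20]))         # right
print(skew_hint([2, 18, 19, 19, 20, 20, 20, 21]))   # left
print(skew_hint([1, 2, 3, 4, 5]))                   # no skew indicated
```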
Measures of variation are
- range
- deviation
- variance
- standard deviation
Measures of Center are
- mean
- median
- mode