section 2.2: considering categorical data Flashcards
(35 cards)
what is a contingency table?
a table that summarizes data for two categorical variables
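A minimal sketch of building a contingency table by counting pairs of categories. The records and the helper name `contingency_table` are made up for illustration and are not from the course material.

```python
from collections import Counter

# Hypothetical records: (housing status, loan decision) for six applicants.
records = [
    ("rent", "approved"),
    ("own", "approved"),
    ("rent", "denied"),
    ("own", "approved"),
    ("rent", "approved"),
    ("own", "denied"),
]

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```

Each row of the result is one category of the first variable, and each cell is the count for one combination of the two variables.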
what is a bar plot?
a common way to display the distribution of a single categorical variable: one bar per category, with the bar height showing the count
what is a relative-frequency bar plot?
a bar plot where proportions instead of frequencies are shown
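A small sketch of how the heights in a relative-frequency bar plot could be computed. The survey responses and the helper name `relative_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical survey responses for one categorical variable.
responses = ["yes", "no", "yes", "yes", "unsure", "no", "yes", "no"]

def relative_frequencies(values):
    """Convert category counts into proportions that sum to 1."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

proportions = relative_frequencies(responses)

# Crude text bar plot: bar length is proportional to the proportion.
for category, p in proportions.items():
    print(f"{category:<7}{'#' * round(p * 20)} {p:.3f}")
```

The bars have the same relative heights as in a frequency bar plot; only the scale of the vertical axis changes.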
how are bar plots different than histograms?
Bar plots are used for displaying distributions of categorical variables, while
histograms are used for numerical variables. The x-axis in a histogram is a
number line, hence the order of the bars cannot be changed, while in a bar plot
the categories can be listed in any order.
what is variance?
roughly the average squared distance of the observations from the mean; equal to the standard deviation squared
what is the equation for variance?
s^2 = (sum of (x_i - x̄)^2) / (n - 1), where x̄ is the sample mean and n is the number of observations
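A worked sketch of the formula above, checked against the standard library. The data values and the helper name `sample_variance` are made up for illustration.

```python
import statistics

# Hypothetical sample of five observations.
data = [2, 4, 4, 5, 10]

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

s2 = sample_variance(data)
print(s2)                         # 9.0
print(statistics.variance(data))  # same value from the standard library
```

Here the mean is 5, the squared deviations are 9, 1, 1, 0 and 25, and their sum 36 divided by n - 1 = 4 gives 9.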
what points make a larger difference in variance?
points that are far away from the mean
Why do we use the squared deviation in the calculation of variance?
To get rid of negatives so that observations equally distant from the mean are weighted equally.
To weight larger deviations more heavily.
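A short sketch of both reasons, using a made-up sample. The raw deviations cancel out, while the squared deviations do not, and the point farthest from the mean dominates the total.

```python
# Hypothetical sample; its mean is exactly 5.
data = [2, 4, 4, 5, 10]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]   # -3, -1, -1, 0, 5
squared = [d ** 2 for d in deviations]  # 9, 1, 1, 0, 25

# Raw deviations always sum to zero, so they cannot measure spread.
print(sum(deviations))

# Share of the total squared deviation contributed by each point.
shares = [sq / sum(squared) for sq in squared]
for x, share in zip(data, shares):
    print(x, round(share, 3))
```

The observation 10 is only one of five points, yet it accounts for 25 of the 36 units of squared deviation.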
what is standard deviation?
the square root of the variance; it has the same units as the data
what is the median?
the value that splits the data in half when ordered in ascending order
what is the 50th percentile?
the median
what is the 25th percentile?
the first quartile, Q1
what is the 75th percentile?
the third quartile, Q3
what is interquartile range (IQR)?
the width of the range that contains the middle 50% of the data (the distance from Q1 to Q3)
what is the equation for IQR?
IQR = Q3 - Q1
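A sketch of computing the IQR with Python's `statistics.quantiles`. The data are hypothetical, and the `"inclusive"` method is an assumption: several quartile conventions exist, so a textbook or other software may give slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical sample of nine observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four parts and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, median, q3)  # 3.0 5.0 7.0
print(iqr)             # 4.0
```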
what does the box in a box plot represent?
the box spans Q1 to Q3, so it represents the middle 50% of the data; the thick line inside the box is the median
what is the max upper whisker reach of a box plot?
Q3 + 1.5 x IQR
what is the max lower whisker reach of a box plot?
Q1 - 1.5 x IQR
what is an outlier?
observation beyond the maximum reach of the whiskers
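A sketch of the 1.5 x IQR rule applied to a made-up sample. The quartile method (`"inclusive"`) and the variable names are assumptions; other quartile conventions can shift the cutoffs slightly.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers.
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Observations beyond the reach are flagged as outliers.
outliers = [x for x in data if x < lower_reach or x > upper_reach]

# The whiskers are drawn only out to the most extreme points within reach.
inside = [x for x in data if lower_reach <= x <= upper_reach]
whisker_low, whisker_high = min(inside), max(inside)

print(lower_reach, upper_reach)   # -3.0 13.0
print(outliers)                   # [30]
print(whisker_low, whisker_high)  # 1 8
```

Note that the upper whisker stops at 8, the largest observation within reach, not at the cutoff of 13 itself.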
why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
what are the robust statistics?
median and IQR (extreme observations have little effect on their values)
what are the non-robust statistics?
mean, variance, and standard deviation
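A sketch comparing the two groups of statistics when a single observation becomes extreme. Both samples are made up, and the `"inclusive"` quartile method is an assumption.

```python
import statistics

def iqr(xs):
    """Q3 - Q1 using the 'inclusive' quartile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return q3 - q1

# Hypothetical samples: identical except that the largest value 9 becomes 90.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
contaminated = [1, 2, 3, 4, 5, 6, 7, 8, 90]

for name, xs in [("clean", clean), ("contaminated", contaminated)]:
    print(
        name,
        "median:", statistics.median(xs),
        "IQR:", iqr(xs),
        "mean:", statistics.mean(xs),
        "SD:", round(statistics.stdev(xs), 2),
    )
```

The median stays at 5 and the IQR stays at 4, while the mean jumps from 5 to 14 and the standard deviation grows substantially.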
for skewed distributions it is often more helpful to use ___________ to describe the center and spread
median and IQR
for symmetric distributions it is often more helpful to use __________ to describe the center and spread
the mean and SD