Chapter 2 - Descriptive Statistics Flashcards Preview


Flashcards in Chapter 2 - Descriptive Statistics Deck (22):

John Tukey

- 1915 - 2000
- pioneered exploratory data analysis (EDA), including boxplots and stem-and-leaf plots
- coined terms such as "bit" and "software"


Features of a good numeric or graphic form of data summarization

- self-contained
- understandable without reading the text
- attributes clearly labeled with well-defined terms
- indicates the principal trends in the data


Measures of location

- also known as measures of central tendency
- data summarization is important before any inferences can be made
- a measure of location summarizes the data by identifying the center or middle of the sample


Arithmetic mean limitation

- oversensitive to extreme values
- when extreme values are present, the mean may not be representative of the location of the majority of sample points
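
A minimal sketch of this limitation, using a made-up sample (the values are hypothetical, not taken from the deck). Adding one extreme value pulls the mean far from where most points sit, while the median barely moves:

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same sample plus one extreme value.
typical = [2, 3, 3, 4, 5]
with_outlier = typical + [100]

mean_typical = mean(typical)            # 3.4
mean_outlier = mean(with_outlier)       # 19.5
median_typical = median(typical)        # 3
median_outlier = median(with_outlier)   # 3.5

print(f"mean:   {mean_typical} -> {mean_outlier}")
print(f"median: {median_typical} -> {median_outlier}")
```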


Symmetric distribution

arithmetic mean is approximately the same as the median


Positively skewed distribution

- tail end is on the right side
- arithmetic mean tends to be larger than the median


Negatively skewed distribution

- tail end is on the left side
- arithmetic mean tends to be smaller than the median
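
As a rough illustration of the mean and median relationship under skew, the sketch below uses a hypothetical right-skewed sample and its mirror image. The data and variable names are assumptions made for this example only:

```python
from statistics import mean, median

# Hypothetical positively skewed sample: long tail on the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 20]
# Mirror the sample around 10.5 to get a negatively skewed one: long tail on the left.
left_skewed = [21 - x for x in right_skewed]

right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

print(right_mean > right_median)  # mean exceeds median for the right-skewed sample
print(left_mean < left_median)    # mean falls below median for the left-skewed sample
```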



Mode

- the most frequently occurring value among all the observations in a sample
- data distributions may have one or more modes (unimodal, bimodal, trimodal, etc.)
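
One possible way to find every most-frequent value is Python's `statistics.multimode` (available from Python 3.8). The samples below are hypothetical:

```python
from statistics import multimode

# Hypothetical samples with one and with two most-frequent values.
unimodal = [1, 2, 2, 3, 4]
bimodal = [1, 2, 2, 3, 3, 4]

unimodal_modes = multimode(unimodal)  # [2]
bimodal_modes = multimode(bimodal)    # [2, 3]

print(unimodal_modes, bimodal_modes)
```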



Range

- the difference between the largest and smallest observations in a sample
- range is very sensitive to extreme observations or outliers
- the larger the sample size n, the larger the range tends to be, which makes it harder to compare ranges from data sets of different sizes
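
The sketch below illustrates the sample-size effect under one simplifying assumption: the smaller samples are prefixes of the larger one, so the range can only stay the same or grow. The helper `sample_range`, the seed, and the normal draws are all choices made for this example:

```python
import random

def sample_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)

# Simulated draws from a standard normal distribution (seed fixed for repeatability).
rng = random.Random(0)
draws = [rng.gauss(0, 1) for _ in range(1000)]

# Nested samples of increasing size: each contains all points of the previous one.
ranges = {n: sample_range(draws[:n]) for n in (10, 100, 1000)}

for n, r in ranges.items():
    print(f"n = {n:4d}  range = {r:.2f}")
```

For independent (non-nested) samples the range is not guaranteed to grow, but it tends to, which is the point the card makes.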


Quantiles or percentiles

- percentiles (quantiles) are a better way than the range to quantify the spread of a data set
- percentiles are less sensitive to outliers and are not greatly affected by the sample size
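
A short sketch comparing quartiles with the range when one value becomes extreme. It relies on `statistics.quantiles` with its default "exclusive" method; other textbooks and packages use slightly different quartile conventions, so treat the exact cut points as one convention among several. The data are hypothetical:

```python
from statistics import quantiles

# Hypothetical sample, and the same sample with its largest value made extreme.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

clean_quartiles = quantiles(clean, n=4)           # [2.5, 5.0, 7.5]
outlier_quartiles = quantiles(with_outlier, n=4)  # [2.5, 5.0, 7.5]

clean_range = max(clean) - min(clean)                    # 8
outlier_range = max(with_outlier) - min(with_outlier)    # 899

print(clean_quartiles, outlier_quartiles)
print(clean_range, outlier_range)
```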


Standard deviation

standard deviation is a reasonable measure of spread if the distribution is bell-shaped
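
The card does not give a formula, so the sketch below assumes the usual sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes. The data are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Library version: sample standard deviation with divisor n - 1.
s = stdev(sample)

# The same quantity computed by hand from squared deviations about the mean.
xbar = mean(sample)
s_by_hand = sqrt(sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1))

print(round(s, 3), round(s_by_hand, 3))
```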


Grouped data

- when sample size is too large to display all the raw data, data are frequently collected in grouped form
- the simplest way to display the data is to generate a frequency distribution using a statistical package


Frequency distribution

- frequency distribution = ordered display of each value in a data set together with its frequency
- if the number of unique sample values is large, then a frequency distribution may still be too detailed
- in that case, the data are categorized into broader groups
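
A minimal sketch of both ideas, using `collections.Counter` in place of a statistical package. The values and the group width of 3 are assumptions chosen for illustration:

```python
from collections import Counter

# Hypothetical raw data.
values = [3, 1, 2, 3, 3, 2, 5, 8, 8, 1]

# Frequency distribution: each distinct value, in order, with its frequency.
freq = sorted(Counter(values).items())
# [(1, 2), (2, 2), (3, 3), (5, 1), (8, 2)]

# Grouped version: map each value to the lower bound of a width-3 interval
# (0-2, 3-5, 6-8) and count per interval.
width = 3
grouped = sorted(Counter((v // width) * width for v in values).items())
# [(0, 4), (3, 4), (6, 2)]

for lower, count in grouped:
    print(f"{lower}-{lower + width - 1}: {count}")
```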


Types of grouped data

- bar graphs
- stem and leaf plots
- box and whisker plot
- scatter plot
- histogram


Bar graphs

- identity of the sample points within the respective groups is lost


Stem and leaf plots

- easy to compute the median and other quantiles
- each data point is split into a stem (the leading digits) and a leaf (the final digit)
- the collection of leaves indicates the shape of the data distribution
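
A simple sketch of building such a plot, assuming non-negative integer data with the tens (and higher) digits as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the scores are hypothetical:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

scores = [12, 15, 21, 23, 23, 34, 38, 41]
plot = stem_and_leaf(scores)

# Print every stem from smallest to largest, including stems with no leaves,
# so the row lengths reflect the shape of the distribution.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the leaves are stored in sorted order, the median can be read off by counting in from either end of the plot.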


Box and whisker plot

- uses the relationships among the median, upper quartile, and lower quartile to describe the skewness or symmetry of a distribution
- a vertical bar connects the upper quartile to the largest non-outlying value in the sample
- a vertical bar connects the lower quartile to the smallest non-outlying value in the sample


Box and whisker plot (symmetric)

- upper and lower quartiles should be approximately equally spaced from the median


Box and whisker plot (positively skewed)

- upper quartile is farther from the median than the lower quartile


Box and whisker plot (negatively skewed)

- lower quartile is farther from the median than the upper quartile
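
The comparison a box plot makes visually can be sketched numerically by measuring how far each quartile lies from the median. The helper `quartile_distances` is hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from other conventions:

```python
from statistics import quantiles

def quartile_distances(values):
    """Return (median - lower quartile, upper quartile - median)."""
    lower_q, med, upper_q = quantiles(values, n=4)
    return med - lower_q, upper_q - med

# Hypothetical positively skewed sample and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 20]
left_skewed = [21 - x for x in right_skewed]

below_r, above_r = quartile_distances(right_skewed)  # (1.0, 4.0)
below_l, above_l = quartile_distances(left_skewed)   # (4.0, 1.0)

print(above_r > below_r)  # upper quartile farther from the median: positive skew
print(below_l > above_l)  # lower quartile farther from the median: negative skew
```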


Box and whisker plot (outlying value)

- x > upper quartile + 1.5 × IQR, where IQR (interquartile range) = upper quartile - lower quartile
- x < lower quartile - 1.5 × IQR


Box and whisker plot (extreme outlying value)

- x > upper quartile + 3.0 × IQR
- x < lower quartile - 3.0 × IQR
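
The outlying and extreme outlying rules can be sketched as follows, taking IQR as the upper quartile minus the lower quartile. The function `classify` and the data are hypothetical, and the quartiles use the default method of `statistics.quantiles`, so the fences may shift slightly under another quartile convention:

```python
from statistics import quantiles

def classify(values):
    """Label each value using the 1.5 x IQR and 3.0 x IQR rules."""
    lower_q, _, upper_q = quantiles(values, n=4)
    iqr = upper_q - lower_q
    labels = []
    for x in values:
        if x > upper_q + 3.0 * iqr or x < lower_q - 3.0 * iqr:
            labels.append((x, "extreme outlying"))
        elif x > upper_q + 1.5 * iqr or x < lower_q - 1.5 * iqr:
            labels.append((x, "outlying"))
        else:
            labels.append((x, "not outlying"))
    return labels

# Hypothetical sample: quartiles are 12 and 18, so IQR = 6,
# the 1.5 x IQR fences are 3 and 27, and the 3.0 x IQR fences are -6 and 36.
data = [10, 12, 12, 13, 14, 15, 16, 17, 18, 30, 60]
labels = classify(data)

for x, label in labels:
    if label != "not outlying":
        print(x, label)
```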