Flashcards in Chapter 2 - Descriptive Statistics Deck (22):

1

## John Tukey

###
- 1915 - 2000

- pioneered exploratory data analysis (EDA), including boxplots and stem-and-leaf plots

- coined terms such as bit and software

2

## Features of a good numeric or graphic form of data summarization

###
- self-contained

- understandable without reading the text

- attributes clearly labeled with well-defined terms

- indicate principal trends in data

3

## Measures of location

###
- also known as measures of central tendency

- data summarization is important before any inferences can be made

- a measure of location summarizes the data by defining the center or middle of the sample

4

## Arithmetic mean limitation

###
- oversensitive to extreme values

- in which case, it may not be representative of the location of the majority of sample points
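
A minimal sketch of this sensitivity, assuming a made-up sample (the values below are hypothetical, not from the deck): one extreme value pulls the mean far from where most points lie, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same sample plus one extreme value.
typical = [10, 11, 12, 13, 14]
with_extreme = typical + [100]

mean_typical = mean(typical)                # 60 / 5 = 12
mean_with_extreme = mean(with_extreme)      # 160 / 6, about 26.67
median_with_extreme = median(with_extreme)  # (12 + 13) / 2 = 12.5

print(mean_typical, mean_with_extreme, median_with_extreme)
```

Five of the six points lie between 10 and 14, yet the mean lands near 26.7.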

5

## Symmetric distribution

### arithmetic mean is approximately the same as the median

6

## Positively skewed distribution

###
- tail end is on the right side

- arithmetic mean tends to be larger than the median

7

## Negatively skewed distribution

###
- tail end is on the left side

- arithmetic mean tends to be smaller than the median
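
The mean-versus-median pattern for symmetric, positively skewed, and negatively skewed distributions can be checked on small samples. This is a sketch with hypothetical data; `mean_vs_median` is an illustrative helper, not a standard function.

```python
from statistics import mean, median

def mean_vs_median(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)

# Hypothetical samples chosen to show each shape.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 4, 10, 20]   # long tail on the right
left_tail = [21 - x for x in right_tail]     # mirror image: long tail on the left

for name, data in [("symmetric", symmetric),
                   ("positively skewed", right_tail),
                   ("negatively skewed", left_tail)]:
    m, md = mean_vs_median(data)
    print(f"{name}: mean={m:.2f}, median={md}")
```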

8

## Mode

###
- the most frequently occurring value among all the observations in a sample

- data distributions may have one or more modes (unimodal, bimodal, trimodal, etc.)
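
As a small illustration, Python's `statistics.multimode` (available from Python 3.8) returns every value tied for the highest frequency. The samples below are hypothetical.

```python
from statistics import multimode

# Hypothetical samples.
unimodal = [1, 2, 2, 3, 4]
bimodal = [1, 2, 2, 3, 3, 4]

unimodal_modes = multimode(unimodal)  # one most frequent value
bimodal_modes = multimode(bimodal)    # two values tie for most frequent

print(unimodal_modes, bimodal_modes)
```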

9

## Range

###
- the difference between the largest and smallest observations in a sample

- range is very sensitive to extreme observations or outliers

- the larger the sample size n, the larger the range tends to be, which makes it harder to compare ranges from data sets of different sizes
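
Both properties can be sketched in a few lines. The assumptions here are a made-up five-point sample and standard-normal draws from Python's `random` module with a fixed seed; `sample_range` is an illustrative helper.

```python
import random

def sample_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

# Hypothetical sample: a single outlier dominates the range.
clean = [1, 2, 3, 4, 5]
range_clean = sample_range(clean)              # 5 - 1 = 4
range_with_outlier = sample_range(clean + [50])  # 50 - 1 = 49

# Nested samples from the same distribution: each larger sample contains
# the smaller one, so its range can only stay the same or grow.
rng = random.Random(0)
draws = [rng.gauss(0, 1) for _ in range(1000)]
ranges_by_n = {n: sample_range(draws[:n]) for n in (10, 100, 1000)}

print(range_clean, range_with_outlier, ranges_by_n)
```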

10

## Quantiles or percentiles

###
- percentiles (quantiles) are a better way than the range to quantify the spread in a data set

- percentiles are less sensitive to outliers and are not greatly affected by the sample size
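
A sketch of this robustness, assuming hypothetical data and the default "exclusive" method of Python's `statistics.quantiles`. Textbooks and software packages define percentiles in slightly different ways, so other tools may give somewhat different cut points.

```python
from statistics import quantiles

# Hypothetical samples: identical except that the largest value is replaced by an outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

# n=4 asks for the three quartile cut points (25th, 50th, 75th percentiles).
quartiles_clean = quantiles(clean, n=4)
quartiles_outlier = quantiles(with_outlier, n=4)

range_clean = max(clean) - min(clean)                   # 8
range_outlier = max(with_outlier) - min(with_outlier)   # 89

print(quartiles_clean, quartiles_outlier, range_clean, range_outlier)
```

The quartiles are unchanged by the outlier, while the range jumps from 8 to 89.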

11

## Standard deviation

### standard deviation is a reasonable measure of spread if the distribution is bell-shaped
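
A minimal sketch, assuming a hypothetical eight-point sample and simulated standard-normal draws with a fixed seed. It computes the sample standard deviation (n - 1 denominator) and then checks the familiar bell-shaped property that roughly two thirds of the points fall within one standard deviation of the mean.

```python
import math
import random
from statistics import mean, stdev

# Hypothetical sample: mean 5, squared deviations sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s = stdev(sample)  # sqrt(32 / 7), about 2.14

# Simulated bell-shaped data (an assumption for illustration only).
rng = random.Random(0)
bell = [rng.gauss(0, 1) for _ in range(10_000)]
m, sd = mean(bell), stdev(bell)
within_one_sd = sum(abs(x - m) <= sd for x in bell) / len(bell)

print(round(s, 3), round(within_one_sd, 3))
```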

12

## Grouped data

###
- when sample size is too large to display all the raw data, data are frequently collected in grouped form

- the simplest way to display the data is to generate a frequency distribution using a statistical package

13

## Frequency distribution

###
- frequency distribution = ordered display of each value in a data set together with its frequency

- if the number of unique sample values is large, then a frequency distribution may still be too detailed

- if the frequency distribution is too detailed, the values are categorized into broader groups
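
One way to sketch both steps in Python, assuming hypothetical values and an arbitrary group width of 10: count each distinct value, then collapse the values into broader intervals.

```python
from collections import Counter

# Hypothetical sample values.
values = [23, 25, 25, 31, 38, 38, 38, 42, 47, 51]

# Frequency distribution: each distinct value with its count, in order.
freq = sorted(Counter(values).items())

# Grouped form: label each value by the lower bound of its width-10 interval.
width = 10  # assumed group width, chosen for illustration
grouped_freq = sorted(Counter((v // width) * width for v in values).items())

for lower, count in grouped_freq:
    print(f"{lower}-{lower + width - 1}: {count}")
```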

14

## Graphic methods for displaying data

###
- bar graphs

- stem and leaf plots

- box and whisker plot

- scatter plot

- histogram

15

## Bar graphs

### identity of the sample points within the respective groups is lost

16

## Stem and leaf plots

###
- easy to compute the median and other quantiles

- each data point is converted into stem and leaf

- the collection of leaves indicates the shape of the data distribution
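
A minimal sketch of the conversion, assuming non-negative integers where the tens digits form the stem and the units digit forms the leaf. The data and the `stem_and_leaf` and `render` helpers are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

def render(plot):
    """Format the plot as one 'stem | leaves' line per stem."""
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(plot.items())]

# Hypothetical sample values.
data = [23, 12, 41, 15, 28, 21, 34, 23, 30]
plot = stem_and_leaf(data)
lines = render(plot)
print("\n".join(lines))
```

The row with the most leaves (stem 2 here) shows where the data concentrate, which is how the leaves convey the shape of the distribution.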

17

## Box and whisker plot

###
- uses the relationships among the median, upper quartile, and lower quartile to describe the skewness or symmetry of a distribution

- a vertical bar connects the upper quartile to the largest non-outlying value in the sample

- a vertical bar connects the lower quartile to the smallest non-outlying value in the sample

18

## Box and whisker plot (symmetric)

### upper and lower quartiles should be approximately equally spaced from the median

19

## Box and whisker plot (positively skewed)

### upper quartile is farther from the median than the lower quartile

20

## Box and whisker plot (negatively skewed)

### lower quartile is farther from the median than the upper quartile
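
The three box plot shapes (symmetric, positively skewed, negatively skewed) come down to comparing two distances from the median. A rough sketch, assuming hypothetical data and the default quartile method of `statistics.quantiles`; `quartile_skew` is an illustrative helper, and its exact-equality test for "symmetric" is a simplification of "approximately equally spaced".

```python
from statistics import quantiles

def quartile_skew(values, tol=1e-9):
    """Classify shape by comparing quartile distances from the median."""
    q1, q2, q3 = quantiles(values, n=4)
    upper_gap = q3 - q2  # upper quartile to median
    lower_gap = q2 - q1  # median to lower quartile
    if abs(upper_gap - lower_gap) <= tol:
        return "symmetric"
    return "positive" if upper_gap > lower_gap else "negative"

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_tail = [1, 2, 2, 3, 3, 3, 4, 10, 20]
left_tail = [21 - x for x in right_tail]

print(quartile_skew(symmetric), quartile_skew(right_tail), quartile_skew(left_tail))
```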

21

## Box and whisker plot (outlying value)

###
- x > upper quartile + 1.5 × IQR, where IQR (interquartile range) = upper quartile - lower quartile

- x < lower quartile - 1.5 × IQR
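
A sketch of the outlier rule and the resulting whisker ends, assuming hypothetical data and the default quartile method of Python's `statistics.quantiles` (other software may compute slightly different quartiles). `box_plot_summary` is an illustrative helper.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return outlying values and whisker ends using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]
    inside = [x for x in values if low_fence <= x <= high_fence]
    return {
        "outliers": outliers,
        "whisker_low": min(inside),    # smallest non-outlying value
        "whisker_high": max(inside),   # largest non-outlying value
    }

# Hypothetical sample: quartiles 2.5 and 7.5, so IQR = 5 and the fences are -5 and 15.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = box_plot_summary(data)
print(summary)
```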

22