Chapter 2 Flashcards
(25 cards)
Measures of central tendency
-finding typical value in dataset
-Mean(average)
-Median (middle value)
-mode: (most frequent value)
Population mean (μ)
Sample mean: x̄ (x bar)
Mean
Population mean (μ)
Sample mean: x̄ (x bar)
-mean is highly sensitive to outliers (extremely high or low values)
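A minimal sketch of the outlier sensitivity noted above, using Python's standard `statistics` module. The numbers are made up for illustration.

```python
from statistics import mean

# Hypothetical data: five typical values, then the same values plus one outlier.
typical = [10, 12, 11, 13, 14]
with_outlier = typical + [100]

mean_typical = mean(typical)            # 60 / 5 = 12
mean_with_outlier = mean(with_outlier)  # 160 / 6, roughly 26.7

print(mean_typical, mean_with_outlier)
```

One extreme value more than doubles the mean here, even though five of the six values sit between 10 and 14.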
Median
Middle value of the data set when ranked smallest to largest
-NOT sensitive to outliers, so use when data is extremely skewed
If the data set has an odd number of values, the median is the value in the middle
If the data set has an even number of values, the median is the mean of the 2 middle values
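A short sketch of the odd and even cases, using `statistics.median` from Python's standard library. The data values are hypothetical.

```python
from statistics import median

odd_count = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even_count = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> the mean of 3 and 5 is 4

median_odd = median(odd_count)
median_even = median(even_count)

print(median_odd, median_even)
```

`median` sorts the data itself, so the input does not need to be ranked first.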
Mode
-most frequently occurring value in data set
-can have multiple modes
-no number repeats = no mode
-most commonly used for categorical data
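A possible sketch of the mode rules above. The helper name `modes` is hypothetical, and it follows this card's convention that a data set with no repeated value has no mode. Python's built-in `statistics.multimode` differs on that point and returns every value when nothing repeats.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Card convention: no number repeats = no mode.
        return []
    return [value for value, count in counts.items() if count == top]

colors = ["red", "blue", "red", "green"]   # categorical data
two_modes = [1, 1, 2, 2, 3]
no_repeats = [4, 5, 6]

print(modes(colors), modes(two_modes), modes(no_repeats))
```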
Right skewed distribution
-very large values (outliers) pull average up so tail is longer on right side
mode<median<mean (typical pattern)
-mean is usually greater than median, which is usually greater than mode
Left skewed distribution
-few extremely low values (outliers) pull average down
mean<median<mode (typical pattern)
-mean is usually less than median, which is usually less than mode
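A small check of the two skew orderings, using made-up data sets chosen so that each one has a single mode. This illustrates the typical pattern only; real skewed data does not always follow it.

```python
from statistics import mean, median, mode

# Hypothetical data: one very large value (right skew), one very small value (left skew).
right_skewed = [1, 2, 2, 3, 4, 5, 20]
left_skewed = [1, 16, 17, 18, 19, 19, 20]

for name, data in (("right", right_skewed), ("left", left_skewed)):
    print(name, "mode:", mode(data), "median:", median(data), "mean:", round(mean(data), 2))
```

In the right-skewed set the outlier 20 pulls the mean above the median. In the left-skewed set the outlier 1 pulls it below.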
Bimodal distribution
Two peaks = modes (two values that occur most frequently)
Mean & median usually fall between the peaks
Uniform distribution
mean = median (approximately)
Variability
Tells us how spread out or clustered values in a dataset are
- Range: difference between the largest and smallest values in a dataset
-highly sensitive to outliers
-doesn’t show how data is distributed; two datasets can have the same range but look different
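A brief sketch of the last point: two hypothetical data sets with the same range but different shapes. The function name `value_range` is just an illustrative choice.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

clustered = [0, 5, 5, 5, 5, 5, 10]   # most values sit at the center
spread = [0, 2, 4, 5, 6, 8, 10]      # values are evenly spaced

print(value_range(clustered), value_range(spread))
```

Both ranges equal 10, because the range looks only at the two extreme values.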
Variance
-Variance: measures how spread out the data points are from the mean; since we square the differences, variance is never negative
-small variance: data points=close to mean (low spread)
-large variance: data points = far from mean (high spread)
-population variance: for entire population
-sample variance: for a sample of the population
Standard deviation
-square root of variance; tells us how much data typically deviates from the mean
small standard deviation: data points close to mean
large standard deviation: data points widely spread out
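A sketch of the variance and standard deviation cards above, using Python's `statistics` module and a made-up data set. It relies on the standard definitions: the population versions divide the sum of squared differences by n, and the sample versions divide by n - 1.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, mean = 5

# Squared differences from the mean add up to 32.
pop_var = pvariance(data)    # 32 / 8 = 4
samp_var = variance(data)    # 32 / 7, roughly 4.57
pop_sd = pstdev(data)        # square root of 4
samp_sd = stdev(data)        # square root of 32 / 7

print(pop_var, samp_var, pop_sd, samp_sd)
```

The sample versions come out slightly larger than the population versions for the same numbers.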
1-2-3 rule
-most observations are within one standard deviation of the mean (if you pick a random value, it’s likely to be pretty close to the mean)
-many values are within two standard deviations (less common, but still happens often)
-Almost all values are within three standard deviations (if a value is beyond this, it’s a possible outlier)
-tells us whether an observation is normal or unusual
Ex: average height of adult is 170 cm with a standard deviation of 10 cm
- Typical height within 1 standard deviation of the mean: most people are between 160 cm and 180 cm
- Less common heights within 2 standard deviations of the mean: some are between 150 cm and 190 cm
- Rare heights within 3 standard deviations of the mean: almost everyone is between 140 cm and 200 cm, and only a very small number of people are near those extremes
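The height example can be checked with a few lines of arithmetic. This sketch only computes the three bands (mean plus or minus 1, 2, and 3 standard deviations); the variable names are illustrative.

```python
mean_height = 170   # cm, from the example
sd = 10             # cm, from the example

bands = {}
for k in (1, 2, 3):
    # Band k covers everything within k standard deviations of the mean.
    bands[k] = (mean_height - k * sd, mean_height + k * sd)

for k, (low, high) in bands.items():
    print(f"within {k} SD: {low} cm to {high} cm")
```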
Standard deviation
-measure spread around the mean
-tells us how spread out the data is
-values of variance and standard deviation are never negative
-highly sensitive to outliers (variance & standard deviation)
Z scores
-measure individual data points
-measures how far a single value is from the mean in terms of standard deviations
-tells us how unusual value is
Z= 0 at the mean
Z = 1 or Z = -1: one standard deviation from the mean (fairly normal)
|Z| > 2: unusual
|Z| > 3: very rare
Z = (x - mean) / standard deviation
(Larger |z| = more unusual)
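A minimal sketch of the z-score formula, reusing the height numbers from the 1-2-3 rule card (mean 170 cm, standard deviation 10 cm). The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean (negative = below)."""
    return (x - mean) / sd

z_at_mean = z_score(170, 170, 10)   # 0.0: exactly at the mean
z_tall = z_score(190, 170, 10)      # 2.0: two standard deviations above
z_short = z_score(145, 170, 10)     # -2.5: unusual on the low side

print(z_at_mean, z_tall, z_short)
```

The sign gives the direction and the absolute value gives how unusual the observation is.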
Percentiles
-tells you where a particular value stands relative to the rest of the data
- the ath percentile is the number that has a% of the data below it
Ex: 90th percentile (P90) 90% of the data is below this value
25th percentile 25% of the data is below this value
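One way to read the definition above is "what percent of the data is below this value". The sketch below computes that directly for hypothetical test scores; the helper name `percent_below` is made up. Software packages use several slightly different percentile conventions, so results can differ a little on small data sets.

```python
def percent_below(values, x):
    """Percent of the data that is strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]   # hypothetical scores

print(percent_below(scores, 95))   # 8 of the 10 scores are below 95
```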
Quartiles
- divide data into four parts
Q1= 25% of the data is below this value
Q2 = 50% of the data is below this value (median)
Q3 = 75% of the data is below this value
IQR
Tells us how spread out the middle 50% of the data is
IQR= Q3 -Q1
Resistant to outliers; focuses on the middle 50% and ignores extreme values
How to obtain quartiles
- Arrange the data from smallest to largest.
- Find the median of the data set (Q2)
-if odd number of observations, include the median in both halves
-if even number of observations, do not include the median in either half
- Median of lower half = Q1
- Median of upper half = Q3
-skewed data: use IQR for variability
-symmetric/normal data: use standard deviation
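The quartile steps on this card can be sketched as below. The function names `quartiles` and `iqr` are illustrative, and the code follows this card's median-of-halves convention (the median goes in both halves when the count is odd). Other textbooks and software use slightly different conventions and may give different Q1 and Q3 values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)                # arrange smallest to largest
    half = len(data) // 2
    if len(data) % 2 == 1:
        lower = data[:half + 1]          # odd count: median joins both halves
        upper = data[half:]
    else:
        lower = data[:half]              # even count: clean split
        upper = data[half:]
    return median(lower), median(data), median(upper)

def iqr(values):
    """IQR = Q3 - Q1, the spread of the middle 50% of the data."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

odd_data = [1, 3, 5, 7, 9, 11, 13]       # hypothetical values
even_data = [1, 2, 3, 4, 5, 6, 7, 8]

print(quartiles(odd_data), iqr(odd_data))
print(quartiles(even_data), iqr(even_data))
```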
Five-number summary
-MIN
-Q1
-MEDIAN (Q2)
-Q3
-MAX
Outliers
-values that significantly differ from the rest of the data, either being much higher or much lower
-these values can skew the results or make interpretations less accurate
-any data point below the lower fence or above the upper fence is an outlier
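This card does not say how the fences are computed. The sketch below assumes the common 1.5 × IQR convention (lower fence = Q1 - 1.5 × IQR, upper fence = Q3 + 1.5 × IQR), so treat the multiplier as an assumption rather than something stated in these notes. The function names and data are hypothetical, and Q1 and Q3 are passed in as already known.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence); k = 1.5 is an assumed convention."""
    spread = q3 - q1                       # the IQR
    return q1 - k * spread, q3 + k * spread

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

data = [1, 3, 5, 7, 9, 11, 40]             # hypothetical values
q1, q3 = 4, 10                             # quartiles of this data by the median-of-halves method

print(fences(q1, q3))
print(find_outliers(data, q1, q3))
```

With IQR = 6 the fences land at -5 and 19, so 40 is the only value flagged.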
Adjacent values
-most extreme data points that are within the upper and lower limits, but are not considered outliers
-lower adjacent value: smallest value in the data set that is at or above the lower fence
-upper adjacent value: largest value in the data set that is at or below the upper fence
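A short sketch of finding the adjacent values once the fences are known. The function name, the data, and the fence values (-5 and 19) are all hypothetical inputs chosen for illustration.

```python
def adjacent_values(values, lower_fence, upper_fence):
    """Return (lower adjacent, upper adjacent): the extremes that are not outliers."""
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside)

data = [1, 3, 5, 7, 9, 11, 40]   # 40 lies above the upper fence

print(adjacent_values(data, -5, 19))
```

The adjacent values are actual data points (here 1 and 11), not the fences themselves.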
Box plot
-graphical representation of the five-number summary; helps visualize the distribution, central tendency, and spread of data
-shows quartiles, potential outliers, and median