Descriptive Statistics(2) Flashcards
variable
variable: a population characteristic
which takes on different values for
the elements comprising the
population
define:
population
sample
parameter
statistic
population: the total set of elements (objects, persons, regions,
neighbourhoods, rivers, etc.) under examination in a particular study
sample: a subset of the elements in a population, which is used to make
inferences about certain characteristics of the population as a whole
parameter: a quantity that defines a certain characteristic of a population
- If you take the average of a POPULATION, that average is a PARAMETER
statistic: a quantity that defines a certain characteristic of a sample
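A minimal sketch of the parameter/statistic distinction, assuming a made-up population of 1,000 values (the numbers, the seed, and the sample size of 30 are illustrative assumptions, not from the cards):

```python
import random
import statistics

# Hypothetical population: 1,000 made-up values, seeded for repeatability.
random.seed(0)
population = [random.gauss(50, 10) for _ in range(1000)]

# Parameter: a quantity computed from every element of the population.
population_mean = statistics.fmean(population)

# Statistic: the same quantity computed from a sample (a subset).
sample = random.sample(population, 30)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not equal to, the population mean, which is why a statistic is used to make inferences about a parameter.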
two ways to present descriptive statistics: describe both
tables are great for summarizing large quantities of complex data, but
can be a challenge to read and interpret
-frequency tables
graphs can be used for simpler datasets, and are easily interpreted by
the reader
Difference between bar graph and histogram
-a bar graph is good for categorical (nominal/ordinal) data
-the bars in a bar graph are separated by gaps because each bar is a distinct category
-a histogram has continuous data on the horizontal axis, for interval and ratio data
-the boxplot is not used too often because it isn't nice to look at and doesn't convey much
natural break points
natural break points: values where the frequency is 0 or low
Frequency Tables: 5 important things to note when making one
- use intervals with simple bounds
- respect natural breakpoints
- the intervals must not overlap and must include all observations
- all intervals should be the same width
- select an appropriate number of classes
• this is hardest to determine
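The rules above can be sketched as a small helper that counts values into equal-width, non-overlapping classes and rejects any observation that no class covers. The function name `frequency_table`, its parameters, and the data are hypothetical, chosen only for illustration:

```python
from collections import Counter

def frequency_table(values, lower, width, n_classes):
    """Count values into equal-width, non-overlapping classes [a, b)."""
    counts = Counter()
    for v in values:
        index = int((v - lower) // width)
        if not 0 <= index < n_classes:
            # Every observation must fall inside some class.
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return [
        (lower + i * width, lower + (i + 1) * width, counts[i])
        for i in range(n_classes)
    ]

# Hypothetical observations, with simple bounds of 0, 10, 20, 30, 40.
values = [3, 7, 12, 15, 18, 21, 22, 29, 34, 38]
table = frequency_table(values, lower=0, width=10, n_classes=4)

for low, high, count in table:
    print(f"{low:>3} to <{high:<3} {count}")
```

Using half-open classes (lower bound included, upper bound excluded) is one way to guarantee the intervals never overlap.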
a histogram must have _____ breaks
equal
Line Graphs Vs Scatterplots
-a line graph is used for categorical data
-you CANNOT use the line to guess values, only to see the pattern
-a scatterplot is used for continuous interval/ratio data
The rose diagram is used for
o directional data, which has its own specific visual descriptive: the rose
diagram
-making the wedge intervals increase toward the outside makes more sense, as the wedges are bigger toward the outside
CENTRAL TENDENCY:
define each one below
midrange
mode
median
mean
o midrange: the midpoint between the largest and smallest values of a variable
in the data set
-the midrange is strongly affected by extreme values
o mode: the most common/frequent value of a variable in the data
set
-what is the mode of a data set with no repeating values? No Mode
-the midrange and mode are crude statistics, and often do not provide an accurate
measure of centrality
o median: the value of a variable that divides the observations in half
o mean: the average value of a variable in the data set
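The four measures defined above can be sketched as follows. The data set and the helper `modes` are hypothetical; `modes` returns an empty list when no value repeats, matching the "No Mode" answer on the card:

```python
import statistics
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no repeating values: no mode
    return [v for v, c in counts.items() if c == top]

data = [4, 6, 6, 9, 12, 15, 24]  # hypothetical observations

midrange = (max(data) + min(data)) / 2   # midpoint of the two extremes
mode = modes(data)                       # most frequent value(s)
median = statistics.median(data)         # middle of the ordered values
mean = statistics.fmean(data)            # sum of values / number of values

print(midrange, mode, median, round(mean, 3))
```

Changing the largest value from 24 to 240 would move the midrange and the mean but leave the mode and median untouched.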
Arithmetic Mean:
population mean symbol
sample mean symbol
population mean: μ (the Greek letter mu)
sample mean: x̄ (x-bar, an x with a horizontal line over it)
Geometric Mean
notice that the values are not evenly influenced (they are weighted differently), so we need a geometric mean
Arithmetic Vs. Geometric Mean
o the arithmetic mean is used when each data point has the same influence or
“weight” as all the other points
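o the geometric mean is used when each data point has an associated frequency,
influence, or weight attached to it, such that some data points are more important
than others (outside these notes this frequency-weighted average is usually called the
weighted mean; the geometric mean proper is the nth root of the product of n values)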
in geometric mean the f(i) stands for the …
weight of the value
Ranges
ex: $0-10000
we need to start making assumptions: first, assume that the midpoint
of each range is a suitable representative value (is it always?)
-then we can determine the geometric mean
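A minimal sketch of the calculation described above, in which each class midpoint is weighted by its frequency f(i). The notes call this the geometric mean; the formula sketched here, sum of f(i) × x(i) divided by the sum of f(i), is more commonly named the weighted mean, so the helper is called `weighted_mean`. The income classes and frequencies are hypothetical:

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(f * x for x, f in zip(values, weights)) / total_weight

# Hypothetical income classes: ((lower bound, upper bound), frequency)
classes = [((0, 10_000), 5), ((10_000, 20_000), 12), ((20_000, 30_000), 3)]

# Assumption from the card: the midpoint stands in for every value in a class.
midpoints = [(low + high) / 2 for (low, high), _ in classes]
frequencies = [f for _, f in classes]

print(weighted_mean(midpoints, frequencies))  # 14000.0
```

The midpoint assumption is weakest for classes where the values bunch up near one bound, which is the "is it always?" caveat on the card.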
Which measure of central tendency is best?
o the centrality statistic should represent the typical value of the data set
o only the mean considers all of the values in the data set; the other statistics
only rely on specific values
o if you change any value in the set, the mean will also change
o usually, the mean is considered the best because of this property, but there
are some exceptions
Times when the mean is not reflective of the typical value:
1. bimodal distributions
-the mean and median do not reflect the typical value, but the modes do
2. presence of extreme values that will strongly affect the mean
-the median and mode are best
3. skewed distributions
-the mode is most typical here, while the mean is affected by the
stretch in the data set
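The effect of an extreme value can be illustrated with a small hypothetical data set: adding one very large observation drags the mean far from the typical values while the median barely moves.

```python
import statistics

typical = [30, 32, 35, 36, 38]      # hypothetical observations
with_outlier = typical + [400]      # same data plus one extreme value

mean_before = statistics.fmean(typical)
median_before = statistics.median(typical)
mean_after = statistics.fmean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```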
How to evaluate the dispersion around the mean, median and mode? (3 ways)
1.Range
the range is the difference between the largest and smallest data point
x_max − x_min = 24 − 4 = 20
-the range only considers the 2 extreme values of the data set, which often
are not representative of the whole data set
-also, larger samples tend to have larger ranges, since they are more likely
to contain the rare or unusual members of a population
2.Percentiles & Quartiles
- recall that the median splits the data set in half and is known as the 50th
percentile (or 2nd quartile) – half of the data is above the median and half is
below
-to find the position of the 25th percentile: (n + 1) × P = (6 + 1) × 0.25 = 1.75
-the 25th percentile is the 1.75th value in the ordered data set = 5.5
-25% of the data set is below 5.5, 75% of the data set is above 5.5
3.Variance
-the variance can be described as the average of the squared differences of each
value from the mean (the sum of squared differences divided by N for a population, or by n − 1 for a sample)
-taking the square root of the variance gives the standard deviation
-strongly affected by extreme values
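The three measures above can be sketched in a few lines. The cards do not list the underlying data, so the six values below are a hypothetical set chosen to reproduce the worked numbers (range 20, 25th percentile 5.5). Resolving the fractional position 1.75 by linear interpolation is also an assumption of this sketch:

```python
import math

def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

def percentile(values, p):
    """Percentile by the (n + 1) * P position rule, with 0 < p < 1."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p          # 1-based position in the ordered data
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    below = int(position)           # whole part of the position
    fraction = position - below     # how far toward the next value
    return ordered[below - 1] + fraction * (ordered[below] - ordered[below - 1])

def variance(values, sample=True):
    """Sum of squared differences from the mean, over n - 1 (or n)."""
    mean = sum(values) / len(values)
    squared = sum((x - mean) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return squared / divisor

data = [4, 6, 9, 12, 15, 24]        # hypothetical observations

print(data_range(data))             # 20
print(percentile(data, 0.25))       # 5.5
print(percentile(data, 0.50))       # 10.5 (the median)
print(round(math.sqrt(variance(data)), 3))  # sample standard deviation
```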
Interquartile range
difference between the 1st and 3rd quartiles
_________ is a better measure of dispersion, as it omits the extreme
values
interquartile range
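A short sketch of why the interquartile range resists extreme values, using hypothetical data. It relies on `statistics.quantiles` from the Python standard library, whose `exclusive` method places quartiles at (n + 1) × P positions:

```python
import statistics

def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

data = [4, 6, 9, 12, 15, 18, 24]        # hypothetical observations
stretched = [4, 6, 9, 12, 15, 18, 240]  # same data, one extreme value

print(max(data) - min(data), iqr(data))                 # 20 12.0
print(max(stretched) - min(stretched), iqr(stretched))  # 236 12.0
```

The range jumps from 20 to 236 while the interquartile range stays at 12, because the quartiles never touch the largest value.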
Coefficient of Variation
if you want to compare the standard deviations of 2 data sets, you must
ensure that they have the same mean value (not always applicable)
-when that isn't possible, the coefficient of variation (CV = standard deviation / mean) is used to compare
-a data set with a low CV is less variable than
one with a high CV
a data set with a low CV is _____ variable than
one with a high CV
less
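A minimal sketch of the comparison, assuming the CV is computed as the sample standard deviation divided by the mean (the cards do not say whether the sample or population form is intended). Both data sets and their names are hypothetical:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

rainfall_mm = [80, 100, 120]    # hypothetical: mean 100, sd 20
temperature_c = [18, 20, 22]    # hypothetical: mean 20, sd 2

cv_rain = coefficient_of_variation(rainfall_mm)
cv_temp = coefficient_of_variation(temperature_c)

print(f"rainfall CV:    {cv_rain:.2f}")   # 0.20
print(f"temperature CV: {cv_temp:.2f}")   # 0.10
```

Because the CV is unitless, the two data sets can be compared even though their means and units differ; here the rainfall values are relatively more variable.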
Skewness
Positive
none
negative
the skewness of a data set describes how symmetrical the values are around
the mean, or the difference between the mean and median
-positive skewness: the mean is greater than the median, and the data set is asymmetrical
-no skewness: the mean is equal to the median, and the data set is symmetrical
-negative skewness: the mean is less than the median, and the data set is asymmetrical
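The three cases can be checked numerically. This sketch uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common definition; the cards do not say which formula the course uses. The data sets are hypothetical:

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]      # hypothetical
symmetric = [1, 2, 3, 4, 5]                   # hypothetical
left_tailed = [-x for x in right_tailed]      # mirror image

for name, data in [("positive", right_tailed),
                   ("none", symmetric),
                   ("negative", left_tailed)]:
    mean = statistics.fmean(data)
    median = statistics.median(data)
    print(f"{name:>8}: mean={mean:.2f} median={median} "
          f"skewness={skewness(data):.2f}")
```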
kurtosis
the kurtosis of a data set describes how peaked the data set is
-platykurtic: the data set is relatively flat and spread out (kurtosis < 3)
-mesokurtic: the data set is relatively normally distributed (kurtosis = 3)
-leptokurtic: the data set is narrow and peaked (kurtosis > 3)
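A sketch of one way to compute kurtosis on the scale used above, where a normal distribution scores 3: the fourth central moment divided by the square of the second. The choice of this formula is an assumption (some software reports "excess" kurtosis, which subtracts 3), and both data sets are hypothetical:

```python
import statistics

def kurtosis(values):
    """Moment ratio m4 / m2 ** 2 (a normal distribution gives 3)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # evenly spread values
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]      # tight centre, two far values

print(round(kurtosis(flat), 2))    # below 3: flat and spread out
print(round(kurtosis(peaked), 2))  # above 3: narrow and peaked
```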
Standardization
o it can be difficult to compare multiple data sets that each have different
means and standard deviations – to do this we have to standardize the data
o standardization translates the data set so that it has a mean of 0 and a
standard deviation of 1 – this allows you to compare multiple standardized
data sets easily
- the standardized values are known as z-scores: each z-score, z = (x − mean) / standard deviation, describes how
many standard deviations the value is from the mean
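A minimal sketch of standardization, assuming the population standard deviation is used so that the standardized values have a standard deviation of exactly 1 (a course may use the sample form instead). The data are hypothetical:

```python
import statistics

def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)   # population form (an assumption)
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / sd for x in values]

data = [4, 6, 9, 12, 15, 24]         # hypothetical observations
z = z_scores(data)

print([round(value, 2) for value in z])
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```

A z-score of 1.5 in one data set and 1.5 in another mean the same thing, 1.5 standard deviations above that set's own mean, which is what makes standardized data sets comparable.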