Stats Flashcards
(37 cards)
Nominal data
are classified into mutually exclusive groups or categories and lack intrinsic order. A zoning classification, social security number, and sex are examples of nominal data. The label of the categories does not matter and should not imply any order. So, even if one category might be labeled as 1 and the other as 2, those labels can be switched.
Ordinal data
are ordered categories that imply a ranking of the observations. Even though ordinal data may be given numerical values, such as 1, 2, 3, 4, the values themselves carry no meaning beyond their rank. So, even though one might be tempted to infer that 4 is twice 2, this is not correct. Examples of ordinal data are letter grades, suitability for development, and response scales on a survey (e.g., 1 through 5).
Interval data
are data with an ordered relationship in which the difference between values has a meaningful interpretation, but the zero point is arbitrary, so ratios of values do not. The typical example of interval data is temperature, where the difference between 40 and 30 degrees is the same as between 30 and 20 degrees, but 40 degrees is not twice as warm as 20 degrees.
Ratio data
are the gold standard of measurement: the scale has a true zero, so both absolute and relative differences have a meaning. The classic example of ratio data is a distance measure, where the difference between 40 and 30 miles is the same as the difference between 30 and 20 miles, and in addition, 40 miles is twice as far as 20 miles.
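The contrast between interval and ratio scales can be checked numerically. The sketch below assumes the temperatures in the interval card are in degrees Fahrenheit (the card does not say) and converts them to Celsius; the function and variable names are illustrative only.

```python
def fahrenheit_to_celsius(f):
    """Standard Fahrenheit-to-Celsius conversion."""
    return (f - 32) * 5 / 9

KM_PER_MILE = 1.609344  # exact definition of the international mile

def miles_to_km(miles):
    return miles * KM_PER_MILE

# Interval scale: equal differences stay equal after a change of units...
diff_high = fahrenheit_to_celsius(40) - fahrenheit_to_celsius(30)
diff_low = fahrenheit_to_celsius(30) - fahrenheit_to_celsius(20)

# ...but ratios do not, because the zero point is arbitrary.
ratio_f = 40 / 20
ratio_c = fahrenheit_to_celsius(40) / fahrenheit_to_celsius(20)

# Ratio scale: the ratio is the same in miles and in kilometres.
ratio_miles = 40 / 20
ratio_km = miles_to_km(40) / miles_to_km(20)

print(f"temperature differences in C: {diff_high:.3f} and {diff_low:.3f}")
print(f"temperature ratio: {ratio_f:.2f} in F, {ratio_c:.2f} in C")
print(f"distance ratio: {ratio_miles:.2f} in miles, {ratio_km:.2f} in km")
```

The temperature ratio changes with the unit, which is why "twice as warm" is not a meaningful statement on an interval scale, while the distance ratio does not.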
Continuous variables
can take an infinite number of values, both positive and negative, and with as fine a degree of precision as desired. Most measurements in the physical sciences yield continuous variables.
Discrete variables
can only take on a finite or countable number of distinct values. An example is a count of events, such as the number of accidents per month. Such counts cannot be negative and only take on whole-number values, such as 0, 1, 28, or 211.
Binary or dichotomous variables
only take on two values, typically coded as 0 and 1.
Descriptive Statistics
describe the characteristics of the distribution of values in a population or in a sample. For example, a descriptive statistic such as the mean could be applied to the age distribution in the population of AICP exam takers, providing a summary measure of central tendency (e.g., “on average, AICP test takers in 2018 are 30 years old”).
Inferential Statistics
use probability theory to determine characteristics of a population based on observations made on a sample from that population. We infer things about the population based on what is observed in the sample. For example, we could take a sample of 25 test takers and use their average age to say something about the mean age of all the test takers.
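As a rough sketch of the 25-test-taker example, the code below computes a sample mean and a 95% confidence interval for the population mean. The ages are invented for illustration, and the interval assumes a simple random sample from a roughly normal population; a confidence interval is one common inferential tool, not the only one.

```python
import math
import statistics

# Hypothetical ages of 25 sampled test takers (invented for illustration).
sample_ages = [24, 25, 26, 26, 27, 27, 28, 28, 28, 29, 29, 30, 30,
               30, 31, 31, 32, 32, 33, 33, 34, 35, 36, 38, 41]

n = len(sample_ages)
sample_mean = statistics.mean(sample_ages)
sample_sd = statistics.stdev(sample_ages)   # sample SD, divides by n - 1
standard_error = sample_sd / math.sqrt(n)

# Two-sided 95% critical value of Student's t with n - 1 = 24 degrees of freedom.
T_CRIT_95_DF24 = 2.064

ci_low = sample_mean - T_CRIT_95_DF24 * standard_error
ci_high = sample_mean + T_CRIT_95_DF24 * standard_error

print(f"sample mean: {sample_mean:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The sample mean is a descriptive statistic of the 25 observations; the interval is the inferential step, a statement about the mean age of all test takers.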
Distribution
is the overall shape of all observed data. It can be listed as an ordered table, or graphically represented by a histogram or density plot. A histogram groups observations in bins represented as a bar chart. A density plot is a smooth curve.
Range
is the difference between the largest and the smallest value.
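The distribution and range cards can be illustrated together. This minimal sketch groups invented observations into bins of width 10, prints a text histogram, and computes the range; the data and the bin width are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical commute times in minutes (invented for illustration).
values = [12, 15, 18, 21, 22, 25, 27, 31, 34, 48]
BIN_WIDTH = 10

# Assign each observation to the lower edge of its bin: 12 -> 10, 25 -> 20, ...
bin_counts = Counter((v // BIN_WIDTH) * BIN_WIDTH for v in values)

# Range: largest value minus smallest value.
data_range = max(values) - min(values)

# Text histogram: one '#' per observation in each bin.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    upper = lower + BIN_WIDTH - 1
    print(f"{lower:>3}-{upper:<3} {'#' * bin_counts[lower]}")
print("range:", data_range)
```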
Normal or Gaussian distribution
is also referred to as the bell curve. This distribution is symmetric and has the additional property that the spread around the mean can be related to the proportion of observations. More specifically, about 95% of the observations that follow a normal distribution are within two standard deviations of the mean.
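The two-standard-deviation rule can be checked with the `NormalDist` class from Python's standard `statistics` module (Python 3.8 or later). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = share_within(1)     # roughly 0.68
within_two = share_within(2)     # roughly 0.954
within_three = share_within(3)   # roughly 0.997

for k, share in [(1, within_one), (2, within_two), (3, within_three)]:
    print(f"within {k} SD: {share:.4f}")
```

Two standard deviations cover slightly more than 95% of observations; exactly 95% corresponds to about 1.96 standard deviations.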
Symmetric distribution
is one where an equal number of observations are below and above the mean (e.g., this is the case for the normal distribution).
Asymmetric distribution
is one where there are either more observations below the mean or more above the mean. It is also called skewed.
Skewed to the right
occurs when the bulk of the values lie below the mean and a long tail extends toward large values. This tends to happen when the distribution is dominated by a few very large values (outliers), which pull the mean upward.
Skewed to the left
occurs when a few very small values (such as zero) pull the mean to the left, so that the bulk of the values lie above the mean.
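A quick numeric check of skew direction: a few extreme values pull the mean toward the tail, while the median stays with the bulk of the data. The data sets and the helper `share_below_mean` are made up for this sketch.

```python
import statistics

def share_below_mean(values):
    """Fraction of observations that fall strictly below the mean."""
    mean = statistics.mean(values)
    return sum(v < mean for v in values) / len(values)

# One very large value creates a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
# Mirror image of the same data: one very small value, long left tail.
left_skewed = [50 - v for v in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "share below mean:", share_below_mean(data))
```

In the right-skewed set the mean sits above the median and most observations fall below the mean; the left-skewed set shows the reverse.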
Central tendency
is a typical or representative value for the distribution of observed values. There are several ways to measure central tendency, including mean, median, and mode. The central tendency can be applied to the population as a whole, or to a sample from the population. In a descriptive sense, it can be applied to any collection of data.
Mean
is the average of a distribution. It is computed by adding up the values and dividing by the number of observations.
Weighted mean
is used when greater importance is placed on specific entries or when representative values stand in for groups of observations. For example, when computing the mean income across a number of counties, the value for each county could be multiplied by the number of people in the county and the total divided by the combined population, yielding a population-weighted mean. The mean is appropriate for interval and ratio scaled data, but not for ordinal or nominal data.
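The county example can be sketched as follows. The incomes and populations are invented; the point is only the difference between the simple and the population-weighted mean.

```python
# Hypothetical counties: (mean income in dollars, population). Figures are invented.
counties = [(50_000, 10_000), (60_000, 30_000), (80_000, 60_000)]

# Simple mean: every county counts equally, regardless of size.
simple_mean = sum(income for income, _ in counties) / len(counties)

# Population-weighted mean: each county's income is weighted by its population.
total_population = sum(pop for _, pop in counties)
weighted_mean = sum(income * pop for income, pop in counties) / total_population

print(f"simple mean:   {simple_mean:,.2f}")
print(f"weighted mean: {weighted_mean:,.2f}")
```

Because the largest county also has the highest income here, the weighted mean is higher than the simple mean.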
Median
is the middle value of a ranked distribution; with an even number of observations, it is commonly taken as the average of the two middle values. The median is suitable for ordinal data, where the mean is not, and it can also be applied to interval and ratio scale data once the values are ranked.
Mode
is the most frequent value in a distribution. There can be more than one mode for a distribution. For example, the modes of [1, 2, 3, 3, 5, 6, 7, 7] are 3 and 7. The mode is the only measure of central tendency that can be used for nominal data, but it can also be applied to ordinal, interval, and ratio scale data.
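The three measures of central tendency can be compared on the list used in the mode card, using Python's standard `statistics` module (`multimode` requires Python 3.8 or later).

```python
import statistics

values = [1, 2, 3, 3, 5, 6, 7, 7]

mean = statistics.mean(values)         # sum 34 divided by 8 observations
median = statistics.median(values)     # average of the two middle values, 3 and 5
modes = statistics.multimode(values)   # every value tied for most frequent

print("mean:", mean)
print("median:", median)
print("modes:", modes)
```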
Standard deviation
is the square root of the variance. The standard deviation is in the same units as the original variable and is therefore often preferred to the variance.
Variance
is the average squared deviation from the mean. A larger variance means a greater spread around the mean (a flatter distribution); a smaller variance means a narrower spread (a spikier distribution).
Coefficient of Variation
measures the relative dispersion around the mean by dividing the standard deviation by the mean.
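The three dispersion measures above build on one another, which a short calculation makes concrete. This sketch uses invented data and the population form of the variance (dividing by n, matching "average squared deviation"); a sample variance would usually divide by n - 1 instead.

```python
import math

# Hypothetical observations (invented for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Variance: average of the squared deviations from the mean.
variance = sum((v - mean) ** 2 for v in values) / n

# Standard deviation: square root of the variance, in the original units.
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation relative to the mean (unitless).
coeff_of_variation = std_dev / mean

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)
print("coefficient of variation:", coeff_of_variation)
```

The coefficient of variation is only meaningful for ratio data with a nonzero mean, since it depends on where zero lies.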