Basics Flashcards
(127 cards)
Summation / Sigma Notation
This is the sigma symbol: ∑
It tells us that we are summing something.
n is the summation index - when evaluating the expression we substitute each value of the index in turn and add the results.
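As a minimal sketch, summing i² for i = 1 to 4 in Python mirrors what sigma notation expresses (the numbers here are just an illustration):

```python
# Sigma notation sketch: sum of i**2 for i = 1..4, i.e. 1 + 4 + 9 + 16 = 30
total = sum(i**2 for i in range(1, 5))
print(total)  # 30
```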
Frequency tables - lists - dot plots
Used to represent a single variable.
A list is just a list of variable values
A Frequency Table is a table showing each value and how often it occurs
A dot plot is a visual frequency table, with the variable value on the x and the frequency on the y.
These are all ways of representing the same info.
Once the data is organized we can start to analyze it with summary stats etc.
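A frequency table can be sketched in Python with `collections.Counter`; the data values here are hypothetical:

```python
from collections import Counter

data = [2, 3, 3, 5, 2, 3]   # hypothetical list of variable values
freq = Counter(data)        # frequency table: value -> how often it occurs
for value in sorted(freq):
    print(value, freq[value])
```

Printing each value with its count is the list/frequency-table view; plotting the same counts as stacked dots would give the dot plot.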
Histogram
Used to represent a single variable
Like a bar chart but both the x and y axis are numerical.
x-axis = intervals
y-axis = absolute frequency of each interval.
The bars will be touching to show that one interval begins where the other ends.
Instead of just plotting the frequency of each discrete value, like a frequency table or dot plot, a histogram arranges the data into categories and then shows how many values fall within each category. The categories are often called buckets or bins.
Bins should not overlap
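Binning can be sketched directly in Python; the data and bin edges below are hypothetical, with each bin covering [edge, next edge):

```python
data = [1.2, 3.7, 2.5, 8.1, 4.4, 6.3, 2.9]   # hypothetical values
edges = [0, 2, 4, 6, 8, 10]                  # non-overlapping bins: [0,2), [2,4), ...
counts = [0] * (len(edges) - 1)              # absolute frequency per bin

for x in data:
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            counts[i] += 1
            break

print(counts)  # [1, 3, 1, 1, 1]
```

Making each bin half-open is one way to guarantee the bins don't overlap: a value on a boundary lands in exactly one bin.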
Descriptive statistics
Ways of describing data without just providing the raw data. It’s about describing the data with a smaller set of numbers.
This would include things like summary statistics.
Inferential Statistics
Ways of gaining insight from the data set and figuring out what the data means. How can we use the data to understand what the population value might be.
The key to inferential statistics is understanding that samples do not always accurately reflect the population they came from.
A large part of inferential statistics is quantifying our uncertainty about a population by looking at a smaller sample.
Average/ Central Tendency
Average = Typical or middle value of a data set. The “central tendency” of the data
Common types:
Mean
Median
Mode
The ‘best’ measure of central tendency will depend on which measure best represents the actual data and how it is skewed (or not).
All measures should be used in combination to understand the data set.
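Python's standard `statistics` module computes all three measures; the data set here is hypothetical and deliberately skewed by one large value:

```python
import statistics

data = [1, 2, 2, 3, 14]            # hypothetical, skewed by the 14
print(statistics.mean(data))       # 4.4 -- pulled up by the outlier
print(statistics.median(data))     # 2
print(statistics.mode(data))       # 2
```

Note how the mean lands above every value but the outlier, while the median and mode stay with the bulk of the data.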
Median
The middle number of the data set when the set is placed in numerical order. If there is an even number of values, you take the mean of the two middle numbers.
The median is useful if there are outliers that will skew the mean and make it misleading.
Mode
The most common number in the data set. If there is no most common number then there is no mode.
Typically the least used measure of central tendency
The mode is useful if there are outliers that skew the mean and if there is a single number that shows up a lot.
Location of central tendency and skewness
In symmetrical distributions the mean, median and mode are identical or very close.
In left skewed distributions the mean is typically to the left of the median, which is to the left of the mode.
In right skewed distributions the mean is typically to the right of the median, which is to the right of the mode.
Left and right skew
A left skew means the tail/ outliers are to the left
A right skew means the tail/ outliers are on the right.
Interquartile Range (IQR)
The IQR is a measure of how spread out the data is.
(IQR) is the distance between the first and third quartile marks (25th to 75th percentile).
The IQR is a measurement of the variability about the median.
IQR tells us the range of the middle half of the data.
To find the IQR:
1. Find the median of the data set
2. Find the median of each set of numbers on either side of the median number. The IQR is the difference between these two numbers.
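The two steps above can be sketched in Python. This follows the card's method (medians of the halves on either side of the overall median, excluding the median itself when n is odd); the data is hypothetical:

```python
import statistics

def iqr(data):
    """IQR via the median-of-halves method described above."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]                  # values below the overall median
    upper = s[half + len(s) % 2:]     # values above it (middle value excluded if n is odd)
    return statistics.median(upper) - statistics.median(lower)

print(iqr([1, 3, 4, 6, 7, 9, 12]))   # 6: Q3 = 9, Q1 = 3
```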
Outliers
The definition of what is reasonably an outlier is subject to some interpretation based on the specific qualities of the data set.
Common definition:
An outlier is any number that is more than 1.5x the interquartile range below Q1 or above Q3
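The 1.5 × IQR rule can be sketched as a small Python function; the data and quartile values below are hypothetical:

```python
def outliers(data, q1, q3):
    """Flag values beyond 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

# Hypothetical data with Q1 = 3, Q3 = 9: fences are 3 - 9 = -6 and 9 + 9 = 18
print(outliers([1, 3, 4, 6, 7, 9, 30], 3, 9))  # [30]
```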
Sample Mean
Calculated the same way as the population mean: add up all the values and divide by the number of values, x̄ = ∑x / n.
Measures of Variability - Univariate
Variance
Standard Deviation
Coefficient of Variation
Sample Variance
Variance is a measure of the spread or dispersion of a set of data points around their mean. It quantifies how much the individual data points deviate from the average.
Sample variance is generally a pretty good statistic in terms of approximating the true variance of the population.
A better approximation of the population parameter can usually be gained by dividing by n-1.
This approximation is AKA ‘The unbiased sample variance’.
Dividing by just n will tend to underestimate the population variance.
Dividing by n is fine if you just want the variance/SD of the sample itself.
Written as s² (s squared).
s² = ∑(x - x̄)² / (n - 1)
Using n-1 instead of n is AKA Bessel's correction.
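The difference between dividing by n and by n - 1 can be seen directly in Python; the sample below is hypothetical, and the `statistics` module provides both versions:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(sample)
mean = sum(sample) / n
ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations

print(ss / n)        # 4.0        -- divides by n (biased low as a population estimate)
print(ss / (n - 1))  # ~4.571     -- divides by n - 1 (Bessel's correction)

print(statistics.pvariance(sample))  # divides by n
print(statistics.variance(sample))   # divides by n - 1
```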
Standard Deviation
Measure of the dispersion or spread of a set of data points around their mean. It is closely related to variance but is expressed in the same units as the original data, making it easier to interpret and compare.
The standard deviation is the square root of:
The population variance
OR
The unbiased sample variance (s², the version that divides by n-1)
The square root of the sample variance (AKA the sample standard deviation) will not be an unbiased approximation of the population standard deviation.
This is because the square root function is non-linear.
SD is Written as:
s
std(x) - SD of random variable x
σ (lowercase sigma)
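The `statistics` module exposes both square roots directly; the sample here is hypothetical:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
print(statistics.pstdev(sample))     # population SD: sqrt of the divide-by-n variance -> 2.0
print(statistics.stdev(sample))      # sample SD: sqrt of the divide-by-(n-1) variance
```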
Variance - Interpretation
Variance is always non-negative
Interpreting variance involves considering the magnitude of the variance value and its relationship to the data set.
Consider:
Magnitude - Higher variance means more dispersion from the mean
Units - Variance is expressed in squared units of the original data. Restore the original units by converting to standard deviation
Comparison - If one data set has a significantly higher variance than another, it implies that the observations in the first data set are more widely scattered.
Outliers - Variance is sensitive to outliers, they can inflate the variance, making it a less reliable measure of dispersion
Limitations:
Variance does not tell us the direction of variations from the mean.
It treats positive and negative differences equally.
Not robust in non-normal - heavily skewed data sets.
Coefficient of Variation(CV)
AKA relative standard deviation.
Calculated as the standard deviation divided by the mean. It's just the standard deviation relative to the mean.
There are separate formulas for population and sample data for this measurement as well.
CV is used to compare the variation of two different data sets.
It will return a number that is not in units and is directly comparable across data sets.
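A small Python sketch of the sample CV; the two data sets below are hypothetical and in different units, which is exactly where the CV helps:

```python
import statistics

def cv(data):
    """Coefficient of variation: sample SD divided by the mean (unitless)."""
    return statistics.stdev(data) / statistics.mean(data)

heights_cm = [160, 170, 175, 180]   # hypothetical data in centimeters
weights_kg = [55, 70, 80, 95]       # hypothetical data in kilograms
print(cv(heights_cm))               # ~0.05
print(cv(weights_kg))               # ~0.22 -- relatively more spread out
```

Even though the raw SDs are in different units, the CVs are unitless and directly comparable.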
Standard Deviation - Interpretation
Easier to interpret than variance because it is expressed in the original units of the data.
Quantifies the typical amount of variation or “typical distance” of data points from the average.
Consider:
Magnitude - Higher SD means data points are spread farther from the mean.
Units - SD is expressed in the same units as the original data. This makes interpretation and comparison easier.
Range - SD provides a useful range around the mean (68-95-99.7). It helps us visualize where the data is falling using a single number.
Comparison - Comparing the standard deviations of different data sets allows you to assess their relative spread.
Outliers - SD is sensitive to outliers. Outliers, which are extreme values, can have a significant impact on the standard deviation.
Limitations - SD assumes a normal or symmetrical distribution. If the data has a heavy skew, other measures might be more appropriate.
Mean vs. Median as Central Tendency
The measures work in pairs:
More symmetrical Data:
Mean = central tendency
Standard Deviation = Spread
More Skewed Data:
Median = central tendency
IQR = Spread
Outlier values will move the mean quite a lot, but they don't affect the median much: the median depends on the position of the middle value(s), not on how extreme the outlying values are.
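The robustness of the median can be demonstrated in Python with a hypothetical data set and one extreme value:

```python
import statistics

data = [10, 12, 13, 14, 15]
with_outlier = data + [200]          # add one extreme value

print(statistics.mean(data), statistics.mean(with_outlier))      # 12.8 -> 44.0
print(statistics.median(data), statistics.median(with_outlier))  # 13   -> 13.5
```

The single outlier more than triples the mean but barely moves the median.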
Z-scores
One of the most common measures in statistics.
A Z-score tells you how many standard deviations away from the population mean a given data point is.
This helps you tell how usual or unusual a data point is.
This can be useful for comparing data points from different distributions. The scales are different but the relative position to the mean can still be compared.
To calculate the Z-score for a data point x:
(x-µ) / standard deviation
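The formula translates to a one-line Python function; the mean and SD below are hypothetical:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical population with mean 70 and SD 10
print(z_score(85, 70, 10))   # 1.5  -- 1.5 SDs above the mean
print(z_score(55, 70, 10))   # -1.5 -- 1.5 SDs below the mean
```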
Z-scores - Interpretation
The Z-score of a data point tells you how many standard deviations from the mean the point is.
A negative Z-score indicates that the data point is below the mean, a positive Z-score indicates that the data point is above the mean
A z-score of 0 means the data point is equal to the mean.
A z-score of 1 means the data point is one standard deviation above the mean (and a z-score of -1 means one standard deviation below).
A z-score of 2 indicates it is two standard deviations away, and so on.
Typically, data points with z-scores greater than 3 or less than -3 are considered extreme outliers
You can use a table to find the percentile of a data point given its z-score.
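In Python, `statistics.NormalDist` can stand in for a z-table: the standard normal CDF gives the fraction of values at or below a given z-score.

```python
from statistics import NormalDist

z = 1.0
percentile = NormalDist().cdf(z)   # fraction of a normal population at or below z
print(round(percentile, 4))        # ~0.8413, matching the z-table entry for z = 1.00
```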
Empirical Rule and Normal Distributions
The Empirical Rule is AKA the 68-95-99.7 Rule: in a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Marginal Distribution
The distribution formed by the totals of a single variable in a two way table. This data can be represented as numbers or as percentages.
Look at the margins of the table.