Ch.14, Descriptive Statistics Flashcards
Define descriptive stats
Describe data in ways that give us a better idea of their characteristics; a descriptive statistic is a number that summarizes a set of data
NOT a correlation statistic (correlations are inferential)
What is the simplest measure of dispersion?
Range: maximum minus minimum
What are data matrices?
Putting data into a grid: a matrix
Opportunity to examine all data in one place
Histograms
graphical display of values where each bar indicates the frequency of the range or value
LIMITATIONS: The more “accessible” a data set is, the less information/less complexity you’re conveying
Advantages: identifies mode, helps to identify potential outliers
Binning
Binning in data mining is a data preprocessing technique that involves grouping data into smaller, more manageable categories or bins. It can be used for both numerical and categorical data and can help improve the efficiency and accuracy of data analysis.
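A minimal sketch of binning numerical scores into equal-width bins, which is also how the bars of a histogram get their counts. The function name `bin_counts`, the bin width of 10, and the scores are assumptions made up for illustration.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical test scores, not taken from the chapter.
scores = [52, 58, 61, 67, 67, 73, 75, 78, 84, 91]
histogram = bin_counts(scores, 10)

# Print a rough text histogram: one '#' per score in the bin.
for start, count in histogram.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

The tallest bars (here the 60s and 70s) show where scores pile up; wider bins are easier to read but hide more detail.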
Stem-Plots
both a graph and a chart that displays each score in a data set so that it visually represents the distribution/ frequency of scores
Stem: leading numbers
Leaves: trailing numbers
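A rough sketch of building a stem-plot, assuming two-digit, non-negative whole-number scores so that the tens digit is the stem and the ones digit is the leaf. The function name `stem_plot` and the scores are illustrative assumptions.

```python
from collections import defaultdict

def stem_plot(values):
    """Group each score's trailing digit (leaf) under its leading digit (stem)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Hypothetical test scores, not taken from the chapter.
scores = [52, 58, 61, 67, 67, 73, 75, 78, 84, 91]
plot = stem_plot(scores)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, every individual score can still be read back from the plot.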
What does sigma mean and what is its symbol?
Σ = sum of all scores
What does x̄ (x-bar) mean?
mean
Mean, advantages/disadvantages
Advantages: very common, takes into account every entry of a data set
Disadvantages: extremely influenced by outliers, knowledge about individual cases is completely lost with average
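A small sketch of how one outlier pulls the mean, assuming made-up salary figures (in thousands) and a helper named `mean` written just for this example.

```python
def mean(values):
    """Sum of all scores divided by the number of scores."""
    return sum(values) / len(values)

# Hypothetical salaries in thousands; the last entry is the outlier.
salaries = [40, 45, 50, 55, 60]
with_outlier = salaries + [1000]

print(mean(salaries))       # mean of the five typical salaries
print(mean(with_outlier))   # mean after adding one extreme salary
```

The second mean is far above every typical salary, which is the "influenced by outliers" disadvantage in action.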
Population vs. sample mean
Population Mean (μ, the Greek letter mu): mean of the entire population, i.e., whatever you’re trying to make a generalization about; CAN NEVER REALLY KNOW THIS (on charts)
Sample Mean: mean of your sample (on charts)
Median, advantages/disadvantages
Middle (from lowest to highest)
At the median half the data set is below that number and half the data set is above that number
Position of Median = (number of entries + 1) / 2
Odd Number of Entries: median is the middle data entry
Even Number of Entries: median is the mean of the two middle data entries
Advantages: not influenced by outliers; a reasonable estimate of what most people mean by the center of a distribution (e.g., a “reasonable” average salary in Canada that is not pulled up by billionaires)
Disadvantages: ignoring extreme values may not be appropriate in all cases
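A minimal sketch of the odd/even median rule, assuming a helper named `median` and made-up numbers. Python's built-in `statistics.median` does the same job; this version just spells out the steps.

```python
def median(values):
    """Middle entry of the sorted data, or the mean of the two middle entries."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle entries

odd_example = median([3, 1, 7])
even_example = median([3, 1, 7, 5])

# Hypothetical salaries in thousands; the 1000 barely moves the median.
salary_median = median([40, 45, 50, 55, 60, 1000])
print(odd_example, even_example, salary_median)
```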
Mode, advantages/disadvantages
Least used measure of central tendency; the one to use for NOMINAL/CATEGORICAL variables
Most frequently occurring; if there is no entry that is repeated there is no mode
Data can be bimodal (two modes) or multi-modal (3 or more modes)
Elections use this often to show which party was chosen most; also used to ask what the most popular dish at a cafe is
Advantages: most frequently obtained score which can be useful, not influenced by extreme scores and works when outliers aren’t relevant
Disadvantages: may not represent a large proportion of the scores; other answers may be nearly as frequent, and the mode completely ignores them
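A short sketch of finding the mode(s), assuming a non-empty data set and a helper named `modes` invented for this example. It returns every value tied for most frequent, and an empty list when nothing repeats.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; empty list if no entry is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no entry repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical cafe orders (categorical data) and numeric scores.
one_mode = modes(["soup", "pasta", "soup", "salad"])
bimodal = modes([1, 1, 2, 2, 3])
no_mode = modes([1, 2, 3])
print(one_mode, bimodal, no_mode)
```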
Advantages/disadvantages of range
Range can never be negative: it is always maximum minus minimum, so it is zero or positive
Advantages: simple; covers the full spread of the data from lowest to highest value
Disadvantages: sensitive to small sample sizes (a small sample of a broader population usually won’t capture the full range, so it isn’t a representative range); doesn’t tell you anything about where the bulk of the values are; affected by outliers
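A tiny sketch of the range and its sensitivity to one extreme value, assuming a helper named `value_range` and made-up scores.

```python
def value_range(values):
    """Maximum minus minimum; never negative."""
    return max(values) - min(values)

# Hypothetical scores; the second list adds a single extreme value.
typical = [52, 58, 61, 67]
with_outlier = typical + [150]

print(value_range(typical))
print(value_range(with_outlier))
```

One extra score changes the range a lot, even though the bulk of the values has not moved.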
What are interquartiles?
INTERQUARTILES SHOW DISPERSION AROUND MEDIAN
What is a quartile?
Quartiles: positions in a range of values representing multiples of 25%
What are the first and third quartiles?
First Quartile: 25% of scores fall below the first quartile, 75% above (Q1: splitting bottom half in half)
Third Quartile: 75% of scores fall below the third quartile, 25% fall above (Q3: splitting top half in half)
What does the interquartile range measure?
Measure of distance between the first and third quartile (Q3 minus Q1): a special kind of range that includes just the middle 50% of values, so it TELLS YOU WHERE THE MIDDLE HALF OF THE DATA IS; e.g., (25%, 30%, 30%, 25%) Interquartile Range would be 30%
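A sketch of quartiles found by splitting the sorted data at the median and taking the median of each half, with the distance from Q1 to Q3 as the interquartile range. This split-the-halves rule is one common convention and an assumption here; software packages often interpolate and can give slightly different values. The names `median` and `quartiles` and the data are illustrative.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # excludes the middle entry when n is odd
    upper_half = ordered[(n + 1) // 2 :]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical data set.
data = [1, 3, 5, 7, 9, 11, 13, 15]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1  # spread of the middle 50% of values
print(q1, q2, q3, iqr)
```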
Deviation Calculation
DEVIATION IS ONE SCORE’S DISTANCE FROM THE MEAN: the difference between each score and the mean of the data set; STANDARD DEVIATION IS CALCULATED from these deviations
*Deviation shows dispersion around the MEAN rather than the median
Deviation of x (any given score) = x (that score) − x̄ (the average)
(1) First, find the mean
(2) Then determine the deviation with the above formula
(3) Deviation scores all together should always sum to zero
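The three steps above can be sketched as follows, using made-up scores. The check at the end shows the deviations summing to zero (allowing for tiny floating-point error with less tidy numbers).

```python
# Hypothetical scores, not taken from the chapter.
scores = [2, 4, 4, 6, 9]

# (1) Find the mean.
mean_score = sum(scores) / len(scores)

# (2) Deviation of each score = score minus mean.
deviations = [x - mean_score for x in scores]

# (3) Deviations sum to zero.
total_deviation = sum(deviations)
print(mean_score, deviations, total_deviation)
```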
Variance calculation
Isn’t usually reported on its own because it is in squared units and hard to interpret; standard deviation is much more useful
Single number representing the average amount of variation in a set of scores
(1) Find mean of the data set
(2) Find deviation of each entry
(3) Square each deviation
(4) Add to get sum of squares
(5) Divide by n-1 to get the sample variance
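A sketch of the five steps above, assuming a helper named `sample_variance` and made-up scores. The result is compared with Python's built-in `statistics.variance`, which also divides by n-1.

```python
import math
import statistics

def sample_variance(values):
    n = len(values)
    mean_value = sum(values) / n                    # (1) mean
    deviations = [x - mean_value for x in values]   # (2) deviation of each entry
    squared = [d ** 2 for d in deviations]          # (3) square each deviation
    sum_of_squares = sum(squared)                   # (4) sum of squares
    return sum_of_squares / (n - 1)                 # (5) divide by n-1

# Hypothetical scores, not taken from the chapter.
scores = [2, 4, 4, 6, 9]
variance = sample_variance(scores)
print(variance)
print(math.isclose(variance, statistics.variance(scores)))
```

Squaring is what stops the positive and negative deviations from cancelling out.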
Standard Deviation Calculation
Roughly the typical (average) distance of scores from the mean
Measure of the spread of scores out from the mean of sample
Calculate Variance then take square root of it
(1) Find mean of the data set
(2) Find deviation of each entry
(3) Square each deviation
(4) Add to get sum of squares
(5) Divide by n-1 to get the sample variance
(6) Find the square root of the variance
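A compact sketch of the six steps, assuming a helper named `sample_std_dev` and made-up scores. Python's `statistics.stdev` computes the same sample standard deviation and is used here only as a cross-check.

```python
import math
import statistics

def sample_std_dev(values):
    n = len(values)
    mean_value = sum(values) / n
    sum_of_squares = sum((x - mean_value) ** 2 for x in values)  # steps (2)-(4)
    variance = sum_of_squares / (n - 1)                          # step (5)
    return math.sqrt(variance)                                   # step (6)

# Hypothetical scores, not taken from the chapter.
scores = [2, 4, 4, 6, 9]
std_dev = sample_std_dev(scores)
print(round(std_dev, 3))
print(math.isclose(std_dev, statistics.stdev(scores)))
```

Taking the square root puts the result back in the same units as the original scores.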
What measures of central tendency do NOIR use?
Nominal: only use the mode
Ordinal: use mode and median (The mean cannot be computed with ordinal data. Finding the mean requires you to perform arithmetic operations like addition and division on the values in the data set. Since the differences between adjacent scores are unknown with ordinal data, these operations cannot be performed for meaningful results)
Interval: use mode, median, and mean
Ratio: use mode, median, and mean
What are confidence intervals?
The range of values that you expect your estimate to fall within a certain percentage of the time if you run your experiment again or re-sample the population in the same way; a range of values, based on sample data, that is likely to contain the true value
The confidence interval provides a sense of the size of any effect. The figures in a confidence interval are expressed in the descriptive statistic to which they apply (percentage, correlation, regression, etc.). This effect size information is missing when a test of significance is used on its own.
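A rough sketch of one common way to build a confidence interval for a mean: sample mean plus or minus a critical value times the standard error. The cards do not give a formula, so the normal (z) approximation, the helper name `mean_confidence_interval`, and the scores are all assumptions; small samples are usually handled with a t critical value instead.

```python
import math
import statistics

def mean_confidence_interval(values, confidence=0.95):
    """Approximate CI for the mean using a normal (z) critical value."""
    n = len(values)
    sample_mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical test scores, not taken from the chapter.
scores = [52, 58, 61, 67, 67, 73, 75, 78, 84, 91]
low, high = mean_confidence_interval(scores)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```

Asking for more confidence (for example 99%) widens the interval; a larger sample narrows it.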
Skewed Distribution
distributions that are not normal (not symmetric); a large number of scores are clumped toward one end, with a long tail toward the other
Left skewed distribution
majority of the numbers clumped at the right, with the long tail pointing toward the left
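A small illustration with made-up numbers: most values sit at the high end and one low value forms the left tail. In data like this the tail drags the mean below the median; that is a common rule of thumb for left skew rather than a guarantee.

```python
import statistics

# Hypothetical left-skewed scores: clumped at the right, tail to the left.
left_skewed = [1, 7, 8, 8, 9, 9, 9, 10]

mean_value = statistics.mean(left_skewed)
median_value = statistics.median(left_skewed)
print(mean_value, median_value)
print(mean_value < median_value)
```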