Stats Flashcards
(33 cards)
What is an index?
A change compared to a base value expressed in %.
Eg. Today $ / 2006 $ (x100) = 140
Equivalent to a 40% increase over 2006
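A minimal sketch of the index calculation above. The dollar amounts are hypothetical, chosen only to reproduce the card's index of 140, and `index_value` is an illustrative name.

```python
def index_value(current, base):
    """Express `current` relative to `base`, with the base period set to 100."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return current / base * 100


# Hypothetical dollar amounts (assumption, not from the card).
base_2006 = 250_000
today = 350_000

idx = index_value(today, base_2006)
pct_change = idx - 100  # change relative to the base period, in percent
print(f"index = {idx:.0f}, change vs. 2006 = {pct_change:+.0f}%")
```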
Compare/contrast percentage vs. Index
A percentage compares only to its own point of reference. An index standardizes the point of reference.
Absolute vs. Relative frequencies
Absolute is the actual count of observations in each class. Relative is that count as a share of the total. Classes can be ranges, eg. 10-20
Line graph vs. Histogram
Line graph: for relative frequencies
Histogram: bars centered on the midpoint of each range, touching adjacent bars
EQ: mean (what, how to calc)
Sum / # of entries
Gives average value
Calc: E+, then x,y key
EQ: weighted arithmetic mean
Sum of (variable x weight) / sum of weights
Gives average
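The weighted mean formula above can be sketched as follows. The function name and the sample numbers are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical example: the value 20 counts three times as much as the value 10.
print(weighted_mean([10, 20], [1, 3]))
```

With equal weights this reduces to the ordinary mean (sum / # of entries).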
Mean vs median
Mean higher than median: bulk of data is lower, with a right-side tail (potential high outliers)
Mean lower than median: bulk of data is higher, with a left-side tail (potential low outliers)
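A small sketch of the comparison, using Python's standard `statistics` module. The data sets and the `tail_direction` helper are hypothetical.

```python
import statistics


def tail_direction(data):
    """Compare mean and median to guess which side the tail is on."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "right tail"
    if mean < median:
        return "left tail"
    return "roughly symmetric"


# Hypothetical data: one high outlier pulls the mean above the median.
high_outlier = [1, 2, 2, 3, 3, 3, 4, 20]
# Hypothetical data: one low outlier pulls the mean below the median.
low_outlier = [-20, 1, 2, 3, 3, 3, 4]

print(tail_direction(high_outlier))
print(tail_direction(low_outlier))
```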
When is mode useful?
Non-numeric data. Eg. Fav colours
EQ: Standard deviation
Symbol (σ) looks like an o with a tail
Value - mean, square it, sum them, divide by n, square root it
Gives how tightly clustered the values are. High = flat curve (loose)
Calc: sum values then ox, oy
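The steps above, written out one at a time. This sketch divides by n as the card does (the population form); the sample form divides by n - 1 instead. The data are hypothetical.

```python
import math


def population_std_dev(values):
    """Standard deviation following the card's steps, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]  # value - mean, squared
    variance = sum(squared_deviations) / n                  # sum them, divide by n
    return math.sqrt(variance)                              # square root it


# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std_dev(data))
```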
Standard deviation intervals for a normal distribution
1 = 68.27%
2 = 95.45%
3 = 99.73%
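These percentages can be checked with the standard normal relationship P(|Z| < k) = erf(k / sqrt(2)). A minimal sketch using only `math.erf`; the function name is illustrative.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} std dev = {share_within(k) * 100:.2f}%")
```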
EQ: Variance
Sum of (value - mean)^2 / n
OR std dev squared
Not very useful on its own (it is in squared units)
EQ: coefficient of variation
SINGLE VARIABLE
= std dev / mean x 100
Shows how big the standard deviation is compared to the mean. Larger means more dispersed or more variable
BIVARIATE
=SEE / mean (dependent variable) x 100
Tells us the size of the error compared to the mean of the dependent variable (%)
MULTIVARIATE
=same as bivariate
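A sketch of both forms of the coefficient of variation. The function names and sample numbers are assumptions. The single-variable version uses the population standard deviation (`statistics.pstdev`), and the regression version takes an SEE that has already been computed elsewhere.

```python
import statistics


def cov_single(values):
    """Single variable: std dev / mean x 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


def cov_regression(see, dependent_values):
    """Bivariate or multivariate: SEE / mean of the dependent variable x 100."""
    return see / statistics.mean(dependent_values) * 100


# Hypothetical data.
print(cov_single([2, 4, 4, 4, 5, 5, 7, 9]))
print(cov_regression(10, [90, 100, 110]))
```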
When is the median preferable to the mean?
- Open-ended frequency group
- Extreme values
EQ: Linear correlation coefficient (r)
Shows the linear relationship between two variables. Value from -1 to 1. Zero means no correlation. |r| > 0.8 strong, |r| < 0.4 weak.
BIVARIATE
Calc: x1 INPUT y1 [E+], [x^,r] ,SWAP
Cons: cannot describe non-linear relationships
Multivariate counterpart is coefficient of determination (R^2)
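A hand-rolled sketch of the Pearson correlation coefficient, as an alternative to the calculator keystrokes. The data are hypothetical. The last example illustrates the stated weakness: y = x^2 is a perfect non-linear relationship, yet r comes out as zero.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3], [2, 4, 6]))                # perfect positive line
print(pearson_r([1, 2, 3], [6, 4, 2]))                # perfect negative line
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x^2, non-linear
```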
EQ: Sum of least squares (S)
BIVARIATE
(Actual - predicted)^2, then sum
Univariate would be variance before dividing by n
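A minimal sketch of the sum above. The function name and the numbers are hypothetical.

```python
def sum_of_squared_errors(actual, predicted):
    """(Actual - predicted)^2 for each observation, then summed."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Hypothetical observations and model predictions.
print(sum_of_squared_errors([3, 5, 7], [2, 5, 9]))
```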
How to solve for a bivariate linear equation
x INPUT y [E+]
0 [y^,m] gives y-intercept
SWAP gives slope
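The calculator keystrokes fit a least-squares line. A sketch of the same calculation by hand, assuming the usual ordinary least-squares formulas and returning the y-intercept first to mirror the key order. The data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points lying exactly on y = 1 + 2x.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```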
EQ: Standard error of the estimates (Syx) or (SEE)
BIVARIATE is Syx
Equation 11.7
Suggests prediction error. Is in the same unit as the variable. Lower is better
MULTIVARIATE
also known as Root Mean Square Error
Analogous to standard deviation of regression errors, same % for normal distribution (68.27%, 95.45%, 99.73%). Shows how scattered the errors are. Low = close
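Equation 11.7 is not reproduced on the card, so the sketch below assumes a common textbook form: the square root of the sum of squared errors divided by n - k - 1, where k is the number of independent variables (k = 1 for the bivariate case). Some sources divide by n instead. The data are hypothetical.

```python
import math


def standard_error_of_estimate(actual, predicted, k=1):
    """sqrt(SSE / (n - k - 1)); assumed form, k = number of independent variables."""
    n = len(actual)
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / (n - k - 1))


# Hypothetical observations and predictions; result is in the units of the variable.
print(standard_error_of_estimate([3, 5, 7, 10], [2, 5, 9, 10]))
```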
How do you turn a non-numerical variable into a number?
Turn each option into a yes(1)/no(0)
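A sketch of the yes(1)/no(0) conversion. The `is_` column naming and the colour data are illustrative assumptions.

```python
def to_dummies(values):
    """Turn each distinct option into its own yes(1)/no(0) column."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical non-numeric variable.
colours = ["red", "blue", "red"]
for row in to_dummies(colours):
    print(row)
```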
What are the four “goodness of fit” variables?
Coefficient of determination (R^2)
Standard error of the estimate (SEE)
Coefficient of variation (COV)
F-value
What are the two statistics that relate to the importance of individual variables?
Correlation coefficient (r)
t-statistic
EQ: coefficient of determination (R^2) and adjusted R^2
MULTIVARIATE
=correlation coefficient (r) squared
Shows how well the regression model explains the variation in the dependent variable in %. 0 (low) to 1 (high).
Weaknesses:
1. Can only go up as more variables are added, so goodness of fit can be overstated by adding many insignificant variables. (Corrected by using adjusted R^2)
2. Every model is different, so no benchmarks for fit.
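A sketch of R^2 and adjusted R^2, assuming the usual definitions: R^2 = 1 - SSE/SST, and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) with n observations and k independent variables. The names and data are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in the dependent variable explained by the model."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Penalize R^2 for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observations and predictions from a one-variable model.
actual = [1, 2, 3, 4, 5]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(actual, predicted)
print(r2, adjusted_r_squared(r2, n=len(actual), k=1))
```

For a fixed R^2, raising k lowers the adjusted value, which is how the adjustment counters weakness 1.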
To improve a multiple regression model, what should you look at?
First, SEE and COV, then R^2
What are strata and what is their effect on R^2?
Strata are groups made before modeling. Then each group gets a model. Eg. Neighbourhoods.
R^2 may be lower because a large part of the variation is already removed by the strata. What is left to be modeled forms the new basis for R^2
EQ: f-value
(formula, meaning, benchmark, weakness)
= variance explained by regression divided by variance unexplained by regression
Is the model useful or no more useful than using the mean?
Tests whether model is NOT sufficient. F<4 = not significant
F>4=significant
Sensitive to the number of variables/observations. Many variables with few observations generally give F<4
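A sketch of the ratio above, assuming the usual degrees of freedom: explained sum of squares divided by k, over unexplained sum of squares divided by n - k - 1. The F > 4 cutoff is the card's rule of thumb. The names and numbers are hypothetical.

```python
def f_value(ss_explained, ss_unexplained, n, k):
    """Explained variance / unexplained variance for n observations, k variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    explained_variance = ss_explained / k
    unexplained_variance = ss_unexplained / (n - k - 1)
    return explained_variance / unexplained_variance


# Hypothetical sums of squares from a one-variable model with 5 observations.
f = f_value(ss_explained=9.9, ss_unexplained=0.1, n=5, k=1)
print(f, "significant" if f > 4 else "not significant")
```

Holding the sums of squares fixed, adding variables or dropping observations shrinks n - k - 1 and pushes F down, which matches the sensitivity noted on the card.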