BUSI344 / CHAPTER 6 QUESTIONS 2 Flashcards
FOUR MEASURES OF GOODNESS OF FIT
They are the coefficient of determination (R2), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-Statistic. In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors.
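For reference, here is a minimal sketch (not from the course materials; the sales data and variable names are invented purely for illustration) of how all four measures can be computed for a one-variable model using numpy:

```python
import numpy as np

# Hypothetical data (invented for illustration): living area in sq. ft. and sale price
sqft = np.array([900, 1100, 1250, 1400, 1600, 1750, 1900, 2100], dtype=float)
price = np.array([62000, 71000, 74500, 80000, 86000, 90500, 95000, 103000], dtype=float)

n = len(price)  # number of sales
k = 1           # number of independent variables

# Fit the one-variable regression line: price = b0 + b1 * sqft
b1, b0 = np.polyfit(sqft, price, 1)
predicted = b0 + b1 * sqft

sse = np.sum((price - predicted) ** 2)     # sum of squared errors (unexplained variation)
sst = np.sum((price - price.mean()) ** 2)  # total variation around the mean sale price

r2 = 1 - sse / sst                              # coefficient of determination
see = np.sqrt(sse / (n - k - 1))                # standard error of the estimate (dollars)
cov = see / price.mean() * 100                  # coefficient of variation (percent)
f_stat = ((sst - sse) / k) / (sse / (n - k - 1))  # F-statistic: explained vs. unexplained variance

print(f"R2 = {r2:.3f}, SEE = ${see:,.0f}, COV = {cov:.1f}%, F = {f_stat:.1f}")
```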
TWO MEASURES THAT RELATE TO THE IMPORTANCE OF INDIVIDUAL VARIABLES
The correlation coefficient (r) and the t-statistic relate to the importance of individual variables in the model.
COEFFICIENT OF DETERMINATION
R2 measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line.
That is, essentially, how good the estimates of selling price are when based on this expression involving total square footage of living area.
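In symbols (a standard textbook form, shown here for reference rather than quoted from the course text):

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]

where y_i is an actual sale price, ŷ_i is the price predicted by the regression line, and ȳ is the average sale price; the numerator is the unexplained (error) variation and the denominator is the total variation in sale prices.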
POSSIBLE VALUES OF R2
Possible values of R2 range from 0 to 1. When R2 = 0, none of the variation in sales prices is explained by the model. On the other hand, when R2 = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
R2 STATISTIC MEASURES . . . . .
The R2 statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area).
INTERPRETATION OF R2 - EXAMPLE
If the R2 is 0.59, this means that the regression line is able to explain about 60% of the variation of the sales prices (“variation” refers to the squared differences between sales prices and the average sale price). In practice, this can be loosely interpreted to mean total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Or, conversely, total living area determines 60% of the selling price set by the vendor, while 40% is explained by other characteristics or by random variations in price. These two statements make intuitive sense at the very least - an important result, as common sense is a key factor in analyzing regression results!
R2 HAS TWO SHORTCOMINGS
The use of R2 has two shortcomings. First, as we add more regression variables, R2 can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales.
The second shortcoming of R2 (shared also by adjusted R2) is more a matter of care in interpretation. There can be no specified universal critical value of R2; i.e., you cannot say “acceptable results have an R2 of 85%” or any other value. The critical value of the R2 statistic will vary with several factors, and there are several non-mathematical reasons for variations in R2 which make setting a specific target for this statistic inadvisable.
ADJUSTED R2
R2 can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R2. In the present example, the addition of number of windows as a nineteenth variable will cause adjusted R2 to fall unless the variable makes some minimum contribution to the predictive power of the equation.
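One common form of the adjustment (a standard textbook formula, included here for reference) is:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\]

where R̄2 denotes adjusted R2, n is the number of sales, and k is the number of independent variables; an added variable raises R̄2 only if it improves R2 by more than the penalty for using up a degree of freedom.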
WHAT’S MORE IMPORTANT IN REGRESSION MODELS?
In general in regression models, improving the standard error and COV is more important than increasing the adjusted R2, but you should generally try to have the adjusted R2 as high as possible and the standard error and COV as low as possible.
MEASURES THE DIFFERENCE BETWEEN REGRESSION LINE AND ACTUAL OBSERVATIONS
The standard error of the estimate (SEE) is one measure of how good the best fit is, in terms of how large the differences are between the regression line and the actual sample observations.
THE SEE MEASURES . . . .
The SEE measures the amount of deviation between actual and predicted sales prices.
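In symbols (a standard computational form, for reference):

\[
\text{SEE} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1}}
\]

where y_i is an actual sale price, ŷ_i is the predicted price, n is the number of sales, and k is the number of independent variables, so the SEE can be read as the typical dollar size of the model's prediction errors.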
DISTRIBUTION OF REGRESSION ERRORS
In our example, we found an SEE of $9,556.24. Note that whereas R2 is a percentage figure, the SEE is a dollar figure if the dependent variable is price. Similar to the standard deviation discussion in Lesson 1, assuming the regression errors are normally distributed, approximately 68% of the errors will be $9,556 or less in absolute value and approximately 95% will be $19,112 or less (see Figure 2.1 in Lesson 2).
NOTE ON R2
In mass appraisal, we often divide properties into sub-groups and develop separate model equations for each, e.g., for each neighbourhood separately.
This reduces the variance among sales prices within each sub-group, and therefore we should not expect MRA to explain as large a percentage of the variation as when one equation is fit to the entire jurisdiction. For example, if one model is developed to estimate sale price for all neighbourhoods in a sales database, there may be $300,000 in variation among the sales prices.
A model that explains 80% of the variation still leaves 20%, or $60,000, unexplained. A model for a single neighbourhood, with only $50,000 of variation in sale price, may have an adjusted R2 of only 60%, but will produce better estimates of sales prices in that neighbourhood because the unexplained 40% of the variation is only $20,000. The standard error and COV (discussed later) will show this improvement.
PROBLEM WITH USING SEE
The problem with the SEE is that it is an absolute measure: its size alone does not tell you much, and thus it can only be used in comparison to other similar models. However, you can create a further statistic from it that tells you how well you are doing in relative terms in your particular model. By dividing the SEE by the mean of the dependent variable, you get a relative measure called the coefficient of variation or COV.
EXPRESSING SEE AS A PERCENTAGE
In our example, the SEE is $9,556. This would indicate a good predictive model when mean property values are high, but not when they are low. Expressing the SEE as a percentage of the mean sale price removes this source of confusion.
COEFFICIENT OF VARIATION IS . . .
In regression analysis, the coefficient of variation (COV) is the SEE divided by the mean sale price and multiplied by 100; that is, the SEE expressed as a percentage of the mean sale price.
INTERPRETING THE COV
The COV is calculated by dividing the SEE (9,556.24) by the mean of the sale prices (76,593.50) and multiplying by 100, yielding 12.48%. In general, for residential models which have sale price as the dependent variable, a COV of approximately 20% is acceptable, while a COV of approximately 10% indicates a very good result. At 12.5%, our model’s COV is acceptably small, but not fantastic. This tells us that total square footage of living area does a fairly good job of predicting sale price, but there is more to sale price than just this one variable (as we would expect!).
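Written out, the arithmetic from the example is:

\[
\text{COV} = \frac{\text{SEE}}{\text{mean sale price}} \times 100 = \frac{9{,}556.24}{76{,}593.50} \times 100 \approx 12.48\%
\]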
THE CORRELATION COEFFICIENT
The correlation coefficient (r) is the first of two statistics that relate to individual regression variables. As explained in Lesson 1, the correlation coefficient is a measure that indicates the strength of the relationship between two variables. It can take on values from -1.0 to +1.0, ranging from perfect negative correlation, through no correlation at 0, to perfect positive correlation.
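For two variables x and y, the usual formula (a standard definition, given here for reference) is:

\[
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}
\]

In a one-variable regression, the square of r between the independent and dependent variables equals the model's R2.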
SIZE OF SEE
If the SEE is small, the observations are tightly scattered around the regression line. If the SEE is large, the observations are widely scattered around the regression line. The smaller the standard error, the better the fit.
THE CORRELATION COEFFICIENT MEASURES
The correlation coefficient measures how strongly two variables have a straight line relation to each other, but does not give the exact relationship. Two sets of data (x,y) yielding exactly the same regression equation (straight line) may have very different correlation coefficients between x and y.
REGRESSION COEFFICIENTS INDICATE . . .
Regression coefficients indicate how variables are related; that is, how many units (dollars) the dependent variable changes when the independent variable changes by one unit (for example, one square foot), with other variables in the equation held constant.
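A minimal two-variable sketch of how the coefficients are read (the sales data and the second variable, lot size, are invented purely for illustration):

```python
import numpy as np

# Hypothetical sales: living area (sq. ft.), lot size (sq. ft.), and sale price
sqft = np.array([900, 1100, 1250, 1400, 1600, 1750], dtype=float)
lot = np.array([3000, 3500, 3200, 4000, 4200, 4500], dtype=float)
price = np.array([62000, 71000, 74500, 80000, 86000, 90500], dtype=float)

# Design matrix with a constant term, then an ordinary least-squares fit
X = np.column_stack([np.ones_like(sqft), sqft, lot])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
constant, b_sqft, b_lot = coefs

# b_sqft is read as: dollars added to predicted price per extra square foot of
# living area, holding lot size constant (and likewise for b_lot).
print(f"${b_sqft:,.2f} per sq. ft. of living area, ${b_lot:,.2f} per sq. ft. of lot")
```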
T-STATISTIC IS A MEASURE OF . . .
The t-statistic is a measure of the significance or importance of a regression variable in explaining differences in the dependent variable (sale price).
WHAT IS CONSIDERED A HIGH T VALUE?
Generally, if you have plenty of data and want to have a statistical confidence of 95% in your answer, the critical value that the t-statistic must exceed in absolute value is ±1.96. A t-statistic in excess of ±2.58 indicates that one can be 99% confident that the independent variable is significant in the prediction of sale price.
T STATISTIC RULES OF THUMB
As a rough rule-of-thumb, modelers often use critical t-statistic levels of 1.6 (90% confidence) or 2.0 (95% confidence).
A significance level of .10 suggests that one can be at least 90% confident that the variable coefficient is significantly different from 0 - or, in other words, that there is less than a 10% probability that the coefficient is equal to zero. If the probability is high that the coefficient is equal to zero, this would indicate that the variable provides no useful information to the model.
A significance level of less than .05 would indicate that the probability of the coefficient being equal to zero is 5% or less, which indicates a reliable result. Normally in mass appraisal work, a significance level of less than .10 is desired, and often .05 or less.
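A minimal sketch of the t-statistic calculation for a one-variable model (the sales data are invented; the standard error of the slope uses the standard simple-regression formula):

```python
import numpy as np

# Hypothetical sales data (invented for illustration)
sqft = np.array([900, 1100, 1250, 1400, 1600, 1750, 1900, 2100], dtype=float)
price = np.array([62000, 71000, 74500, 80000, 86000, 90500, 95000, 103000], dtype=float)

n, k = len(price), 1
slope, intercept = np.polyfit(sqft, price, 1)
predicted = intercept + slope * sqft

# Standard error of the estimate, then the standard error of the slope coefficient
see = np.sqrt(np.sum((price - predicted) ** 2) / (n - k - 1))
se_slope = see / np.sqrt(np.sum((sqft - sqft.mean()) ** 2))

# The t-statistic is the coefficient divided by its standard error; compare it
# (in absolute value) to rough critical levels such as 1.6, 2.0, or 2.58.
t_stat = slope / se_slope
print(f"t = {t_stat:.2f}  ->  significant at 95%? {abs(t_stat) > 2.0}")
```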