Regression Analysis
Regression analysis is one of the most powerful and commonly used statistical tools. It has two primary purposes:
 Used to identify a mathematical relationship between a dependent variable and one or more independent variables.
 Regression analysis can be used to forecast the behavior of the dependent variable and/or to better understand the nature of the relationship between the dependent and the independent variable(s).
What is the regression line?
 The regression line is the line that minimizes the dispersion of points around it; we measure the accuracy of the regression line by measuring that dispersion.
 (on the graph below) We attribute the difference between the actual data points and the values predicted by the regression line either to relationships between selling price and variables other than house size or to chance alone.
Linear Regression
A specific form of regression analysis that examines the linear relationship between a dependent variable and one or more independent variables. Linear regression analysis identifies the “best fit line,” the line that minimizes the sum of squared error terms between the observed values in the sample and the predicted values that lie on the regression line. This best-fit line is called the regression line.
Single variable linear regression
Single variable linear regression can be seen as an extension of hypothesis testing. We have learned to use hypothesis tests to determine whether or not there is a significant relationship between two variables. Single variable linear regression seeks to identify a linear relationship between two variables.
A single variable regression line can be described by the equation:
ŷ = a + bx
 ŷ is the expected value of y, the dependent variable, for a given value of x.
 a is the y-intercept of the line, the point at which the regression line intersects the vertical axis. This is the value of ŷ when the independent variable, x, is set equal to 0.
 b is the slope, the average change in the dependent variable y as the independent variable x increases by one.
 x is the independent variable, the variable we are using to help us predict or better understand the dependent variable.
Single Variable Linear Regression analysis is used to identify the best fit line between two variables. This analysis builds on two previous concepts we have used to study relationships between two variables:
 Scatter plots, which are useful for visualizing a relationship between two variables.
 The correlation coefficient, a value between −1 and +1 that measures the strength and direction (positive or negative) of the linear relationship between two variables.
As we learned earlier in the course, we typically use Greek letters (like σ) to refer to the “true” parameters associated with a population and Latin letters (like s) to refer to the estimates of those parameters we calculate from sample data. Similarly, we refer to the best fit line we obtain from our sample data as ŷ = a + bx to distinguish it from ŷ = α + βx, the idealized equation that represents the “true” best fit line.
Because the best fit line does not perfectly fit even the population data, we add an error term, ε, to the true equation: y=α+βx+ε.
The error term is the difference between the actual value of y and the expected value of y. That is, ε=y−ŷ.
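The least-squares formulas behind the coefficients a and b can be sketched in a few lines of Python; the data points below are hypothetical, chosen only to illustrate the calculation:

```python
# Minimal least-squares fit of y-hat = a + b*x (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: sum of cross-deviations divided by sum of squared x-deviations.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)

print(round(a, 3), round(b, 3))
```

This is the same best-fit line Excel's Trendline produces; it minimizes the sum of squared residuals ε = y − ŷ.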
REVIEW How to:
 Make a Scatter Plot in Excel
 Find the Correlation Coefficient in Excel
Create a Scatter Plot in Excel
 Insert > Scatter > Scatter With Only Markers
 Input Y range (Ex: C1:C11)
 Input X range (Ex: B1:B11)
 Check the “Labels in First Row” box
correlation coefficient
A measure of the strength of a linear relationship between two variables. The correlation coefficient can range from −1 to +1. A correlation coefficient of −1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, though it is possible that a nonlinear relationship exists between the two variables.
$250
Pick two points on the x-axis—let’s say 1,000 and 2,000—and see what the corresponding points are on the y-axis. According to the regression line, the expected selling price of a 1,000 square foot house is approximately $250,000, and for a 2,000 square foot house is approximately $500,000. Therefore, as house size increases by 1,000 square feet, price increases, on average, by approximately $250,000. To find the average change in price as house size increases by one square foot, we divide $250,000 by 1,000. We find that as house size increases by one square foot, price increases, on average, by approximately $250.
Option D
Since there is no obvious linear relationship between the variables, a line that is almost horizontal is most accurate. The line is positioned close to the average y-value.
How to Add the Best Fit Line to a Scatter Plot
Create a Scatter Plot in Excel

Insert > Scatter > Scatter With Only Markers
 Input Y range (Ex: C1:C11)
 Input X range (Ex: B1:B11)
 Check the “Labels in First Row” box

Insert > Chart Tools > Layout > Trendline

Check the Display Equation box to display the equation of the best fit line

Given the regression equation,
Selling Price = 13,490.45 + 255.36(HouseSize),
which of the following values represents the average change in selling price as house size increases by one square foot?
255.36
255.36 dollars/square foot is the line’s slope, which is equal to the average change in selling price as house size increases by one square foot.
Given the regression equation, Selling Price = 13,490.45 + 255.36(HouseSize),
what value represents the value of HouseSize at which the regression line intersects the horizontal axis?
 −52.83 square feet
The regression line intersects the horizontal axis when Selling Price = $0, that is, when House Size = −52.83 square feet: 13,490.45 + 255.36 × (−52.83) = $0.00 (actually, −52.82914, which rounds to −52.83).
Given the regression equation, SellingPrice = 13,490.45 + 255.36(HouseSize), which of the following values represents the value of SellingPrice at which the regression line intersects the vertical axis?
$13,490.45
13,490.45 is the y-intercept, the value at which the regression line intersects the y-axis. This happens when House Size = 0, giving the equation: Selling Price = 13,490.45 + 255.36 × 0 = 13,490.45.
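The two intercept answers above can be checked with a couple of lines of Python, using the coefficients from this module's regression equation:

```python
# Axis intercepts of the fitted line Selling Price = 13,490.45 + 255.36 * HouseSize.
a, b = 13_490.45, 255.36

y_intercept = a        # Selling Price when HouseSize = 0
x_intercept = -a / b   # HouseSize when Selling Price = 0

print(round(x_intercept, 2))  # -52.83
```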
Given the general regression equation, ŷ =a+bx, which of the following describes ŷ? Select all that apply.
 The expected value of y
 The expected value of x
 The independent variable
 The dependent variable
 The value we are trying to predict
 The intercept
 The expected value of y
 The dependent variable
 The value we are trying to predict
How to Forecast in Excel
Using the equation ŷ =a+bx
 Plug in the numbers! (See photo below)
Use Excel’s FORECAST function: (we didn't use this in the module, but it was mentioned)
=FORECAST(x, known_y’s, known_x’s)
 x is the data point for which you want to predict a value.
 known_y’s is the dependent array or range of data.
 known_x’s is the independent array or range of data.
 In order to use this function we must have the original data. This approach also gives us a point forecast, but does not provide other helpful values that Excel’s regression tool produces.
WHATEVER THE CASE, MAKE SURE THE PREDICTION IS WITHIN THE RANGE OF HISTORICAL DATA OR IT IS NOT A GOOD FORECAST.
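A small sketch of a point forecast with the range check described above, using this module's regression coefficients; the historical range bounds (lo, hi) are hypothetical placeholders for the sample's smallest and largest house sizes:

```python
# Point forecast from the module's fitted line:
#   Selling Price = 13,490.45 + 255.36 * HouseSize
A, B = 13_490.45, 255.36

def forecast(house_size, lo, hi):
    """Return the point forecast, refusing sizes outside the historical range."""
    if not (lo <= house_size <= hi):
        raise ValueError("outside the range of historical data -- not a good forecast")
    return A + B * house_size

# Hypothetical historical range: 500 to 4,000 square feet.
print(forecast(3_500, lo=500, hi=4_000))
```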
Prediction Interval
Rather than predicting just a single point, we construct an interval, or range, around the point forecast.
A prediction interval is a range of values constructed around a point forecast. The center of the prediction interval is the point forecast, that is, the expected value of y for a specified value of x. The range of the interval extends above and below the point forecast. The width of the interval is based on the standard error of the regression and the desired level of confidence in the prediction.
 The center of the prediction interval is the point forecast, in the case below about $525,000. The standard error of the regression, in this case about $151,000, is a reasonable but conservative estimate of the forecast's standard deviation. The standard error of the regression is easily found in a regression output table.
 We have to choose a level of confidence for our prediction interval. A 95% prediction interval would run about two standard deviations above and below the point forecast. To forecast the price of a 2,000-square-foot home, the 95% prediction interval would be about $525,000 plus or minus two times $151,000.
 With this, we are able to say that we are 95% confident that the actual selling price will fall within the prediction interval.
 Since there is greater uncertainty when we forecast further from the mean of the independent variable, we can infer that the prediction interval should be wider as we move away from the average house size. So although the standard error is a reasonable estimate on which to base our range, the actual calculation is more complicated. As we move towards and then beyond the edges of the historical data, the width of the distribution around the point forecast increases. In this case, a 95% prediction interval for the selling price of a 7,000-square-foot home would be much wider than that for a 2,000-square-foot home.
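The rough two-standard-error interval can be computed directly, using the values from the module's 2,000-square-foot example:

```python
# Approximate 95% prediction interval: point forecast +/- 2 standard errors
# (numbers from the module's housing example).
point_forecast = 525_000   # expected selling price of a 2,000 sq ft home
std_error = 151_000        # standard error of the regression

low = point_forecast - 2 * std_error
high = point_forecast + 2 * std_error
print(low, high)  # 223000 827000
```

Remember this is the conservative approximation; the exact interval widens as house size moves away from the sample mean.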
(4.3.2)
The best point forecast for the selling price of a 2,500 square foot house is the expected selling price of a 2,500 square foot home, approximately 13,490 + 255.36(2,500) = $652,000. Given that the standard error of the regression is about $151,000, which of the following would give the BEST estimate for the prediction interval for a 2,500 square foot home with approximately 95% confidence?
$652,000 ± 2($151,000)
A prediction interval is centered at a point forecast, in this case $652,000. The standard error of the regression is multiplied by 2 since we wish to estimate the prediction interval at the 95% confidence level. Note that we are using 2 to approximate the z-value for a 95% prediction interval. The actual z-value corresponding to 95% (for sufficiently large samples) is 1.96.
The standard error of the regression is....
is a reasonable but conservative estimate of the forecast's standard deviation
How would the width of the actual prediction interval (at a 95% confidence level) for a 3,000 square foot home differ from the width of the actual prediction interval (at a 95% confidence level) for a 2,000 square foot home, given that the average home size is approximately 1,750 square feet?
The width of the actual prediction interval for a 3,000 square foot home would be larger than the width of the prediction interval for a 2,000 square foot home.
Because 3,000 square feet is further from the mean house size (1,750 square feet) than 2,000 square feet, the actual prediction interval at 3,000 square feet will be wider.
The width of the actual prediction interval is based on both the standard error of the regression and the distance from the mean; the actual prediction interval gets wider as the value of the independent variable moves further from the mean of the independent variable.
The image below compares prediction intervals created using both of the methods we have discussed. The red dashed lines show the actual prediction intervals for different house sizes. The blue dashed lines represent our method of estimating the prediction interval using the standard error of the regression. Note that the actual prediction intervals widen as house size moves further from the mean, whereas the estimated prediction intervals do not; they are parallel to the regression line.
Yes
3,500 lies well within the range of our historical housing data, so we can feel relatively comfortable with this prediction.
Given the regression equation, SellingPrice = 13,490.45 + 255.36(HouseSize), what do you expect the selling price of a 425 square foot home to be?
What is the vertical distance between a data point and the line?
The Residual Error.
This error is the difference between the observed value and the line's prediction for the dependent variable. This difference may be due to other factors that influence selling price or to chance alone. Collectively, the residuals for all the data points measure how accurately a line fits a data set.
Variation unexplained by the regression line!
The Sum of Squared Errors, or the Residual Sum of Squares
The amount of variation that is not explained by the regression line. The residual sum of squares is equal to the sum of the squared residuals, that is, the sum of the squared differences between the observed values of the dependent variable and the predicted values of the dependent variable. To calculate the residual sum of squares, subtract the regression sum of squares from the total sum of squares.
For this, we take the square of each distance and then add all of those squared terms together.
A regression line is formally defined as the line that minimizes the sum of squared errors.
Total Sum of Squares
This is basically the sum of squared errors for the mean price line (the straight line in the graph below).
The mean price line gives us a benchmark: it tells us how much more we know about selling price when we use the regression than when we use the mean alone.
Formal Definition: The variance of the dependent variable, that is, the sum of squared differences between the observed values of the dependent variable and the mean of the dependent variable. The total sum of squares is equal to the regression sum of squares plus the residual sum of squares.
Regression Sum of Squares
The amount of variation that is explained by the regression line. To calculate the regression sum of squares, subtract the residual sum of squares from the total sum of squares.
Residual Sum of Squares
The Residual Sum of Squares is the amount of variation that is left unexplained by the regression line, that is, the sum of the squared differences between the predicted and observed values. That is exactly what this graph shows.
Total Sum of Squares
The Total Sum of Squares is the variance of y, that is, the total variation in y. The Total Sum of Squares equals the sum of the squared differences between the observed values of y and the mean of y. That is exactly what the graph shows.
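The decomposition described above (Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares) can be verified numerically; the data points here are hypothetical:

```python
# Verify Total SS = Regression SS + Residual SS for a least-squares fit
# (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 4.0, 8.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
preds = [a + b * x for x in xs]

total_ss = sum((y - y_bar) ** 2 for y in ys)            # around the mean
regression_ss = sum((p - y_bar) ** 2 for p in preds)    # explained
residual_ss = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained

print(abs(total_ss - (regression_ss + residual_ss)) < 1e-9)  # True
```

The identity holds exactly for any least-squares line, which is why subtracting one sum of squares from the total always yields the other.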
R-squared (R²) measures how closely a regression line fits a data set. It is a standardized measure of the regression line’s explanatory power. It is defined as the percentage of total variation in the dependent variable, y, that is explained by the regression line.
In single variable linear regression, i.e., a regression model that has only one independent variable, R² is the square of the correlation coefficient between the independent and dependent variables.
R² = (Correlation Coefficient)²
Equivalently, the correlation coefficient is the positive or negative square root of R^{2}. The sign is determined by whether there is a positive or negative relationship between the two variables.
Correlation Coefficient (R) = ±√R²
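This identity can be checked numerically: compute the correlation coefficient directly, fit the least-squares line, and compare r² with R² = 1 − RSS/TSS (the data below are hypothetical):

```python
import math

# Check that R-squared equals the squared correlation coefficient
# for a single variable regression (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.5, 8.0, 10.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

b = sxy / sxx                    # best-fit slope
a = y_bar - b * x_bar            # best-fit intercept
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
r_squared = 1 - rss / syy        # R^2 = 1 - RSS / TSS

print(round(r ** 2, 6), round(r_squared, 6))  # the two values agree
```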
EXCEL: Adding R-squared to a Scatter Plot
 We can quickly find the R² for a single variable linear regression by creating a scatter plot of the two variables.
 When we select Trendline and check the Display Equation box to display the equation of the best fit line, we can also check the Display R-squared Value box to display the R² value on the scatter plot.
R-squared can only take on values between 0 and 1. Let's look at those two extremes.
 When R-squared equals 0, the regression line explains none of the variation in the dependent variable.
 When R-squared equals 1, the line explains all of the variation in the dependent variable. The regression line fits the data perfectly.
As always, it's critical that we consider the problem we're trying to solve and its context before performing or evaluating a regression analysis.
Although we may wish for an R-squared close to one, there are many contexts for which this is not realistic.
LOW R-squared values are expected and accepted sometimes! KNOW YOUR FIELD/QUESTION
In fields such as human behavior, lower R-squared values are both expected and accepted because human behavior is difficult to predict. Suppose we wish to know how well a person's tenth grade reading speed predicts his or her lifetime salary. There may be a relationship between the two variables, but we would not expect reading speed to be a very good predictor of salary. An R-squared of 0.1 might be considered high in this case.
Another Way to Calculate R^{2}
Earlier in this module, we found that the correlation coefficient between house size and selling price is 0.86. What is the R² of the best fit line that describes the relationship between selling price and house size?
0.74
Remember that for a single variable linear regression, R² is the square of the correlation coefficient. Here, the correlation coefficient is 0.86, so R² = 0.86² ≈ 0.74.
0.80
The independent variable explains a lot of the variation in the dependent variable, but not quite all of it. In total, the data points are close to the best fit line, but they do not lie on it. Thus, an R² of 0.80 seems like a good estimate.
the p-value
the likelihood that we would select a sample at least as extreme as the one we observed if the null hypothesis were true
What are two ways to test whether the slope of the best fit line equals zero?
And why would we need to do this?
We need to do this because: if the 95% confidence interval for the slope does not include zero, we can be 95% confident that the true value of the slope is not zero and thus that a significant relationship exists between the variables.

Check whether the confidence interval for the line's slope contains zero
Remember that the coefficients of the regression line are just estimates of the true linear relationship between the dependent and independent variables. A coefficient’s lower 95% and upper 95% values give us the lower and upper bounds of the 95% confidence interval for that coefficient. Recall that if the best fit regression line has a slope of zero, then the regression line is just a flat line equal to the mean of the dependent variable, indicating that there is no linear relationship between the two variables. Thus, if the 95% confidence interval for the slope does not include zero, we can be 95% confident that the true value of the slope is not zero and thus that a significant relationship exists between the variables. In the photo below, we can say we are 95% confident that the true slope of the regression line describing the relationship between selling price and house size is between 196.10 and 314.63. Because this range does not include the value zero, we can be 95% confident that there is a significant linear relationship between the variables.

Check whether the p-value is less than 0.05
As we noted earlier, regression analysis builds on hypothesis testing. In fact, a single variable linear regression analysis is equivalent to the hypothesis test H₀: β = 0 versus Hₐ: β ≠ 0.
Recall that the p-value for a hypothesis test is the likelihood that we would select a sample at least as extreme as the one we observed if the null hypothesis were true. The p-value associated with a regression coefficient is the likelihood of choosing a sample at least as extreme as the sample we used to derive the regression equation if the slope of the true regression line is actually zero, or equivalently, if there is no linear relationship between the two variables.
In the photo below, since the p-value for house size, 0.0000, is less than 0.05, we reject the null hypothesis that the slope is zero and can be confident that there is a significant linear relationship between selling price and house size. (We can ignore the p-value of the intercept coefficient because the y-intercept is just a constant. It does not represent an independent variable and thus provides no information about the significance of the relationship between two variables.)
Recall that a significance level of 5% corresponds to a confidence level of 95%, so checking whether a regression coefficient’s p-value is less than 5% is equivalent to checking whether the coefficient’s 95% confidence interval contains zero. Both approaches test whether or not we can be 95% confident that there is a significant linear relationship between the variables.
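The confidence-interval check reduces to a one-line rule, sketched here with the house-size interval from the module and a second, hypothetical interval:

```python
def slope_is_significant(ci_low, ci_high):
    """Slope is significant at the 5% level iff its 95% CI excludes zero."""
    return not (ci_low <= 0 <= ci_high)

print(slope_is_significant(196.10, 314.63))  # house-size slope: significant
print(slope_is_significant(-3.2, 5.0))       # hypothetical: not significant
```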
Which of the following 95% confidence intervals for a regression line’s slope indicates that the linear relationship is not significant at the 5% level? Select all that apply.
 −11.89; −2.17
 25.11; 44.37
 −20.00; 5.00
 −0.36; 0.55
 −11.89; −2.17
Remember that the 95% confidence interval of the slope must contain zero to indicate that the linear relationship is not significant at the 5% level. −11.89 and −2.17 are both negative, so this range does not contain zero.
 25.11; 44.37
Remember that the 95% confidence interval of the slope must contain zero to indicate that the linear relationship is not significant at the 5% level. 25.11 and 44.37 are both positive, so this range does not contain zero.
 −20.00; 5.00
The range between −20.00 and 5.00 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct.
 −0.36; 0.55
The range between −0.36 and 0.55 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct.
Yes
Since the p-value of the independent variable, 0.0000, is less than 0.05, we can be 95% confident that there is a significant linear relationship between gross box office and home video units. We could also note that (19.58; 22.95), the 95% confidence interval for the slope, does not contain zero.
Close to 1
The data points are very close to the line, so our regression line must explain a very large proportion of the variation in y. This would indicate a very high R².
Less than 0.05
A p-value less than 0.05 indicates that we can be 95% confident that the true slope is not zero, that is, that there is a significant linear relationship between the two variables. This graph provides strong evidence that there is a significant linear relationship between the two variables.
Smaller
A smaller R² means that less variation in the dependent variable y is explained by the regression line. Compared with the previous graph, the data points here are more dispersed around the regression line, indicating that less of the variation is explained by the regression line.
Less than 0.05
Even though the regression line has a smaller R² than the previous regression, there is still clearly a strong linear relationship between the variables. A p-value less than 0.05 indicates that we can be 95% confident that the true slope is not zero, i.e., that there is a significant linear relationship.
What would you say about the p-values and R² values of the graphs below?
The 1st graph has:
 High R² (0.99): A large portion of the variation in y is explained by the regression line.
 Low p-value (0.0000): There is a significant linear relationship between the dependent and independent variables.
The 2nd graph has:
 Lower R² (0.70): A smaller portion of the variation in y is explained by the regression line than in the previous graph.
 Low p-value (0.0000): There is a significant linear relationship between the dependent and independent variables, even though the R² is lower than in the previous graph.
Residual Plots & How to create them
The residual plot is a scatter plot with residuals on the yaxis and the independent variable on the xaxis. The plot graphically represents the residual (the difference between the observed value and predicted value of the dependent variable) for each observation. Examining residual plots can provide significant insight into the relationships among variables and the validity of the assumptions underlying regression models.
How to Create One:
 We first take each observed data point and measure its residual, the vertical distance from that point to the regression line.
 Then we graph each residual against the independent variable to form the residual plot.

If there is a linear relationship between the dependent and independent variables and the assumptions underlying regression analysis hold, we should not see any systematic pattern in the residual plot. The residuals should be spread randomly above and below the horizontal axis. Specifically, based on the assumptions underlying linear regression, the distribution of the residuals should follow a normal distribution with mean zero and a fixed variance.

If we do see a pattern in the residual plot, then it's possible that other factors may be influencing the dependent variable or that the linear model may not be the best fit for the data. For example, if the residuals appear to have a curved shape, there may be a nonlinear relationship between the dependent and independent variables.

If the residuals become larger as we move along the x-axis, we may be encountering a phenomenon known as heteroscedasticity. In heteroscedastic relationships, the variance changes systematically as the independent variable changes. This violates the assumption that the error terms follow a normal distribution with fixed variance.

Heteroskedasticity
A characteristic of the distribution of the residuals (error terms) in a regression. The error terms are heteroskedastic if the size of the error terms depends systematically upon the value(s) of the independent variable(s). Examining residual plots for patterns is useful for identifying heteroskedasticity (for example, if the error terms grow larger as the value of the independent variable grows larger, a classic funnel shape may be visible in the residual plot). Inferences drawn from a regression analysis with heteroskedastic error terms are suspect.
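One crude, informal way to look for the funnel shape in code is to compare the spread of residuals for small versus large values of the independent variable. This is not a formal statistical test, and the data, fitted line, and threshold below are all hypothetical:

```python
# Crude, informal heteroskedasticity check (not a formal test):
# compare average residual size for small x versus large x.
def residuals(xs, ys, a, b):
    """Residuals y - (a + b*x) for a fitted line y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, res):
    """Mean |residual| in the upper half of x divided by the lower half."""
    pairs = sorted(zip(xs, res))
    half = len(pairs) // 2
    lo = [abs(r) for _, r in pairs[:half]]
    hi = [abs(r) for _, r in pairs[half:]]
    return (sum(hi) / len(hi)) / (sum(lo) / len(lo))

# Hypothetical data scattered around y = 2x, with errors growing with x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.5, 7.2, 11.0, 9.0]
ratio = spread_ratio(xs, residuals(xs, ys, a=0.0, b=2.0))
print(ratio > 1)  # True: residuals grow with x, the classic funnel
```

A ratio well above 1 suggests the error variance grows with x; eyeballing the residual plot remains the standard first step.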
Which metrics are important to evaluate in order to determine whether a single variable linear regression model is a good fit for a data set?
And why do we need to do this?

 R² measures the percentage of total variation in the dependent variable, y, that is explained by the regression line.

 Analyze the p-value: we must test whether the relationship between the dependent and independent variable is significant and whether the linear model is a good fit for the data.

 Note that the p-value and R² provide different information. A linear relationship can be significant (have a low p-value) but not explain a large percentage of the variation (not have a high R²).

 Check the confidence interval associated with the independent variable’s coefficient; it indicates the likely range for that coefficient.
If the 95% confidence interval does not contain zero, we can be 95% confident that there is a significant linear relationship between the variables.

Residual plots can provide insights into whether a linear model is a good fit.

A pvalue to test the significance of a linear relationship between two variables was calculated to be 0.0210. What can we conclude? Select all that apply.
 We can be 90% confident that there is a significant linear relationship between the two variables.
 We can be 95% confident that there is a significant linear relationship between the two variables.
 We can be 98% confident that there is a significant linear relationship between the two variables.
 We can be 99% confident that there is a significant linear relationship between the two variables.

We can be 90% confident that there is a significant linear relationship between the two variables.
 Since the p-value, 0.0210, is less than 1 − 0.90 = 0.10, we can be 90% confident that there is a significant linear relationship between the two variables. Note another option is also correct.

We can be 95% confident that there is a significant linear relationship between the two variables.
 Since the p-value, 0.0210, is less than 1 − 0.95 = 0.05, we can be 95% confident that there is a significant linear relationship between the two variables. Note another option is also correct.
 We can be 98% confident that there is a significant linear relationship between the two variables.
 Since the p-value, 0.0210, is greater than 1 − 0.98 = 0.02, we cannot be 98% confident that there is a significant linear relationship between the two variables.
 We can be 99% confident that there is a significant linear relationship between the two variables.
 Since the p-value, 0.0210, is greater than 1 − 0.99 = 0.01, we cannot be 99% confident that there is a significant linear relationship between the two variables.
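The four comparisons in this quiz reduce to one rule, sketched here:

```python
def significant_at(p_value, confidence):
    """True if p_value < 1 - confidence (e.g. p < 0.05 at 95% confidence)."""
    return p_value < 1 - confidence

p = 0.0210
levels = [c for c in (0.90, 0.95, 0.98, 0.99) if significant_at(p, c)]
print(levels)  # only the 90% and 95% levels pass
```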
The linear relationship between two variables can be statistically significant but not explain a large percentage of the variation between the two variables. This would correspond to which pair of R² and p-value?
 Low R-squared, Low p-value
 Low R-squared, High p-value
 High R-squared, Low p-value
 High R-squared, High p-value
Low R-squared, Low p-value
A low R-squared and low p-value indicates that the independent variable explains little variation in the dependent variable and the linear relationship between the two variables is significant.
How to Perform a Regression Analysis in Excel (also with dummy variables)
 From the Data menu, select Data Analysis > Regression.

Enter the appropriate Input Y Range and Input X Range:

The Input Y Range is the dependent variable, in this case selling price. The data are in column C with its label, C1:C31.

The Input X Range is the independent variable, in this case house size. To ensure the independent variable is labeled correctly in the output table, enter the data with its label in column B, B1:B31.
 Check the Labels box.
 Check the Residuals and Residual Plots boxes to ensure we see the relevant residual information.
*With dummy variables, you follow the exact same steps; just make sure the dummy variable is properly coded (e.g., 0/1) in the data table before you enter it into the analysis.
*We interpret a dummy variable’s coefficient in the same way we interpret a coefficient for a quantitative independent variable.
*The regression analysis gives us more information than the hypothesis test alone would. Rather than simply calculating the p-value, rejecting the null hypothesis, and concluding that there is a significant linear relationship, the regression results provide the direction and magnitude of this relationship.
The expected selling price of homes in school districts where students have low SAT scores is B15+B16*0=B15=$389,376. You must link directly to the values in order to obtain the correct answer.
The average selling price of homes, given they are located in school districts where students have low SAT scores can be calculated as AVERAGEIF(B2:B31,0,C2:C31)=$389,376.
The expected selling price of homes in school districts where students have average SAT scores above 1700 is B15+B16*1=B15+B16=$809,100. You must link directly to the values in order to obtain the correct answer.
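A sketch of how the dummy coefficient works, using the two expected values above; the variable name HighSAT and the coefficient (derived as the difference of those two values) are illustrative, not taken from the actual output table:

```python
# How a 0/1 dummy shifts the expected value
# (numbers implied by the module's SAT-district example).
intercept = 389_376              # expected price when the dummy = 0 (low SAT)
dummy_coef = 809_100 - 389_376   # implied coefficient on the HighSAT dummy

def expected_price(high_sat):
    """Expected selling price; high_sat is 0 or 1."""
    return intercept + dummy_coef * high_sat

print(expected_price(0), expected_price(1))  # 389376 809100
```

The dummy coefficient is interpreted exactly like a quantitative coefficient: it is the average change in selling price as the dummy moves from 0 to 1.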
What are the three main parts to the regression output table?
THREE MAIN PARTS TO THE TABLE
 the Regression Statistics table,
 the ANOVA table, and
 the Regression Coefficients table.
*Review more tables
“Stock of Corn at Start of Year” is the independent variable, and “Corn Acreage Planted” is the dependent variable. The beginning stock of corn at the start of the year will be used to predict the number of acres of corn that are planted.
80.36%
R² is the amount of variation in home video units that is explained by this model. 80.36% of the variation in home video units can be explained by the relationship with gross box office sales.
The expected number of home video units that will be sold is B15+B16*360=7,074 thousand. You must link directly to the values in order to obtain the correct answer.