Chapter 2 Flashcards
(22 cards)
Correlation, what does it indicate and require:
Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association.
Correlation requires that both variables be quantitative, so that it makes sense to do the arithmetic indicated by the formula for r.
Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. Measuring height in inches rather than centimeters and weight in pounds rather than kilograms does not change the correlation between height and weight. The correlation r itself has no unit of measurement; it is just a number.
Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.
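A minimal numpy sketch (using made-up height and weight numbers) of both points: converting units leaves r unchanged, while a single outlier can move it noticeably.

```python
import numpy as np

# Hypothetical height/weight data (heights in inches, weights in pounds).
height_in = np.array([63, 66, 68, 70, 72, 75])
weight_lb = np.array([120, 140, 155, 165, 180, 200])

r_original = np.corrcoef(height_in, weight_lb)[0, 1]

# Change units: inches -> centimeters, pounds -> kilograms.
r_converted = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]

print(r_original, r_converted)   # identical: r has no unit of measurement

# Correlation is not resistant: one outlying observation changes r substantially.
height_out = np.append(height_in, 60)
weight_out = np.append(weight_lb, 250)
print(np.corrcoef(height_out, weight_out)[0, 1])
```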
Is correlation a complete description of two-variable data?
Correlation is not a complete description of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation.
What does correlation measure?
Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are.
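A quick sketch of this limitation, assuming a perfectly curved relationship y = x^2 on symmetric x values: the association is exact, yet the linear correlation is zero.

```python
import numpy as np

# A perfect but curved relationship: y = x^2 on x values symmetric about 0.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# The relationship is exact, yet the correlation is 0 (up to rounding).
print(np.corrcoef(x, y)[0, 1])
```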
What does a Regression Line Summarize?
A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. That is, regression describes a relationship between an explanatory variable and a response variable.
Regression line (definition)
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
Slope of a Regression Line
The slope b1 of a line y = b0 + b1x is the rate of change in the response y as the explanatory variable x changes. The slope of a regression line is an important numerical description of the relationship between the two variables.
Purpose of a Regression Line.
We can use a regression line to predict the response y for a specific value of the explanatory variable x.
Extrapolation
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
Least-Squares Regression Line
error = observed y − predicted y
The least-squares idea: make the errors in predicting y as small as possible by minimizing the sum of their squares.
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Regression line formula
y hat = b0 + b1x
Slope: b1 = r(sy/sx)
Intercept: b0 = y bar − b1(x bar)
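A short numpy sketch with illustrative data: compute the slope and intercept from r, the standard deviations, and the means, then cross-check against numpy's built-in least-squares fit.

```python
import numpy as np

# Illustrative data (any small x/y arrays work).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)

b1 = r * sy / sx                 # slope: b1 = r * (sy/sx)
b0 = y.mean() - b1 * x.mean()    # intercept: line passes through (x bar, y bar)

# Cross-check against numpy's own least-squares fit.
slope_np, intercept_np = np.polyfit(x, y, 1)
print(b1, slope_np)
print(b0, intercept_np)

# Prediction for a given x: y hat = b0 + b1 * x
print(b0 + b1 * 3.5)
```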
Regression line, important points
The expression b1 = r(sy/sx) for the slope says that, along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
The least-squares regression line always passes through the point (x bar, y bar) on the graph of y against x.
r^2 in Regression
The square of the correlation, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
When you report a regression, give r^2 as a measure of how successfully the regression explains the response.
r^2 formula explanation
r^2 = (variance of predicted values y hat) / (variance of observed values y)
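A small check of this identity with made-up data: the squared correlation matches the ratio of the variance of the predicted values to the variance of the observed values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.9, 4.5, 4.8, 6.3, 7.1])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

r = np.corrcoef(x, y)[0, 1]

# r^2 equals the variance of the predicted values divided by
# the variance of the observed values.
print(r ** 2)
print(y_hat.var() / y.var())
```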
Correlation and Regression
Correlation and regression are closely connected. The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units.
The square of the correlation, r^2, is the fraction of the variance of one variable that is explained by least-squares regression on the other variable.
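A sketch of the first connection, assuming illustrative data: after standardizing both x and y, the least-squares slope equals r.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 3.1, 3.9, 5.2, 5.8, 7.4])

# Standardize both variables (subtract the mean, divide by the standard deviation).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

slope_standardized, _ = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]

print(slope_standardized, r)   # the two values agree
```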
Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
residual = observed y − predicted y = y − y hat
Mean of Least Square Residuals
Although residuals can be calculated from any model fitted to the data, the residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero.
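A short sketch with made-up data: compute the least-squares residuals and confirm their mean is zero (up to floating-point roundoff).

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 4.8, 6.2, 7.9, 9.4, 11.8])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

residuals = y - y_hat            # residual = observed y - predicted y
print(residuals)
print(residuals.mean())          # essentially 0, apart from roundoff
```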
Residual Plot
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
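A minimal residual-plot sketch using matplotlib and the same kind of made-up data; the dashed line at zero makes systematic patterns easier to spot.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 4.8, 6.2, 7.9, 9.4, 11.8])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Residual plot: residuals against the explanatory variable,
# with a horizontal reference line at zero.
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual")
plt.show()
```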
Issues with Regression Lines: Outliers and Influential Points
Points that are outliers in the x direction can have a strong influence on the position of the regression line.
A point that is extreme in the x direction with no other points near it pulls the line toward itself.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
The influence of a point that is an outlier in y depends on whether there are many other points with similar values of x that hold the line in place.
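A sketch of influence with invented numbers: adding one point that is extreme in the x direction and off the pattern changes the slope markedly.

```python
import numpy as np

# Main cluster of points plus one point far out in the x direction.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.1, 3.9, 5.2, 6.1])

x_out = np.append(x, 15.0)       # outlier in x ...
y_out = np.append(y, 5.0)        # ... that does not follow the pattern

slope_without, _ = np.polyfit(x, y, 1)
slope_with, _ = np.polyfit(x_out, y_out, 1)

# Adding (or removing) the single extreme-x point changes the slope markedly,
# so that point is influential.
print(slope_without, slope_with)
```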
Issues with Regression Lines: Lurking Variables
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Correlation based on Averages
A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.
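A simulated illustration (hypothetical groups and noise levels): the correlation computed from group averages comes out noticeably higher than the correlation for the individuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical individual-level data: 5 groups, 50 people per group.
group_means_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.repeat(group_means_x, 50) + rng.normal(0, 0.5, 250)
y = 2.0 * np.repeat(group_means_x, 50) + rng.normal(0, 2.0, 250)

# Correlation for individuals.
r_individuals = np.corrcoef(x, y)[0, 1]

# Average x and y within each group, then correlate the averages.
x_avg = x.reshape(5, 50).mean(axis=1)
y_avg = y.reshape(5, 50).mean(axis=1)
r_averages = np.corrcoef(x_avg, y_avg)[0, 1]

print(r_individuals, r_averages)   # r for averages is noticeably higher
```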
Regression Line: Influential Observations
Influential observations are individual points that substantially change the regression line. They are often outliers in the x direction, but they need not have large residuals.
Correlation and Causation
High correlation does not imply causation.