Correlation and Regression (4) Flashcards
Relationships between variables
define covariance
• the importance of this lies in the concept of cause-and-effect, or that one variable
forces a different variable to behave in a potentially systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance – as one variable changes, a different variable changes as
well
however, ______ does not necessarily imply a causal relationship
covariance
‘correlation does not imply causation’
• although your data analysis may suggest a strong relationship, you must interpret the results using logic to confirm that the relationship is real
the primary tool for identifying covariance is a _____, or if using more than
2 variables, a ______ ____
scatter plot
scatterplot matrix
scatterplots
scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• the x variable is always the independent variable, or the cause
• the y variable is always the dependent variable, or the effect
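the covariance value doesn't tell you much except that...
if positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable); if negative, the relationship is negative (as the independent variable increases, the dependent variable decreases)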
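A minimal sketch of why the covariance value mostly tells you the sign. The data are hypothetical (distance from a lake and snowfall, after the example above), and `covariance` is an illustrative helper, not a library function.

```python
def covariance(xs, ys):
    """Sample covariance: summed products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: distance from a lake (km) and snowfall (cm).
distance_km = [1, 2, 3, 4, 5]
snowfall_cm = [50, 44, 41, 33, 30]

cov_km = covariance(distance_km, snowfall_cm)
# Same data with distance in metres: the sign is unchanged, the size is not.
cov_m = covariance([d * 1000 for d in distance_km], snowfall_cm)

print(cov_km, cov_m)  # both negative; cov_m is about 1000 times larger in magnitude
```

Because the size changes with the units, only the sign is easy to interpret, which is why the covariance is usually standardized.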
Pearson's correlation coefficient =
r
provides a measure of the strength of the relationship between the 2 variables, which can only be used for straight-line relationships
• to produce a more useful statistic, we can standardize the covariance to give us a value that falls between -1 and 1
- this is done by dividing the covariance by the product of the standard deviations of x and y
r=0 means no linear relationship
r=+1 means a perfect positive relationship
r=-1 means a perfect negative relationship
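A short sketch of the standardization, assuming the same hypothetical lake and snowfall data. `pearson_r` is an illustrative helper that divides the covariance by the product of the two standard deviations.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical data: distance from a lake (km) and snowfall (cm).
distance_km = [1, 2, 3, 4, 5]
snowfall_cm = [50, 44, 41, 33, 30]

r_km = pearson_r(distance_km, snowfall_cm)
# Changing the units of x leaves r unchanged, unlike the raw covariance.
r_m = pearson_r([d * 1000 for d in distance_km], snowfall_cm)
# Points that fall exactly on a rising straight line give r = +1.
r_line = pearson_r([1, 2, 3], [2, 4, 6])

print(r_km, r_m, r_line)
```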
correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant
for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship
in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental
• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant
sample size
smaller
large
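The cards do not say which test is used for Pearson's r, so the sketch below assumes one common approximation, the Fisher z-transformation, to show the effect of sample size. The sample sizes 20 and 500 are hypothetical, and `fisher_z_test` is an illustrative name.

```python
import math


def fisher_z_test(r, n):
    """Approximate two-sided test of r = 0 using the Fisher z-transformation (an assumption)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# The same r = -0.31 with a small and a large (hypothetical) sample size.
z_small, p_small = fisher_z_test(-0.31, 20)
z_large, p_large = fisher_z_test(-0.31, 500)

print(p_small, p_large)  # p_small is above 0.05, p_large is below 0.05
```

With the same r, only the larger sample is significant at the 0.05 level.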
Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?
Spearman's ρ and Kendall's τ are rank-based alternatives for ordinal data:
- both are limited to the range [-1, +1]
- both coefficients are positive when an increase in X corresponds to an increase in Y, and negative when an increase in X corresponds to a decrease in Y
- a value near zero indicates that the values of X are uncorrelated with the values of Y
Spearman's ρ (rho)
Spearman's ρ:
- rank the values of each variable from lowest to highest
- find the absolute difference between the two rankings of each observation, so if an observation is given a ranking of 7 in one variable and 8 in another, the absolute difference is 1
- concordant pairs: the number of pairs whose ranks fall in the same order across variables
- discordant pairs: the number of pairs whose ranks fall out of order across variables
then test for significance using a z test
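The ranking and rank-difference steps can be sketched as below. This assumes no tied values and uses the standard rank-difference formula ρ = 1 - 6Σd² / (n(n² - 1)), which the cards do not spell out. The data and helper names are hypothetical.

```python
def ranks(values):
    """Rank values from lowest (1) to highest (n); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]

rho = spearman_rho(xs, ys)
# A curved but always-increasing relationship still gives rho = 1.
rho_monotonic = spearman_rho(xs, [x ** 3 for x in xs])

print(rho, rho_monotonic)
```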
Kendall’s τ (tau)
• if both X and Y were perfectly correlated, their ranks should match perfectly
• when the pairs of ranks fall in order, they are concordant
• when the pairs of ranks are out of order, they are discordant
• since every rank below this pair is greater than 1, they are all concordant
• since there is 1 rank below less than 8, there are 2 concordant and 1 discordant
again, test for significance using a z test
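Counting concordant and discordant pairs can be sketched as follows. This assumes no ties and the simplest form of the coefficient, (concordant - discordant) / total pairs. The data and the name `kendall_tau` are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Count concordant and discordant pairs and return them with tau (no ties assumed)."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # the pair is in the same order in both variables
        elif direction < 0:
            discordant += 1  # the pair is in opposite order
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


# Hypothetical data.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]

n_concordant, n_discordant, tau = kendall_tau(xs, ys)
print(n_concordant, n_discordant, tau)
```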
Correlation for nominal or categorical datasets
use a contingency table
• first, construct a table of expected frequencies – what would we expect the contingency table to look like if everything was random?
- also construct a table of observed frequencies
- use totals from each row and column to find the expected values (row total × column total / grand total)
- gives you expected vs. actual
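then calculate the χ² value based on the difference between observed and expected
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where $f_o$ is the observed frequency and $f_e$ is the expected frequency in each cell
then use Tschuprow’s T, Cramér’s V, or Pearson’s C to standardize for a value between 0 and 1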
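A sketch of the contingency-table steps, using a hypothetical 2×2 table of observed counts. Cramér's V is taken here as sqrt(χ² / (n × (min(rows, columns) - 1))). The function names are illustrative.

```python
import math


def chi_square(observed):
    """Build the expected table from row and column totals, then sum (o - e)^2 / e."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return expected, chi2


def cramers_v(chi2, n, n_rows, n_cols):
    """Standardize chi-square to a value between 0 and 1."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Hypothetical observed counts for two categorical variables.
observed = [[30, 10],
            [10, 30]]

expected, chi2 = chi_square(observed)
v = cramers_v(chi2, n=80, n_rows=2, n_cols=2)

print(expected, chi2, v)
```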
correlation can be strongly affected by _____ _______
can be strongly affected by spatial autocorrelation
- we have to be careful in how we organize data
- this is particular to spatial data
simple linear regression
• simple linear regression, defining a straight-line relationship between two
variables, is a first step towards modeling, the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus
regression model is
a regression model is a simple mathematical equation that simulates how y will respond
to a change in x
• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship
Steps in Regression Analysis
1. correlation analysis
- determines the strength of the relationship
2. establish the nature of the relationship – ex: how does day length affect air temperature?
- this is done by fitting a “line of best fit” to the relationship
• this line is a simplification of the data – a regression model
the regression model does 3 things to our dataset:
1. it provides a simplified view of the relationship
2. it provides a means to evaluate the importance of the variables
3. it provides an opportunity to make predictions beyond the data set
y=mx+b
or
y=a+bx
y=dependent variable
m=slope
x=independent variable
b=y intercept
y=dependent variable
a=y intercept
b=slope
x=independent variable
least squares
regression analysis seeks to minimize the average size of the residuals through a process known as “least squares”, or minimizing the sum of the squared distances between each data point and the line of best fit
m = r × (sy / sx)
sy = standard deviation of y
sx = standard deviation of x
r = Pearson's r
b = ȳ - m × x̄, where ȳ and x̄ are the means of y and x
• you might notice that the numerator in the equation for the slope (b in y = a + bx) is the same as the numerator in the equation for the correlation coefficient, demonstrating the link between correlation and regression analysis
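A minimal sketch of the least-squares line in the y = mx + b notation, with the slope taken as r times the ratio of the standard deviations and the intercept taken from the means. The day length and temperature values are hypothetical, and `fit_line` is an illustrative name.

```python
import math


def fit_line(xs, ys):
    """Return slope m, intercept b, and Pearson's r for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    m = r * s_y / s_x          # slope from r and the standard deviations
    b = mean_y - m * mean_x    # the line passes through the point of means
    return m, b, r


# Hypothetical data: day length (hours) and air temperature (degrees C).
day_length = [8, 10, 12, 14, 16]
temperature = [2, 7, 11, 18, 22]

m, b, r = fit_line(day_length, temperature)
predicted = m * 13 + b  # predicted temperature for a 13-hour day

print(m, b, predicted)
```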
Results of regression: the numerator of the regression slope equation is the same as the numerator of the correlation equation
Results of a regression
- there will always be some residuals in the chart
- so you end up with variability that is explained by the equation and variability that remains unexplained
• ideally, regression analysis will maximize the explained variability and minimize the unexplained variability
• the proportion of the variability that is explained is called the coefficient of determination, r²
coefficient of determination
the proportion of the variability that is explained is called the coefficient of determination, r²
the value of r² ranges from 0, no variability is explained, to 1, all of the variability is explained
• we can also use r² as a percentage – if r² = 0.75, then we have explained 75% of the variability in the relationship, while also leaving 25% unexplained
- the bigger the sample, the lower the coefficient of determination will probably be, even though it's still good
- the smaller the sample, the higher the coefficient of determination might be, even though it might not be good
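The split into explained and unexplained variability can be sketched as below. The data are hypothetical, the line is fitted by least squares inside the helper, and `variability` is an illustrative name.

```python
def variability(xs, ys):
    """Fit a least-squares line and split the variability in y into explained and unexplained parts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    fitted = [slope * x + intercept for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((f - mean_y) ** 2 for f in fitted)
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # squared residuals
    return total, explained, unexplained


# Hypothetical data: day length (hours) and air temperature (degrees C).
day_length = [8, 10, 12, 14, 16]
temperature = [2, 7, 11, 18, 22]

total, explained, unexplained = variability(day_length, temperature)
r2 = explained / total  # coefficient of determination

print(r2)  # close to 1: most of the variability is explained
```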
T OR F
IF LINE HAS NO SLOPE THERE IS NO RELATIONSHIP
T
Confidence intervals put on the chart basically show the reader how many _______ are outside of the lines
residuals