Correlation and Regression (4) Flashcards

1
Q

Relationships between variables

define covariance

A

• the importance of this lies in the concept of cause-and-effect, or that one variable
forces a different variable to behave in a potentially systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance – as one variable changes, a different variable changes as
well

2
Q

however, ______ does not necessarily always imply a causal relationship

A

covariance

‘correlation does not imply causation’

• although your data analysis may suggest a strong
relationship, you must interpret the results using
logic to confirm that the relationship is real

3
Q

the primary tool for identifying covariance is using a _____, or if using more than
2 variables, a ______ ____

A

scatter plot

scatterplot matrix

4
Q

scatterplots

A

scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• the x variable is always the independent variable, or the cause
• the y variable is always the dependent variable, or the effect

5
Q

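the covariance value doesn’t tell you much except that...

A

if it is positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable); if it is negative, the relationship is negative. The size of the value depends on the units of the two variables, so on its own it says little about the strength of the relationship.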
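
The sign-versus-size point can be checked with a short Python sketch. The data are hypothetical (distance from a highway and house price), and `sample_covariance` is an illustrative helper that assumes the usual n - 1 denominator.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sample covariance: summed products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Hypothetical data: distance from a highway (km) and house price ($1000s)
distance_km = [1, 2, 3, 4, 5]
price_k = [200, 210, 225, 230, 250]

cov_km = sample_covariance(distance_km, price_k)

# Same distances expressed in metres: the sign stays positive,
# but the covariance becomes 1000 times larger
distance_m = [d * 1000 for d in distance_km]
cov_m = sample_covariance(distance_m, price_k)

print(cov_km, cov_m)
```

Only the sign is comparable between the two results, which is why the covariance is usually standardized before it is interpreted.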

6
Q

Pearson’s correlation coefficient =

A

r

provides a measure of the strength of the
relationship between the 2 variables, which can only be used for straight-line
relationships

• to produce a more useful statistic, we can standardize the covariance to give us a value
that falls between -1 and 1
• this is done by dividing the covariance by the product of the standard deviations of x and y

r = 0 means no linear relationship

r = +1 means a perfect positive relationship

r = -1 means a perfect negative relationship
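
As a minimal sketch of the standardization step, the Python below divides the sample covariance by the product of the two sample standard deviations. The data and the helper name `pearson_r` are hypothetical.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson's r: sample covariance divided by (s_x * s_y)."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical data: distance from a highway (km) and house price ($1000s)
distance_km = [1, 2, 3, 4, 5]
price_k = [200, 210, 225, 230, 250]

r_km = pearson_r(distance_km, price_k)
# Changing the units of x (km to m) leaves r unchanged
r_m = pearson_r([d * 1000 for d in distance_km], price_k)

print(round(r_km, 3), round(r_m, 3))
```

Unlike the covariance, r does not change when a variable is rescaled, so values from different datasets can be compared.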

7
Q

correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant

for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship

A

in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental

• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
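
The role of sample size can be illustrated with the usual t statistic for Pearson’s r, t = r·sqrt((n - 2) / (1 - r²)), with n - 2 degrees of freedom. In this sketch the two sample sizes are hypothetical (the card does not give n), and the critical values are approximate two-tailed 5% values from a standard t table.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether Pearson's r differs from 0 (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = -0.31

# Hypothetical sample sizes mapped to approximate two-tailed critical t
# values at alpha = 0.05 (df = 8 and df = 98)
critical_t = {10: 2.306, 100: 1.984}

results = {}
for n, t_crit in critical_t.items():
    t = t_statistic(r, n)
    results[n] = abs(t) > t_crit  # True means "statistically significant"
    print(n, round(t, 2), results[n])
```

Under these assumptions the same r = -0.31 is not significant with 10 observations but is significant with 100.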

8
Q

• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant

A

sample size

smaller

large

9
Q

Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?

A

use a non-parametric rank correlation coefficient, such as Spearman’s ρ or Kendall’s τ

  1. both are limited to the range [-1, +1]
  2. both coefficients are positive when an increase in X corresponds to an increase
    in Y, and negative when an increase in X corresponds to a decrease in Y
  3. a value near zero indicates that the values of X are uncorrelated with the values of
    Y
10
Q

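Spearman’s ρ

A

Spearman’s ρ:

  • rank each variable from lowest to highest
  • find the difference between the two ranks for each observation, so if an observation is ranked 7 on one variable and 8 on the other, the absolute difference is 1
  • square the differences and sum them: $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where d is the rank difference and n is the number of observations (assuming no tied ranks)
  • concordant pairs (pairs whose ranks fall in the same order on both variables) and discordant pairs (pairs whose ranks fall in opposite order) are counted for Kendall’s τ, not Spearman’s ρ

then test for significance using a z test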
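
A minimal Python sketch of the rank-difference calculation, assuming the standard formula ρ = 1 - 6Σd² / (n(n² - 1)) and no tied values. The data and the helper names `ranks` and `spearman_rho` are hypothetical.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical observations
x = [10, 20, 30, 40, 50]
y = [3, 1, 4, 9, 7]

rho = spearman_rho(x, y)
print(ranks(y), rho)  # ranks of y are [2, 1, 3, 5, 4]; rho is 0.8
```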

11
Q

Kendall’s τ

A
• if both X and Y were perfectly correlated, their ranks should match perfectly
• when the pairs of ranks fall in order, they are concordant
• when the pairs of ranks are out of order, they are discordant
• in the worked example, every rank below a rank of 1 is greater than 1, so those pairs are all concordant
• for a rank of 8 with 3 ranks below it, 1 of which is less than 8, there are 2 concordant pairs and 1 discordant pair

again, test for significance using a z test
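
Counting concordant and discordant pairs can be sketched in Python as follows. This assumes the common definition τ = (C - D) / (n(n - 1) / 2) with no ties; the data and the function name `kendall_tau` are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Return (tau, concordant, discordant) for paired data without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:      # both variables move the same way
            concordant += 1
        elif direction < 0:    # the variables move in opposite ways
            discordant += 1
    n = len(x)
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    return tau, concordant, discordant

# Hypothetical observations
x = [10, 20, 30, 40, 50]
y = [3, 1, 4, 9, 7]

tau, n_concordant, n_discordant = kendall_tau(x, y)
print(tau, n_concordant, n_discordant)  # 0.6 8 2
```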

12
Q

Correlation for nominal or categorical datasets

A

use a contingency table

• first, construct a table of expected
frequencies – what would we expect
the contingency table to look like if
everything was random?
• also construct a table of observed frequencies
  • use totals from each row and column to find expected %
  • gives you expected vs. actual

then calculate the χ² value based on the difference between observed and expected frequencies:
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where $f_o$ is an observed frequency and $f_e$ is the matching expected frequency

then use Tschuprow’s T, Cramér’s
V, or Pearson’s C to standardize to a value between 0 and 1
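
The contingency-table steps can be sketched in Python as below. The observed counts are hypothetical, expected frequencies are taken as (row total × column total) / n, and Cramér’s V is computed with the standard formula sqrt(χ² / (n(k - 1))), where k is the smaller of the number of rows and columns.

```python
import math

def chi_square(observed):
    """Return (chi-square, n) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected if random
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2, n

def cramers_v(observed):
    """Cramer's V: chi-square standardized to the range 0 to 1."""
    chi2, n = chi_square(observed)
    k = min(len(observed), len(observed[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical 2 x 2 table of observed counts
observed = [[30, 10],
            [10, 30]]

chi2, n = chi_square(observed)
v = cramers_v(observed)
print(chi2, v)  # 20.0 0.5
```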

13
Q

correlation can be strongly affected by _____ _______

A

spatial autocorrelation

  • we have to be careful in how we organize data
  • this is particular to spatial data
14
Q

simple linear Regression

A

• simple, linear regression, defining a straight-line relationship between two
variables, is a first step towards modeling, the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus

15
Q

regression model is

A

a regression model is a simple mathematical equation that simulates how y will respond
to a change in x

• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship

16
Q

Steps in Regression Analysis

A
  1. correlation analysis
    - determines the strength of the relationship
  2. establish the nature of the relationship – e.g. how does day length affect air temperature?
    - this is done by fitting a “line of best fit” to the relationship
    - this line is a simplification of the data – a regression model

17
Q

the regression model does 3 things to our dataset:

A
1. it provides a simplified view of the
relationship
2. it provides a means to evaluate the
importance of the variables
3. it provides an opportunity to make
predictions beyond the data set
18
Q

y=mx+b
or
y=a+bx

A

y=dependent variable
m=slope
x=independent variable
b=y intercept

y=dependent variable
a=y intercept
b=slope
x=independent variable

19
Q

least squares

A
regression analysis seeks to minimize the
average size of the residuals through a
process known as “least squares”, or
minimizing the sum of the squared
distances between each data point and
the line of best fit
20
Q

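$m = r \cdot \frac{s_y}{s_x}$

$s_y$ = standard deviation of y
$s_x$ = standard deviation of x
r = Pearson’s r

$b = \bar{y} - m\bar{x}$, where $\bar{y}$ and $\bar{x}$ are the means of y and x

A

• you might notice that the numerator in the equation for the slope (m here, b in the
y = a + bx form) is the same as the numerator in the equation for the correlation
coefficient, demonstrating the link between correlation and regression analysis

Results of regression: the numerator of the slope equation is the numerator of the correlation equation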
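
A short Python sketch of these two formulas, using hypothetical day length and air temperature values. The variable names are illustrative; the slope is also computed directly from the summed deviations to show that both routes agree.

```python
from statistics import mean, stdev

# Hypothetical data: day length (hours) and air temperature (deg C)
day_length_h = [8, 10, 12, 14, 16]
air_temp_c = [-5, 4, 6, 17, 18]

n = len(day_length_h)
x_bar, y_bar = mean(day_length_h), mean(air_temp_c)
s_x, s_y = stdev(day_length_h), stdev(air_temp_c)

# Shared numerator: summed products of deviations
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(day_length_h, air_temp_c))

r = sum_xy / ((n - 1) * s_x * s_y)   # Pearson's r
m = r * s_y / s_x                    # slope
b = y_bar - m * x_bar                # y intercept

# Direct least-squares slope from the same numerator
m_direct = sum_xy / sum((x - x_bar) ** 2 for x in day_length_h)

print(round(m, 2), round(b, 2), round(m_direct, 2))
```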

21
Q

Results of a regression

A
  • there will always be some residuals in the chart
  • so you end up with variability that is explained by the equation and variability that remains unexplained
  • ideally, regression analysis will maximize the explained variability and minimize the
    unexplained variability
  • the proportion of the variability that is explained is called the coefficient of
    determination, r²

22
Q

coefficient of determination

A

the proportion of the variability that is explained is called the coefficient of
determination, r²
the value of r² ranges from 0, no variability is explained, to 1, all of the variability
is explained
• we can also use r² as a percentage – if r² = 0.75, then we have explained 75% of the
variability in the relationship, while also leaving 25% unexplained

  • the bigger the sample, the lower the coefficient of determination will probably be, even though it can still be a good (significant) result
  • the smaller the sample, the higher the coefficient of determination might be, even though it might not be a good (significant) result
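
The split into explained and unexplained variability can be sketched in Python with sums of squares. The data are hypothetical, and the line is fitted with the ordinary least-squares slope and intercept.

```python
from statistics import mean

# Hypothetical data: day length (hours) and air temperature (deg C)
day_length_h = [8, 10, 12, 14, 16]
air_temp_c = [-5, 4, 6, 17, 18]

x_bar, y_bar = mean(day_length_h), mean(air_temp_c)
sxx = sum((x - x_bar) ** 2 for x in day_length_h)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(day_length_h, air_temp_c))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * x for x in day_length_h]

# Total variability = explained + unexplained (for a least-squares line)
total = sum((y - y_bar) ** 2 for y in air_temp_c)
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(air_temp_c, predicted))

r_squared = explained / total
print(round(r_squared, 3), round(unexplained / total, 3))
```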
23
Q

True or false:

if the line has no slope, there is no relationship

24
Q

Confidence intervals put on the chart basically show the reader how many _______ are outside of the lines

25
Q

of course, we would like an objective method to determine whether the coefficient of determination is statistically significant or not

$F = \frac{r^2 (n - 2)}{1 - r^2}$

A

• notice that this test statistic is just the square of the test statistic in the correlation coefficient test – the r and r² tests will always share the same results
• so, if Pearson’s r is significant, then so is r²
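
The link between the two tests can be checked numerically. This sketch assumes the correlation test statistic is t = r·sqrt((n - 2) / (1 - r²)); the value of r and the sample size are hypothetical.

```python
import math

def f_statistic(r, n):
    """F statistic for the coefficient of determination."""
    return r ** 2 * (n - 2) / (1 - r ** 2)

def t_statistic(r, n):
    """t statistic for Pearson's r (assumed form, df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = -0.31, 20  # hypothetical values

f_value = f_statistic(r, n)
t_value = t_statistic(r, n)

# F equals t squared, so both tests lead to the same decision
print(round(f_value, 4), round(t_value ** 2, 4))
```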
26
Q

assumptions to consider when we apply simple regression analysis to a data set (4)

A

• the relationship between x and y is linear and the equation for a straight line represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of best fit

the assumption of linearity is important for both correlation and regression – if the relationship is not obviously linear, it may still be intrinsically linear (a non-linear relationship that can be transformed to linear); otherwise, it is an intrinsically nonlinear relationship and must be represented by something other than a straight line
27
Q

why do a residual plot?

A

examining the residuals is a useful approach for interpreting the results of regression analysis – most software provides the option of plotting residuals for you
• a residual plot should be a very boring looking plot – there should be no trends or patterns, just a cloud of data points
• a line of best fit through the residuals should yield no useful regression model, r² should not be significant
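
A minimal numeric check of the "boring" residual plot, using hypothetical data: after a least-squares fit, the residuals average to zero and a second line fitted through them has a slope of zero. Patterns such as curvature or changing spread still need to be judged from the plot itself.

```python
from statistics import mean

def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: day length (hours) and air temperature (deg C)
day_length_h = [8, 10, 12, 14, 16]
air_temp_c = [-5, 4, 6, 17, 18]

slope, intercept = least_squares(day_length_h, air_temp_c)
residuals = [y - (intercept + slope * x)
             for x, y in zip(day_length_h, air_temp_c)]

residual_mean = mean(residuals)
# Fitting a line through the residuals should give nothing useful
residual_slope, _ = least_squares(day_length_h, residuals)

print(round(residual_mean, 6), round(residual_slope, 6))
```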
28
Q

Standard error

A

• the standard error can be thought of as the size of a typical residual, and since it is measured in terms of y, it shares the same units as the dataset
• for example, in the day length vs air temperature data, the standard error is 7.6°C
• this means that, on average, there is an error of ±7.6°C associated with our predictions made from the regression equation
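
The card does not give a formula, so this sketch assumes the usual standard error of the estimate, sqrt(Σ residual² / (n - 2)). The data are hypothetical and do not reproduce the 7.6°C figure.

```python
import math
from statistics import mean

# Hypothetical data: day length (hours) and air temperature (deg C)
day_length_h = [8, 10, 12, 14, 16]
air_temp_c = [-5, 4, 6, 17, 18]

n = len(day_length_h)
x_bar, y_bar = mean(day_length_h), mean(air_temp_c)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(day_length_h, air_temp_c))
         / sum((x - x_bar) ** 2 for x in day_length_h))
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x)
             for x, y in zip(day_length_h, air_temp_c)]

# Assumed formula: root of the summed squared residuals over n - 2
standard_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(standard_error, 2))  # in deg C, the same units as y
```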
29
Q

another way of assessing our regression model is to ask if the slope of the best fit line is significantly different than 0

A

remember that the slope represents the rate of change of y as x changes; if the slope is 0, or no different than 0, it tells us that y is not changing significantly with x

$s_b = \sqrt{\frac{s_e^2}{(n - 1) s_x^2}}$

$t = \frac{b - \beta}{s_b}$

where b is the calculated slope, β is the hypothesized slope (= 0), $s_b$ is the standard deviation of the slope, $s_e$ is the standard error, and $s_x$ is the standard deviation of the independent variable
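
A Python sketch of the slope test on hypothetical data. It assumes the standard deviation of the slope is the square root of se² / ((n - 1)·sx²), that se uses an n - 2 denominator, and that the critical value is the two-tailed 5% t value for 3 degrees of freedom from a standard table.

```python
import math
from statistics import mean, stdev

# Hypothetical data: day length (hours) and air temperature (deg C)
x = [8, 10, 12, 14, 16]
y = [-5, 4, 6, 17, 18]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))          # calculated slope
a = y_bar - b * x_bar                                # y intercept

# Standard error of the estimate (assumed n - 2 denominator)
s_e = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Standard deviation of the slope and the t statistic for beta = 0
s_b = math.sqrt(s_e ** 2 / ((n - 1) * stdev(x) ** 2))
beta = 0
t = (b - beta) / s_b

T_CRITICAL = 3.182  # two-tailed, alpha = 0.05, df = n - 2 = 3
slope_is_significant = abs(t) > T_CRITICAL
print(round(t, 2), slope_is_significant)
```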
30
Q

Assumptions for a simple regression analysis (4)

A

remember the assumptions for simple regression analysis
• the relationship between x and y is linear and the equation for a straight line represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of best fit
31
Q

when plotted, residuals should be _____ _____

A

normally distributed
32
Q

alternative residual plot

A

• an alternative residual plot has the x-axis = predicted value and y-axis = residual (or standardized residual)