unit 3 - ch 13 - correlation Flashcards
It’s all about relationships:
x - y
Correlation coefficient: terms
X-variable: independent variable, predictor variable
Y-variable: dependent variable, criterion variable, variable of interest
ANOVA:
2 variables ~ 1 nominal (factor), 1 at least interval (criterion variable)
Correlation:
2 variables ~ both variables at least interval
Notation:
Sample = r
Population = ρ (rho)
Correlation coefficient formula (sample correlation coefficient)
r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² · Σ(y - ȳ)² ]
X variable: x - x̄ (x minus x bar)
Y variable: y - ȳ (y minus y bar)
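A minimal Python sketch of the sample correlation coefficient, built only from the deviations x minus x bar and y minus y bar. The function name `pearson_r` and the five data points are made up for illustration; they are not from the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]   # x minus x bar
    dy = [y - y_bar for y in ys]   # y minus y bar
    shared = sum(a * b for a, b in zip(dx, dy))          # shared variation
    return shared / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data: y tends to rise with x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))   # 0.775
```

A perfectly straight rising line such as y = 2x would give r = +1, and a perfectly straight falling line would give r = -1.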
As the shared variation between the X variable and the Y variable increases, r approaches its upper or lower limit (+1 or -1, respectively)
+1 = perfect positive relationship
-1 = perfect negative relationship
0 = no linear relationship
r is a measure of both
The strength of the relationship and
The direction of the X - Y relationship
r has no unit of measurement
= unitless
r is not affected by the scale of the data
r values can be compared to each other
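A small sketch of why r is unitless: converting either variable to a different unit (a positive rescaling) leaves r unchanged. The data and the helper `pearson_r` are hypothetical and only illustrate the point.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

miles = [10.0, 25.0, 40.0, 55.0, 80.0]     # hypothetical x values, in miles
dollars = [3.0, 6.5, 9.0, 15.0, 18.0]      # hypothetical y values, in dollars

km = [m * 1.609344 for m in miles]         # same x data, different unit
cents = [d * 100 for d in dollars]         # same y data, different unit

r_original = pearson_r(miles, dollars)
r_rescaled = pearson_r(km, cents)
print(abs(r_original - r_rescaled) < 1e-9)  # True
```

This is what makes r values comparable across data sets measured in different units.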
example of x and y
X variable: gas prices
Y variable: miles driven
correlation coefficient examples
EX 1
Income y axis
Height x axis
Positive correlation
Taller people make more money on average ):
EX 2
Customer satisfaction y axis
Difficulty in product setup x axis
Negative
Ikea
EX 3
Gpa y axis
Hat size x axis
No correlation
EX 4
Control y axis
Speed x axis
Negative
Not all relationships are linear
Exponential, linear etc
EX 5
Performance y axis
Emotional involvement (stress) x axis
Curve (upside down U)
Two ends that are low
High end/peak
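A sketch of why a curved relationship can hide from r. The stress and performance numbers are invented to form a symmetric upside-down U; performance depends completely on stress, yet r comes out at zero because r only measures the linear part of a relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical upside-down U: low at both ends, peak in the middle
stress = [0, 1, 2, 3, 4, 5, 6]
performance = [9 - (s - 3) ** 2 for s in stress]   # [0, 5, 8, 9, 8, 5, 0]

r = pearson_r(stress, performance)
print(abs(r) < 1e-12)   # True: no linear relationship, despite a clear curve
```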
Step 6 of HT (hypothesis testing)
EX: Are married men living longer or dying slower? Why?
EX 1
Data:
Alcohol content and calories for 10 beers
Calculating
X = alcohol content
Y = calories
r = 0.957
Testing
Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0.00
H1: ρ ≠ 0.00
Step 2
α = 0.01
Step 3
TS = (observed - expected) / chance
TS = (r - ρ) / standard error of the correlation coefficient
TS = 9.97
ρ = 0.00 (from H0)
Step 4
df = n -2
df = 9
CV = +/- 2.62
Step 5
9.97 > 2.62 = reject
TS > CV = reject
Step 6
As the alcohol content of beer increases, the calories also increase. That is not to say alcohol causes calories; both are results of the beer-making process. The conversion of sugar into alcohol during fermentation produces both alcohol and calories. It is not a perfect correlation because the carbohydrates in beer also contain calories.
Correlation vs. causation
r = 0.957
A higher r ≠ causation
High r does not mean x is causing y
X variable: length of our left arm
Y variable: length of our right arm
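A simulation sketch of the arm example. It assumes, purely for illustration, that a third variable (body height) drives both arm lengths; the 0.44 ratio, the noise levels, and the sample of 500 are invented. Neither arm causes the other in this model, yet r is very high.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed common cause: height (cm) sets both arm lengths, plus small noise
heights = [random.gauss(170, 10) for _ in range(500)]
left_arm = [0.44 * h + random.gauss(0, 1) for h in heights]
right_arm = [0.44 * h + random.gauss(0, 1) for h in heights]

r = pearson_r(left_arm, right_arm)
print(r > 0.9)   # high r, but the left arm does not cause the right arm
```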
cautionary tales
- sample size
- relationship change
- correlation is not causation
- not all relationships are linear
cautionary notes: sample size
At least 10 data points for each X variable and 10 for the Y variable
There can be multiple X variables but only one Y variable
EXAMPLE
X = age of car
X = odometer miles
Y = selling price
10 points per x and 10 per Y = 30 data points
cautionary notes: relationships change
Over time
Outside the range of data
Don’t apply a relationship found in a sample of younger people to a sample of older people
Across space
Geographical
Drop a model into a new place and it sometimes doesn’t hold up (American customers vs. Spanish customers)
cautionary notes: correlation is not causation
EX
Cigarettes and cancer = correlation and causation
Vitamins and better health = correlation
Suntan lotion and coral reef bleaching = correlation and causation
Gran Turismo sales and Subaru Impreza sales = correlation and causation
Correlation → causation: liability, opportunity, beneficiary
cautionary notes: not all relationships are linear
Curve relationship (linear, exponential etc.)
EX 1
Academic performance
Listening to classical music
No relationship
EX 2
X = Cars per 1000 people
Y = Overall average BMI
r = 0.63
Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0.00
H1: ρ ≠ 0.00
Step 2
α = 0.10
Step 3
TS = 2.298
Step 4
DF = n -2
DF = 8
CV = 1.86
Step 5
2.298 > 1.86 = Reject
TS > CV = Reject
Step 6
Rather than car use leading to being overweight, perhaps people who are overweight are more likely to use cars (Y leads to X)
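Steps 3 to 5 of this example can be sketched in Python. The sketch assumes the standard error of the correlation coefficient is the usual sqrt((1 - r^2) / (n - 2)), and that n = 10 (inferred from DF = 8). The critical value is copied from the notes rather than computed. Because r is rounded to 0.63, the result comes out near 2.29, slightly below the 2.298 recorded above.

```python
from math import sqrt

def correlation_t(r, n, rho0=0.0):
    """TS = (r - rho0) / SE, assuming SE = sqrt((1 - r^2) / (n - 2))."""
    standard_error = sqrt((1 - r ** 2) / (n - 2))
    return (r - rho0) / standard_error

r, n = 0.63, 10   # r from the notes; n = 10 inferred from DF = n - 2 = 8
cv = 1.86         # critical value copied from the notes (alpha = 0.10, DF = 8)

ts = correlation_t(r, n)
decision = "reject" if abs(ts) > cv else "fail to reject"
print(round(ts, 2), decision)   # 2.29 reject
```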
Getting to significance
r = √( t² / (t² + (n - 2)) )
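Plug the CV in for t to get the smallest r that reaches significance
EX (from EX 2: CV = 1.86, n - 2 = 8)
2.298 > 1.86 = Reject
TS > CV = Reject
r = √( 1.86² / (1.86² + 8) ) = 0.55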
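The same calculation as a short Python sketch: substituting the critical value for t gives the smallest r that would still be significant. The function name `r_at_critical_value` is made up; the inputs 1.86 and 8 are the CV and DF from EX 2.

```python
from math import sqrt

def r_at_critical_value(t_cv, df):
    """Smallest |r| that reaches significance: sqrt(t^2 / (t^2 + df))."""
    return sqrt(t_cv ** 2 / (t_cv ** 2 + df))

r_min = r_at_critical_value(1.86, 8)   # CV = 1.86, DF = n - 2 = 8
print(round(r_min, 2))                 # 0.55
```

Any sample r of at least about 0.55 in absolute value would be rejected at this alpha and sample size; the observed r = 0.63 clears that bar.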
Sample size affects…
Significance, strength and practicality
getting to significance
r = √( t² / (t² + (n - 2)) )
The larger the sample size, the lower the r needed to reach significance
Significance → statistical question
Strength → labeling (talk/write)
Practicality → business judgment
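A sketch of how sample size drives significance while strength stays the same. It rearranges the formula above to t = r · sqrt(n - 2) / sqrt(1 - r^2) and holds a weak r = 0.2 fixed; the value 0.2 and the three sample sizes are chosen only for illustration.

```python
from math import sqrt

def correlation_t(r, n):
    """Test statistic for a sample correlation r with n data pairs."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

weak_r = 0.2   # hypothetical weak correlation, held constant
t_values = {n: correlation_t(weak_r, n) for n in (10, 100, 1000)}

for n, t in t_values.items():
    print(n, round(t, 2))
# 10 0.58
# 100 2.02
# 1000 6.45
```

The same weak r moves from clearly non-significant to clearly significant as n grows, which is why significance is a statistical question while strength and practicality still have to be judged separately.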
figuring out significance based on r
- high r - significance - strength - practicality
- low r - significance - strength - practicality
- low r - significance - strength - practicality
- middling r - significance - strength - practicality
- High r - Increase - YES - strong/high - useful
- Low r - Decrease - NO - weak/low - not useful
- Low r - Decrease - YES - weak/low - not useful
- Middling r - middle - YES - moderate - maybe
bivariate data
it may be from two samples, but it is still a univariate variable. The type of data described in the examples above and for any model of cause and effect is bivariate data (“bi” for two variables). In reality, statisticians use multivariate data, meaning many variables.
For our work we can classify data into three broad categories
time series data, cross-section data, and panel data
Time series data measures a single unit of observation (say a person, a company, or a country) as time passes.
A second type of data set is for cross-section data. Here the variation is not across time for a single unit of observation, but across units of observation during one point in time.
A third type of data set is panel data. Here a panel of units of observation is followed across time. If we take our example from above, we might follow 500 people (the unit of observation) through time, say ten years, and observe their income, the price paid, and the quantity of the good purchased.
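The three shapes of data can be pictured with small Python structures. Every name and number here is invented; only the layout (what varies: time, units, or both) reflects the definitions above.

```python
# All values are invented; only the shape of each data set matters here.

# Time series: ONE unit of observation followed as time passes.
time_series = {2021: 41000, 2022: 43500, 2023: 45200}   # one person's income by year

# Cross-section: MANY units of observation at one point in time.
cross_section = {"person_a": 45200, "person_b": 38900, "person_c": 61000}  # incomes in 2023

# Panel: MANY units, each followed across time, keyed by (unit, year).
panel = {
    ("person_a", 2022): 43500, ("person_a", 2023): 45200,
    ("person_b", 2022): 37000, ("person_b", 2023): 38900,
}

units = sorted({unit for unit, _ in panel})
years = sorted({year for _, year in panel})
print(units, years)   # ['person_a', 'person_b'] [2022, 2023]
```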
The correlation coefficient, ρ (pronounced rho)
is the mathematical statistic for a population that provides us with a measurement of the strength of a linear relationship between the two variables.
BUT ALWAYS REMEMBER THAT THE CORRELATION COEFFICIENT
DOES NOT TELL US THE SLOPE.
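A sketch of that warning, taking "slope" to mean the slope of the least-squares line (an assumption; the notes do not define it here). Two made-up data sets lie on perfectly straight lines, so both have r = 1, yet their slopes differ by a factor of 100.

```python
from math import sqrt

def r_and_slope(xs, ys):
    """Return (correlation coefficient, least-squares slope of y on x)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), sxy / sxx

xs = [1, 2, 3, 4, 5]
steep = [10 * x for x in xs]      # hypothetical line y = 10x
shallow = [0.1 * x for x in xs]   # hypothetical line y = 0.1x

r_steep, slope_steep = r_and_slope(xs, steep)
r_shallow, slope_shallow = r_and_slope(xs, shallow)
print(round(r_steep, 6), round(slope_steep, 6))       # 1.0 10.0
print(round(r_shallow, 6), round(slope_shallow, 6))   # 1.0 0.1
```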