Prediction Flashcards
(23 cards)
prediction
using scores on one variable (X) to predict scores on another variable (Y), based on the known correlation between them, rXY
- A method for predicting Y from X, using information about the X-Y relationship
Note on Prediction
First, need a sample in which both pieces of information (X and Y) are available, so the correlation between them can be established
Then, use this information in a new sample where only X scores are available, in order to predict the Y scores
Features of Prediction
- often temporal asymmetry: X is measured on an earlier occasion than Y
- Y is referred to as a dependent or criterion variable
- X is referred to as an independent or predictor variable
Simple VS Multiple Regression
- simple regression: one X and one Y
- multiple regression: more than one X
there is a focus on simple regression in this course
Regression Equation
also called a prediction equation
Y’ = a + bX (like the straight-line equation from high school)
- equation describes the straight line that best fits the data points in 2-dimensional (X-Y) space
Mapping Correlation to Prediction
Y’ or Ŷ = predicted score on Y
with no information from X (i.e. r = 0), the best prediction is the mean: Y’ = My
Step 1: Convert X scores to z-scores: Zx = (X - Mx)/Sx
Step 2: Apply the prediction equation in z-scores: Zy’ = rZx
Step 3: Convert Zy’ back to a predicted raw score: Y’ = My + Zy’Sy
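The three steps can be sketched in Python. This is a minimal illustration: the X and Y values are made up, and `predict` is a hypothetical helper name, not something from the course.

```python
# Minimal sketch of z-score prediction (example data are made up).
import statistics as st

X = [2, 4, 6, 8, 10]
Y = [1, 3, 2, 5, 4]

Mx, My = st.mean(X), st.mean(Y)
Sx, Sy = st.stdev(X), st.stdev(Y)   # sample SDs (n - 1 denominator)
n = len(X)

# Pearson r from paired z-scores
r = sum(((x - Mx) / Sx) * ((y - My) / Sy) for x, y in zip(X, Y)) / (n - 1)

def predict(x):
    zx = (x - Mx) / Sx            # Step 1: convert X to Zx
    zy_pred = r * zx              # Step 2: Zy' = r * Zx
    return My + zy_pred * Sy      # Step 3: Y' = My + Zy' * Sy
```

When x = Mx, Zx = 0 and the prediction falls back to the mean My.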
If r = +1.0 then Zy’ = Zx (a thought experiment)
in practice, would almost never get r = +1.0
- the predicted score would exactly equal the observed score: prediction without error
- a perfect relationship between the two variables
If r is between 0 and 1,
X provides some information to help us predict Y, but the relationship is not perfect (other factors and noise involved)
Note 2 on Prediction
The amount by which our prediction differs from the mean (0, in z-score terms) will depend on the strength of r
The optimal prediction is given by:
Zy’ = rZx (prediction equation in terms of z-scores)
two extremes: r = 0 (predict the mean for everyone) and r = ±1 (perfect prediction)
- prediction of Y based on X lies in between these extremes
The prediction equation on a graph
- in z-score form, the regression line passes through the origin and has a slope of r
- prediction equation describes the line of best fit through the scatter plot of Zy against Zx
Building predictions from raw scores
Y’ = a + bX
where b = r(Sy/Sx)
and a = My - bMx
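A sketch of the raw-score equation using the b and a formulas above; the data values are made up for illustration.

```python
# Raw-score regression: Y' = a + bX, with b = r(Sy/Sx), a = My - bMx.
# Example data are made up for illustration.
import statistics as st

X = [2, 4, 6, 8, 10]
Y = [1, 3, 2, 5, 4]

Mx, My = st.mean(X), st.mean(Y)
Sx, Sy = st.stdev(X), st.stdev(Y)
n = len(X)
r = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / ((n - 1) * Sx * Sy)

b = r * Sy / Sx    # slope
a = My - b * Mx    # intercept

def predict(x):
    return a + b * x
```

This gives the same predictions as the z-score route: substituting Zy’ = rZx into Y’ = My + Zy’Sy and expanding yields exactly a + bX.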
Jamovi for Linear Regression
“Estimate” refers to the raw-score regression: the intercept row gives a and the X row gives b
- jamovi will produce standardised estimates, which refer to the z-score equation, if you select “standardised estimate” under Model Coefficients
r^2
r^2 = proportion of variance accounted for
how well we are making predictions, to know if we should keep using this model or not
Y - My = (Y’ - My) + (Y - Y’)
Deviation = prediction + error (residual)
deviation of Y scores from the mean = deviation of predicted scores from the mean + difference between real and predicted scores
r^2 in equations
Y - My = (Y’ - My) + (Y - Y’)
squaring and summing over all cases (the cross-product term sums to zero for the least-squares line):
Σ(Y - My)^2 = Σ(Y’ - My)^2 + Σ(Y - Y’)^2
SS(total) = SS(regression) + SS(residual)
Finding scores due to prediction (r^2)
r^2 = Σ(Y’ - My)^2 / Σ(Y - My)^2
= SS(regression)/SS(total)
proportion of variability in y scores associated with changes in X (i.e. due to prediction)
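The decomposition and the r^2 formula can be checked numerically; the data below are invented for illustration.

```python
# Check SS(total) = SS(regression) + SS(residual) and r^2 = SS(reg)/SS(total).
# Example data are made up.
import statistics as st

X = [2, 4, 6, 8, 10]
Y = [1, 3, 2, 5, 4]

Mx, My = st.mean(X), st.mean(Y)
Sx, Sy = st.stdev(X), st.stdev(Y)
n = len(X)
r = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / ((n - 1) * Sx * Sy)
b = r * Sy / Sx
a = My - b * Mx
Y_pred = [a + b * x for x in X]

ss_total = sum((y - My) ** 2 for y in Y)
ss_reg = sum((yp - My) ** 2 for yp in Y_pred)
ss_res = sum((y - yp) ** 2 for y, yp in zip(Y, Y_pred))
```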
Finding scores not due to prediction
1 - r^2 = Σ(Y - Y’)^2 / Σ(Y - My)^2
= SS(residual)/SS(total)
proportion of variability in Y scores not associated with changes in X (i.e. not due to prediction)
regression line is defined so that ss (residual) is minimised (based on least squares criterion)
no other straight line will generate a smaller ss (residual) than y’ = a+bX
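The least-squares claim can be spot-checked: perturbing the fitted a or b away from the least-squares values always increases SS(residual). The data and perturbation sizes below are arbitrary choices for illustration.

```python
# Spot-check of the least-squares criterion: any perturbed line has a
# larger SS(residual) than the fitted one. Example data are made up.
import statistics as st

X = [2, 4, 6, 8, 10]
Y = [1, 3, 2, 5, 4]

Mx, My = st.mean(X), st.mean(Y)
Sx, Sy = st.stdev(X), st.stdev(Y)
n = len(X)
r = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / ((n - 1) * Sx * Sy)
b = r * Sy / Sx
a = My - b * Mx

def ss_res(a_, b_):
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(X, Y))

best = ss_res(a, b)
# the best any perturbed line manages is still worse than the fit
worse = min(ss_res(a + da, b + db)
            for da in (-0.5, 0, 0.5) for db in (-0.1, 0, 0.1)
            if (da, db) != (0, 0))
```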
Error in prediction when r is large
- y values will cluster around y’ values
- a larger proportion of Sy^2 is accounted for by prediction
Error in prediction when r is small
- y values will vary more widely around y’ values
- smaller proportion of Sy^2 is accounted for by prediction
Note on r^2
r and r^2 convey information about how well we can predict scores on Y for the sample as a whole
- what about how well we are predicting individual subjects? first look at the assumptions being made when conducting a prediction analysis
Assumptions for Linear Regression
in population, X and Y form a bivariate normal distribution
- both variables are normally distributed
For each X there is a normal distribution of Y scores
- Y’ is what we expect Y to be on average, given that value of X
- we won’t get exactly Y’ every time, but the Y scores are normally distributed around it, so values close to Y’ are the most likely
- Y’ is the mean of that conditional distribution of Y scores
Linearity
- X and Y are linearly related
Homoscedasticity
- variance of the distribution of Y scores for each X score is the same
- each of these Y distributions should have the same standard deviation
When assumptions of linear regression are met:
can use prediction to estimate:
- percentage of cases that are a certain distance from their predicted value
- probability of a score being a certain distance from its predicted value
Standard Error of Estimate
SD of distribution of observed scores around corresponding predicted score
- measures predictive error
- under the assumption of bivariate normality, Syx is the standard deviation of the normal distribution of Y scores for any value of X
equation:
Syx = √(Σ(Y - Y’)^2 / (n - 2))
alternative, algebraically equivalent form (useful when only Sy and r are known):
Syx = Sy √((n - 1)(1 - r^2) / (n - 2))
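Both forms of Syx can be computed and compared; the example data are made up.

```python
# Standard error of estimate: definitional form vs. the Sy-based form.
# Example data are made up for illustration.
import math
import statistics as st

X = [2, 4, 6, 8, 10]
Y = [1, 3, 2, 5, 4]

Mx, My = st.mean(X), st.mean(Y)
Sx, Sy = st.stdev(X), st.stdev(Y)
n = len(X)
r = sum((x - Mx) * (y - My) for x, y in zip(X, Y)) / ((n - 1) * Sx * Sy)
b = r * Sy / Sx
a = My - b * Mx

ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
syx_def = math.sqrt(ss_res / (n - 2))                       # sqrt(SS(res)/(n-2))
syx_alt = Sy * math.sqrt((n - 1) * (1 - r ** 2) / (n - 2))  # Sy-based form

# Under bivariate normality, roughly 68% of Y scores fall within
# +/- 1 Syx of their predicted value Y'.
```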