250B Midterm Flashcards
What is regression towards the mean?
The predicted z score on Y is z-hat_Y = r * z_X, so the only time a z score of 1 on X predicts a z score of 1 on Y is when the correlation is 1.0 – if the correlation is < 1, then no matter where you are on X, you will be predicted closer to the mean on Y, and the smaller the correlation, the stronger that pull toward the mean
A placebo/control group accounts for regression to the mean: both groups will experience a tendency to regress to the mean, so if the treatment group shows a statistically significant difference beyond the control group, it can be attributed to treatment and not just regression to the mean – see the simulation sketch below
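A minimal simulation sketch of the idea (numpy assumed; all values illustrative): generate correlated test/retest scores, select extreme scorers on the first test, and watch their mean on the second test fall back toward the grand mean.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 0.5, 100_000

# Standardized test/retest scores with true correlation r
x = rng.standard_normal(n)
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

top = x > 1.5                   # select extreme scorers on the first test
print(round(x[top].mean(), 2))  # ~1.94: far above the mean on X
print(round(y[top].mean(), 2))  # ~r * 1.94: pulled back toward the mean on Y
```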
How does an individual case contribute to SStotal, SSresidual, and SSregression?
SStotal: sum(Y - Ybar)^2 –> to the extent that an obs is different from grand mean, SStotal increases
SSresidual: sum(Yhat - Y)^2 –> to the extent that an obs is different from its predicted value, SSresid increases
SSregression: sum(Yhat - Ybar)^2 –> to the extent that an obs's predicted value differs from the grand mean, SSreg increases
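A quick numpy sketch with illustrative data, confirming that the three pieces add up (SStotal = SSregression + SSresidual):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 1., 4., 3., 6.])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
a = y.mean() - b * x.mean()                          # intercept
yhat = a + b * x

ss_total = ((y - y.mean())**2).sum()
ss_resid = ((y - yhat)**2).sum()
ss_reg   = ((yhat - y.mean())**2).sum()
print(ss_total, ss_reg + ss_resid)  # equal (up to rounding)
```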
Why are cases further from the mean on X more important in determining a regression line?
Since the deviation (Z) scores for these observations are larger, they dominate the cross-products that drive the correlation between X and Y, and correlation plays a large part in determining the regression line – through the slope
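One way to see it: the least-squares slope weights each case by its deviation from the mean of X,

\[
b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\]

so a case with a large |X_i - Xbar| contributes a large cross-product and can swing the slope far more than a case sitting near Xbar.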
What is standard error of estimate and does standard error of estimate (SEE) change as a function of r?
SEE is a measure of how much Y scores vary around their predicted values at each value of X (conditional on X)
As r^2 increases, SEE decreases!
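For reference, one common form of the SEE makes the dependence on r explicit:

\[
s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} = s_Y \sqrt{\frac{(1 - r^2)(N - 1)}{N - 2}},
\]

so as r^2 grows toward 1, the SEE shrinks toward 0.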
How do SSyhat (SSreg) and SSresid change as r^2 goes from 0 to 1?
SSres = SSy(1 - r^2), so as r^2 increases, the multiplier of SSy decreases, so SSresid decreases --> makes sense because as r^2 increases, you are explaining more variance by knowing X, so your unexplained variance should decrease
SSreg = r^2 * SSy, so as r^2 increases, the multiplier of SSy increases and SSreg increases --> makes sense because your explained variance should increase!
What is the proportional improvement in prediction? How is it different from PRE?
PRE (proportional reduction in error) = r^2: the proportion by which squared error in predicting Y is reduced when we know X.
PIP = 1 - sqrt(1 - r^2): the proportional reduction in the size of the SEE, and hence in the width of the confidence interval on our prediction – see the worked contrast below
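A quick worked contrast, taking r = .50 as an example:

\[
\text{PRE} = r^2 = .25, \qquad \text{PIP} = 1 - \sqrt{1 - .25} \approx .13,
\]

so a correlation of .50 cuts squared error by 25% but shrinks the SEE (and the prediction interval width) by only about 13%.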
How is r^2 in regression similar to eta^2 in ANOVA?
Both represent magnitude of effect – how much overall variability in DV can be attributed to treatment effect – so same interpretation
Remember, eta^2 = SStreat/SStotal = SSreg/SSy, so they are basically the same thing
Describe assumptions of homogeneity of variance in arrays and normality in regression. When are these assumptions most critical?
Homogeneity of variance: variance of Y for each value of X is constant – necessary to ensure Sy|x is representative of variance of each array
Normality: in the population, values of Y corresponding to any specified value of X are normally distributed (aka errors are normally distributed) – necessary because we use the standard normal distribution
These assumptions are most important when we want to test hypotheses about b or set confidence limits on b or Y
What’s the biz when residuals violate distributional assumptions?
When homoscedasticity is violated, the estimate of error variance is meaningless because the variance changes as a function of Yhat, so no single number can be used as an error term, oh no!
Describe the assumption of bivariate normality in correlation when we are using r as an estimator of ρ
Bivariate normality applies when X and Y are both random variables – i.e., when we are interested in correlation rather than fixed-X regression
We assume we are sampling X and Y from bivariate normal distribution – if you slice the distribution along either direction, you get normal conditional distributions (conditional on specific value of X or Y)
Marginal distribution – distribution of X over all values of Y or vice versa – also ~Normal
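Concretely, if (X, Y) is bivariate normal, each slice is itself normal:

\[
Y \mid X = x \;\sim\; N\!\left(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X),\; \sigma_Y^2 (1 - \rho^2)\right),
\]

i.e., the regression line is the conditional mean, and the conditional variance is constant (homoscedasticity falls out for free).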
How would you develop a prediction interval around Yhat in a new sample and what properties would that prediction interval have? (no formulas just concept)
Prediction interval around a future prediction in a new sample needs to take into account both the uncertainty in the mean of Y conditional on a fixed value of X and the variability of observations around that conditional mean
Properties: the farther a case is from the mean on X, the more uncertainty in the prediction and the wider the interval
In a regression context, how do you test ρ = 0 using an ANOVA table and F-test comparing two variance estimates? How is this analogous to a one-way between groups ANOVA?
You do it like normal – divide MSmodel (df = 1 for a single predictor) by MSresid (df = N - 2). The F statistic you get will equal the squared t statistic for b (or r).
This is like a one way ANOVA because we can think of it as looking at the correlation between group membership and DV
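A minimal sketch of the computation with illustrative data (numpy/scipy assumed), showing that the regression F equals the squared t for r:

```python
import numpy as np
from scipy import stats

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2., 1., 4., 3., 6., 5.])
n = len(x)

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
yhat = y.mean() + b * (x - x.mean())

ms_model = ((yhat - y.mean())**2).sum() / 1   # df = 1 (one predictor)
ms_resid = ((y - yhat)**2).sum() / (n - 2)    # df = N - 2
F = ms_model / ms_resid
p = stats.f.sf(F, 1, n - 2)

# Same test via r: F equals t**2
r, _ = stats.pearsonr(x, y)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(F, t**2, p)
```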
Given a covariance, correlation, slope and intercept, how would these values change if deviation scores were analyzed, or Z-scores analyzed?
Deviation scores: covariance, correlation, and slope are all unchanged (subtracting a constant shifts only the means); the intercept becomes 0 because both means are now 0
Z-scores: correlation does not change; covariance changes to the correlation (bc corr = standardized cov); the slope becomes r (as does its interpretation -- slope is now change in SD units of Y for a one-SD change in X); the intercept becomes 0 bc the mean of Z scores is 0
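A quick numpy check of both cases, with illustrative data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 1., 4., 3., 6.])

def fit(x, y):
    """Return (slope, intercept) of the least-squares line."""
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return b, y.mean() - b * x.mean()

# Deviation scores: slope unchanged, intercept goes to 0
xd, yd = x - x.mean(), y - y.mean()
print(fit(x, y), fit(xd, yd))

# Z-scores: slope becomes r, intercept goes to 0
zx = xd / x.std(ddof=1)
zy = yd / y.std(ddof=1)
print(fit(zx, zy), np.corrcoef(x, y)[0, 1])
```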
What are some possible roles of a LOESS function in regression analysis?
We do this in the exploratory stages or if it looks like our data are curvilinear
It works by averaging (locally fitting) Y values close to each target value of the predictor, slice by slice across the predictor variable – see the sketch below
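A minimal sketch using the lowess smoother from statsmodels (assumed installed; frac controls how wide each local slice is):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # curvilinear data

# Returns sorted (x, smoothed y) pairs; overlay on a scatterplot
# to eyeball whether a straight line is reasonable
smoothed = lowess(y, x, frac=0.3)
print(smoothed[:5])
```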
In a single predictor case, is the t-test (or F) of b the same as the test of r, that is, will the result always be the same?
Yes, but only in the single predictor case – both are testing whether the linear relation between X and Y is zero, as shown below
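Both reduce to the same t statistic in the single-predictor case:

\[
t = \frac{b}{s_b} = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}, \qquad df = N - 2.
\]

With multiple predictors the equivalence breaks down, because each b is then tested holding the other predictors constant.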
What are some graphical methods of exploring properties of residuals in regression? What should residuals look like when the data meet the regression assumptions?
Make a histogram – residuals should be normally distributed about 0
Plot residuals as a fcn of IV – residuals should not vary systematically with IV
Plot residuals as a fcn of Yhat – residuals should not change as a function of Yhat, should be scattered randomly
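A sketch of all three plots with illustrative simulated data (numpy and matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # illustrative data

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
yhat = y.mean() + b * (x - x.mean())
resid = y - yhat

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(resid, bins=20)            # should be roughly normal around 0
axes[0].set_title('Histogram of residuals')
axes[1].scatter(x, resid, s=8)          # no systematic pattern vs the IV
axes[1].set_title('Residuals vs X')
axes[2].scatter(yhat, resid, s=8)       # no fanning / curvature vs Yhat
axes[2].set_title('Residuals vs Yhat')
plt.tight_layout()
plt.show()
```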
Describe the process of testing the difference between two regression slopes estimated in two different samples. How is this test similar to an independent groups t-test? Is this test the same as the test of the difference in two correlations, why or why not?
First you find SE(b1 - b2) using the variance sum law (the samples are independent, so the sampling variances of the two slopes add), and then form a t statistic from the difference divided by that standard error
It's similar to an independent groups t-test because it has the same structure: a difference between two independent estimates divided by the standard error of that difference, just like (M1 - M2) / SE(M1 - M2)
It is not the same as a test of two correlations because the correlation is a standardized statistic – slopes can be equal across two groups, but if groups differ on variance on X or Y, correlations will differ
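In symbols (a sketch using separate slope variances via the variance sum law; some texts pool the two residual variances before computing the standard errors):

\[
t = \frac{b_1 - b_2}{\sqrt{s_{b_1}^2 + s_{b_2}^2}}, \qquad df = n_1 + n_2 - 4,
\]

with 4 df lost because two slopes and two intercepts are estimated.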