Lecture 3: Regression Assumptions Alt 2 Flashcards
(40 cards)
Why are statistical assumptions necessary in regression-based analyses?
To make inferential techniques tractable and generalizable, given the impossibility of designing tests for every possible data configuration.
When are violations of regression assumptions not necessarily cause for concern?
Particularly in the case of distributional violations in large samples.
What is the assumption of linearity in multiple regression?
The relationship between independent and dependent variables is assumed to be linear.
What happens if the true relationship is nonlinear in a regression model?
A linear model may misrepresent it or underestimate the association.
How can nonlinearity be modelled in a linear regression framework?
Using transformations or quadratic terms.
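A minimal sketch of the quadratic-term approach, assuming illustrative simulated data (the variable names and coefficients below are made up, not from the lecture):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a curvilinear relationship for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 1, 200)
df = pd.DataFrame({"x": x, "y": y})

# The model stays linear in its coefficients, so ordinary OLS applies;
# I(x**2) adds the quadratic term inside the formula.
fit = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(fit.params)
```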
What does the assumption of normality of residuals entail?
Residuals are assumed to be normally distributed.
How important is the normality of residuals for inference in large samples?
Largely unimportant for inference, especially in samples larger than about 10.

Do violations of residual normality bias regression estimates or p-values significantly?
No, simulation studies show regression remains robust across a variety of skewed distributions.
What is a downside of using non-parametric tests to address normality violations?
They often introduce more problems than they solve (e.g., they test a different hypothesis than the mean-based model and make effect sizes harder to interpret).
What are univariate outliers and how are they defined?
Extreme values on a single variable, commonly defined as scores more than 3 (or, more stringently, 3.29, corresponding to p < .001) standard deviations from the mean.
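A quick sketch of flagging cases by the |z| > 3.29 rule (the data here are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = np.append(rng.normal(50, 10, 500), 120.0)  # one planted extreme value

z = stats.zscore(scores)
print(np.where(np.abs(z) > 3.29)[0])  # indices of univariate outliers
```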
What are multivariate outliers in regression?
Unusual combinations of scores across variables.
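The card does not name a detection method; one common choice (an assumption here, not stated in the lecture) is Mahalanobis distance, which is large for unusual combinations of scores even when each score looks ordinary on its own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 300)
X = np.vstack([X, [2.5, -2.5]])  # plausible on each variable alone, unusual jointly

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)  # squared Mahalanobis distance

# Compare against a chi-square quantile with df = number of variables.
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
print(np.where(d2 > cutoff)[0])
```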
What is Cook’s distance used for?
To quantify how much the regression changes when a data point is removed.
What value of Cook’s distance indicates a very influential or possibly problematic point?
Greater than 1.
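A sketch of computing Cook's distance with statsmodels (data simulated so that the last point has both high leverage and a large residual):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.append(rng.normal(0, 1, 100), 4.0)                # last point: high leverage
y = np.append(2 * x[:-1] + rng.normal(0, 1, 100), -5.0)  # ...and far off the trend

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance
print(np.where(cooks_d > 1)[0])  # cases flagged by the D > 1 rule of thumb
```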
What should be done with influential data points in regression analysis?
They should be examined for accuracy, and their impact reported if they substantively affect conclusions.
Under what conditions are extreme values less problematic due to the Central Limit Theorem?
When they are real and the sample size is large.
What should be done with extreme values in small sample sizes?
The regression should be run both including and excluding them, and this should be reported to the reader.
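A sketch of that sensitivity analysis: fit the model with and without the extreme case and report both sets of estimates (data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.append(rng.normal(0, 1, 30), 4.0)   # small sample; extreme case is the last row
y = np.append(2 * x[:-1] + rng.normal(0, 1, 30), -5.0)
X = sm.add_constant(x)

full = sm.OLS(y, X).fit()
keep = np.arange(len(y)) != len(y) - 1     # drop the extreme case
trimmed = sm.OLS(y[keep], X[keep]).fit()

# Report both so the reader can judge the impact of the extreme case.
print("including:", full.params)
print("excluding:", trimmed.params)
```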
What is the assumption of homoskedasticity in regression?
Equal variance of residuals across all values of the independent variable(s) (predictors).
What can violations of homoskedasticity (heteroskedasticity) lead to?
Inflated Type I error rates (through biased standard errors) and, when the heteroskedasticity reflects model misspecification, biased coefficient estimates.
How can heteroskedasticity be tested?
Via residual plots or formal tests like the White test.
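A sketch of the White test via statsmodels' het_white, on data simulated so that residual spread grows with the predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x, 200)  # error variance increases with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(fit.resid, X)
print(lm_pvalue)  # a small p-value rejects homoskedasticity
```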
In what kinds of data is heteroskedasticity more likely to occur?
In highly skewed data with very long tails.
What are some causes of heteroskedasticity?
Un-modelled variables (e.g., a moderator whose effect differs across levels of the IV) and nonlinear effects.
What are remedies for heteroskedasticity?
Transforming the dependent variable, modelling potential moderating variables, or applying heteroskedasticity-consistent standard errors.
What does applying heteroskedasticity-consistent standard errors correct?
Only inference issues, not coefficient bias.
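In statsmodels this is a one-argument change at fit time; note that the point estimates are identical and only the standard errors move, matching the card (data simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, x, 200)  # heteroskedastic errors
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                    # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-consistent errors

print(ols.params, robust.params)  # identical coefficient estimates
print(ols.bse, robust.bse)        # only the standard errors differ
```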
What is arguably the most critical assumption in regression analysis?
Independence of observations.
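The card names no diagnostic; for serially ordered data, one common check (an addition here, not from the lecture) is the Durbin-Watson statistic, which sits near 2 when residuals are uncorrelated:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
x = np.arange(100.0)
e = np.zeros(100)
for t in range(1, 100):              # AR(1) errors violate independence
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 2 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))      # well below 2 signals positive autocorrelation
```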