W5 Flashcards
(38 cards)
Regression Assumptions – Importance
We need to follow certain assumptions in regression to make valid conclusions. Violating them can lead to misleading or biased results.
Why assumptions matter
If you break assumptions: (a) You risk biased estimates that reduce the credibility of your results. (b) You might draw flawed conclusions from your tests.
How to deal with assumptions
2 steps: (a) Diagnostics: Use tools and tests to check if assumptions hold. (b) Solutions: If assumptions are violated, use techniques to fix or minimize the issue.
Assumption 1: Linear relationships
We assume a linear relationship between independent and dependent variables. That means: when plotted, the data should form a straight line.
What happens if it’s not linear?
If the relationship is curved (like a U-shape or step pattern), a straight regression line will systematically mis-fit the data. That means your estimates and conclusions could be wrong.
Spotting non-linearity
Use a residual plot (predicted values vs. residuals). If the residuals scatter randomly around a horizontal line at zero, the linearity assumption holds. If there’s a curve or another systematic pattern, the relationship is not linear.
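The residual check on this card can also be done numerically. A minimal sketch with NumPy: the simulated x/y data and the “fit a parabola to the residuals” trick are illustrative assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(0, 1, 200)  # deliberately curved relationship

# Fit a straight line, then inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Numeric stand-in for eyeballing the plot: fit a parabola to the
# residuals. A clearly non-zero quadratic coefficient signals curvature.
curve_coef = np.polyfit(fitted, residuals, 2)[0]
print(f"quadratic coefficient in residuals: {curve_coef:.4f}")
```

On truly linear data the quadratic coefficient would sit near zero; here it does not, flagging the violation.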
Real-world examples of non-linearity
Quadratic (U-shaped) patterns, step relationships (e.g. from dummy variables), and sudden jumps or gaps in the trend line all violate the linearity assumption.
What to do if it’s not linear?
You can transform the variables! Example: use a log transformation (like taking the natural log of the variable) to straighten the pattern.
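A small sketch of the log-transformation idea from this card, using made-up data that is curved on the raw scale but linear after taking natural logs (the specific data-generating formula is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 200)
# Multiplicative, curved relationship: y grows like the square root of x
y = 3 * x**0.5 * np.exp(rng.normal(0, 0.1, 200))

# Taking logs of both sides gives log(y) = log(3) + 0.5*log(x) + noise,
# which is a straight line on the log scale.
log_x, log_y = np.log(x), np.log(y)
slope, intercept = np.polyfit(log_x, log_y, 1)
print(f"slope on the log scale: {slope:.3f}")
```

The fitted slope lands close to the true exponent (0.5), showing the pattern has been straightened.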
Regression Assumptions – Overview
Regression assumptions are needed to make sure your conclusions are accurate. If violated, they can lead to biased estimates and wrong significance tests. We use diagnostics and solutions to check and improve robustness.
Assumption 1: Linear Relationships
Assumes a straight-line relationship between independent and dependent variables. The regression line should cut through the scatter in a straight path.
Diagnosing Linearity
Plot the residuals (errors) vs. fitted values. A random scatter around a flat horizontal line at zero means the relationship is linear; a curve or other systematic pattern means it is not.
Fixing Non-Linearity
You can use transformations like a log or quadratic transformation to straighten the relationship and reduce bias.
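The quadratic-transformation option on this card can be sketched by adding an x² term and comparing residual spread. This is a minimal illustration with simulated U-shaped data, not a prescribed workflow:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-3, 3, 200)
y = 1 + x**2 + rng.normal(0, 0.3, 200)  # U-shaped relationship

# Residuals from a straight-line fit vs. from a fit with a quadratic term
linear_resid = y - np.polyval(np.polyfit(x, y, 1), x)
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)
print(f"residual SD, linear fit:    {np.std(linear_resid):.2f}")
print(f"residual SD, quadratic fit: {np.std(quad_resid):.2f}")
```

The quadratic fit leaves much smaller, patternless residuals, which is exactly what “straightening the relationship” buys you.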
Assumption 2: Normally Distributed Errors
In regression, we assume that the residuals (errors) are normally distributed with a mean of 0 and constant variance. Violations can make significance tests unreliable.
Diagnosing Normality
Use histograms or Q-Q plots of the residuals. If they form a bell shape or line up well with the normal line, you’re good.
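A numeric version of the Q-Q check on this card: compare sample quantiles of the residuals with standard-normal quantiles and see how well they line up. The simulated residuals stand in for real regression residuals (an assumption for illustration); `NormalDist.inv_cdf` from the standard library supplies the theoretical quantiles.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 500)  # stand-in for regression residuals

# Q-Q idea in numbers: sample quantiles vs. standard-normal quantiles.
# A correlation very close to 1 means the points would sit on the
# straight reference line of a Q-Q plot.
probs = np.linspace(0.05, 0.95, 19)
sample_q = np.quantile(residuals, probs)
theory_q = np.array([NormalDist().inv_cdf(p) for p in probs])
qq_corr = np.corrcoef(sample_q, theory_q)[0, 1]
print(f"Q-Q correlation: {qq_corr:.4f}")
```

Heavily skewed or heavy-tailed residuals would pull this correlation noticeably below 1.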
Fixing Normality Issues
You can transform the variables (like taking the log) or split the data into groups using dummy variables.
Assumption 3: Independence of Error Terms
Each error term should be independent of the others. This means one observation’s error shouldn’t be related to another’s (e.g. one person’s error shouldn’t predict someone else’s).
When Independence is Violated
Violations happen with clustered data or repeated measures (e.g. surveying the same people over time). Use time plots or residual plots by group to detect patterns.
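One way to check independence over time is the lag-1 autocorrelation of the errors. A minimal sketch with simulated errors where each one carries over part of the previous one (the 0.8 carry-over factor is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
# Autocorrelated errors: each error inherits 80% of the previous one
errors = np.zeros(300)
for t in range(1, 300):
    errors[t] = 0.8 * errors[t - 1] + rng.normal(0, 1)

# Lag-1 autocorrelation: near 0 means independent errors; a clearly
# positive value (as here) means the independence assumption fails.
lag1 = np.corrcoef(errors[:-1], errors[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.2f}")
```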
Assumption 4: Homoscedasticity
The variance of the residuals should be constant across all levels of the predictors. If the spread of the residuals widens or narrows systematically, the assumption is violated.
Diagnosing Homoscedasticity
Use residual vs. fitted plots or scale-location plots. A fan or cone shape means heteroscedasticity (bad).
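The fan/cone pattern on this card can be detected numerically: if the size of the residuals trends upward with the fitted values, the spread is not constant. A minimal sketch with simulated data whose noise grows with x (the data and the |residual|-vs-fitted correlation check are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 300)
y = 2 + 3 * x + rng.normal(0, 0.5 * x, 300)  # noise grows with x: a fan shape

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Crude numeric check: a clearly positive correlation between the fitted
# values and |residuals| signals heteroscedasticity.
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(f"corr(fitted, |residual|): {spread_corr:.3f}")
```

For homoscedastic data this correlation would hover near zero.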
Assumption 5: No Multicollinearity
Independent variables shouldn’t be too highly correlated with each other. If they are, the model can’t separate their individual effects, and the coefficient estimates become unstable.
Diagnosing Multicollinearity
Check the correlation matrix of the predictors. Values near +1 or −1 suggest high multicollinearity.
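A small sketch of the correlation-matrix check on this card. The predictor names (income, wealth, age) and the data-generating rules are made up for illustration; wealth is deliberately built as a near-copy of income.

```python
import numpy as np

rng = np.random.default_rng(4)
income = rng.normal(50, 10, 200)
wealth = income * 2 + rng.normal(0, 3, 200)  # nearly a rescaled copy of income
age = rng.normal(40, 12, 200)                # unrelated predictor

X = np.column_stack([income, wealth, age])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

The income–wealth entry sits near +1, flagging multicollinearity, while the entries involving age stay near 0.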
Fixing Multicollinearity
Drop one variable, combine them, or test if it really matters (does it change conclusions?). You can also use regularisation methods in more advanced courses.
Why use standardisation in regression?
To compare variables that use different scales (e.g. 1–10 vs. 1–6) by converting them to standard deviations.
How do you standardise a variable?
Use: Xstd = (X − mean) / standard deviation. This rescales the variable to have mean 0 and SD 1.
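The formula on this card in one line of NumPy (the example scores are made up; note this uses the population SD, matching the plain formula above):

```python
import numpy as np

scores = np.array([3.0, 6.0, 7.0, 4.0, 10.0])  # e.g. responses on a 1-10 scale

# X_std = (X - mean) / standard deviation
z = (scores - scores.mean()) / scores.std()
print(np.round(z, 2))
print(round(float(z.mean()), 10), round(float(z.std()), 10))  # mean 0, SD 1
```

After standardising, each value is expressed in standard deviations from the mean, so items on a 1–10 scale and a 1–6 scale become directly comparable.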