Linear regression (week 3-5) Flashcards
Linear regression formula
Y = alpha + beta_i X_i + E
What is Y in Linear Regression?
What is alpha in Linear Regression?
What is Beta in Linear Regression?
What is X in Linear Regression?
What is E in Linear Regression?
Y is dependent var
Alpha is intercept parameter
Beta is regression coefficient
X is explanatory variable
E is the i.i.d. error term, assumed E ~ N(0, σ²)
Estimated Linear Regression
Same formula, but with hats on the coefficients (the estimates α̂ and β̂), and E replaced by the residual Ɛ
E vs Ɛ?
E is the i.i.d. error term (it captures the uncertainty in the true model); Ɛ is the residual (the difference between the data and the fitted model). Ideally Ɛ behaves like E: it contains the uncertainty plus whatever the model failed to capture
What if its not linear?
Apply a log transformation (e.g. regress log Y on X) so the relationship becomes linear
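The log trick is useful when the relationship is multiplicative/exponential; a minimal sketch (toy data, assumed form Y ≈ A·e^(Bx), not from the course):

```python
import math

# If Y = A * exp(B x), taking logs gives log(Y) = log(A) + B x,
# which is linear in x and can be fitted with the usual formulas.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.7, 7.4, 20.1, 54.6]          # roughly e^x, so clearly nonlinear
log_y = [math.log(v) for v in y]    # now roughly linear in x
```

After the transform, fit alpha and beta to (x, log_y) exactly as in the linear case.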
How to minimize Ɛ
Minimize the SSR (Sum of Squared Residuals):
1. SSR = Σ(Y − Ŷ)²
2. Differentiate with respect to alpha and beta
3. Set each derivative to 0 and solve
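The three steps above lead to a closed-form solution; a minimal sketch on toy data (the numbers are made up for illustration):

```python
# Setting dSSR/dalpha = 0 and dSSR/dbeta = 0 gives:
#   beta_hat  = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   alpha_hat = y_bar - beta_hat * x_bar

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x, toy data

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
           sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

print(alpha_hat, beta_hat)  # beta_hat should come out close to 2
```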
what is σ̂²?
σ̂² = (Σ(y − ŷ)²) / (n − 2), the estimate of the error variance
why divide by n − 2 in σ̂²?
Two parameters (alpha and beta) are estimated, leaving n − 2 degrees of freedom; dividing by n − 2 makes σ̂² an unbiased estimator
what if σ̂² is too large?
Standardize the estimators: use α̃ and β̃ (alpha and beta with a tilde on top) and run a hypothesis test on both α̃ and β̃
Goodness-of-fit measured using
R^2 = regression SS / Total SS
ranges between 0% and 100%
calc test stats
compare against the F distribution with (1, n − 2) degrees of freedom: F(1, n−2)
R^2 means?
The proportion of the total data variability explained by the model
Total Sum Of Square means?
Deviations between data and sample mean (total variability)
Regression SS means?
Deviations between model estimate and sample mean (data variability explained by model)
Residual SS means?
Deviations between data and model estimate (data variability unexplained by model)
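The three sums of squares, R², and the F statistic can be checked numerically; a minimal sketch on toy observed and fitted values (both assumed for illustration):

```python
y     = [2.1, 4.0, 6.2, 7.9, 10.1]         # observed data (toy)
y_hat = [2.08, 4.07, 6.06, 8.05, 10.04]    # model estimates (toy)

n = len(y)
y_bar = sum(y) / n

tss    = sum((yi - y_bar) ** 2 for yi in y)               # Total SS
reg_ss = sum((fi - y_bar) ** 2 for fi in y_hat)           # Regression SS
rss    = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # Residual SS

# Total SS = Regression SS + Residual SS
r_squared = reg_ss / tss
f_stat = reg_ss / (rss / (n - 2))  # compare against F(1, n-2)
```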
what does Y* mean?
Y* is a new (future) observation of Y that we want to predict
prediction interval of Y*
(α̂ + β̂x) ± t_{n−2} · √(σ̂²) · √(1 + 1/n + (x − x̄)² / (Σx² − n·x̄²)), where x̄ and Σx² are taken over the old x values
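The prediction interval can be computed directly from the data; a minimal sketch on toy numbers (the t critical value t_{0.975, 3} ≈ 3.182 is hard-coded as an assumed table lookup, not computed):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy data
y = [2.1, 4.0, 6.2, 7.9, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Fit alpha-hat and beta-hat by least squares.
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
           sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

# sigma^2-hat = SSR / (n - 2)
sigma2_hat = sum((yi - (alpha_hat + beta_hat * xi)) ** 2
                 for xi, yi in zip(x, y)) / (n - 2)

x_new = 6.0
t_crit = 3.182  # t_{n-2} quantile at 95% two-sided, n - 2 = 3 (assumed value)
se = math.sqrt(sigma2_hat) * math.sqrt(
    1 + 1 / n + (x_new - x_bar) ** 2 / (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
)
center = alpha_hat + beta_hat * x_new
lower, upper = center - t_crit * se, center + t_crit * se
```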
residual is .. - …
and good if
dependent variable − fitted value
good if the residuals are randomly scattered and normally distributed
model fitting
estimate alpha, beta, and σ̂²
Predict with 95% interval!
- model fitting
- goodness of fit
- plot the residuals (the more randomly scattered, the better)
- prediction interval
Multicollinearity is
When two or more explanatory variables are highly correlated, making the parameter estimates vague, imprecise, and unreliable.
Adjusted R vs R^2
As the number of explanatory variables increases, R² also increases. However, an overly complex model is not good either, so we introduce adjusted R², which penalizes extra variables
Adjusted R formula
1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables
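A minimal sketch of the adjusted R² formula (the R², n, and k values are toy numbers):

```python
def adjusted_r2(r2, n, k):
    # Penalizes extra explanatory variables via the degrees of freedom:
    # 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.90, n=30, k=3))  # slightly below the raw R^2 of 0.90
```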
Forward selection
- Start with a single explanatory variable and check the adjusted R²
- If adding a variable raises the adjusted R², keep it in the model
Backward selection
- Start with all explanatory variables
- Remove them one by one, checking the adjusted R² after each removal
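Both strategies are greedy searches over adjusted R²; a minimal sketch of forward selection, where `adj_r2` is a hypothetical scoring function assumed to fit a model on the given variables and return its adjusted R²:

```python
def forward_select(candidates, adj_r2):
    # Greedily add the variable that most improves adjusted R^2;
    # stop when no remaining variable improves it.
    chosen, best = [], float("-inf")
    improved = True
    while improved and candidates:
        improved = False
        for var in list(candidates):
            score = adj_r2(chosen + [var])
            if score > best:          # keep a variable only if adjusted R^2 rises
                best, pick = score, var
                improved = True
        if improved:
            chosen.append(pick)
            candidates.remove(pick)
    return chosen
```

Backward selection is the mirror image: start from the full variable list and greedily remove variables while the adjusted R² improves.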