final Flashcards
Univariate regression (numerical response): key equations
y = B0 + B1x1 + error
df = n - 2
Null hypothesis: B1 = 0, i.e., no linear relationship between x and y
Multivariate regression (numerical response): key equations
y = B0 + B1x1 + … + Bpxp + error
df = n - p - 1
Null hypothesis for the whole model: B1 = … = Bp = 0, i.e., no linear relationship
But some of the x's, if individually compared to the response in their own single-predictor models, can still have coefficients != 0
What are the assumptions for OLS?
L - Linearity: the model is linear in the parameters B0 and B1
I - Independent errors: the errors should be independent of one another, so no correlation among the residuals
N - Normality: the errors are normally distributed with a mean of zero ("centered around zero")
E - Equal variance: the error variance is constant (homoscedasticity), the same at every value of x
How does the OLS matrix math work out? (The same form covers both uni- and multivariate cases; X is just an n x (p+1) matrix.)
y = XB + error
with
b = [(X'X)^-1]X'y
where b = [B0; B1; …]
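A minimal NumPy sketch of the normal equations above; the data is simulated purely for illustration:

```python
# Sketch of the OLS normal equations b = (X'X)^-1 X'y with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1); first column = intercept
true_b = np.array([1.0, 2.0, -0.5])
y = X @ true_b + rng.normal(scale=0.3, size=n)

# np.linalg.solve is numerically safer than forming the explicit inverse
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # approximately [1.0, 2.0, -0.5]
```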
Explain what residual deviance is and how to calculate it
Residual deviance measures how far the current model is from the saturated model (the model that perfectly fits every data point in the range)
The lower the deviance, the better the fit
Deviance = -2(L1 - Ls), where L1 is the log-likelihood of the current model and Ls is the log-likelihood of the saturated model
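As a concrete sketch: for a Bernoulli (0/1) response the saturated model's log-likelihood Ls is 0, so the deviance reduces to -2·L1. The y and p_hat values below are made up:

```python
# Sketch: residual deviance for a logistic (Bernoulli) model.
import numpy as np

y = np.array([1, 0, 1, 1, 0])                 # observed 0/1 outcomes
p_hat = np.array([0.8, 0.3, 0.6, 0.9, 0.2])   # model's fitted probabilities

L1 = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
deviance = -2 * L1   # Ls = 0 for a saturated Bernoulli model
print(deviance)
```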
How do you find a critical t-value from a t-table for a regression slope hypothesis test?
Determine the degrees of freedom: df = n - p - 1 (or n - 2 for a single predictor)
Choose the significance level, e.g., alpha = 0.05
If the test is two-tailed, look up alpha/2 in the t-table row for the corresponding df
t-score = (B1_hat - 0) / SE(B1_hat), i.e., (estimate - null value) / SE
The table gives a critical t-value, which we compare to the computed t-score to see if it exceeds the critical value
Calculate the CI with bi ± t*_df × SE(bi)
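A short SciPy sketch of the lookup and the interval; the estimate, standard error, and sample size are hypothetical:

```python
# Sketch: critical t-value, test decision, and CI with SciPy.
from scipy import stats

n, p = 30, 1            # 30 observations, 1 predictor -> df = n - p - 1 = 28
alpha = 0.05
df = n - p - 1

t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value

b1, se_b1 = 2.4, 0.9                      # hypothetical estimate and SE
t_stat = (b1 - 0) / se_b1
reject = abs(t_stat) > t_crit             # reject H0 if |t| exceeds critical value

ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_crit, t_stat, reject, ci)
```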
What is the hypothesis test for a single predictor’s coefficient in a multiple regression model?
Null Hypothesis (H0): Bj=0 so predictor xj has no effect
Alternate Hyp. (HA): Bj != 0
Test statistic: t = Bj_hat / SE(Bj_hat), the same form as the single-predictor case
If |t| > the two-tailed critical t-value from the table, reject H0
What are SSE, SST in linear regression?
SSE (Sum of Squared Errors): unexplained variation.
SSE = sum((yi - yi_hat)^2), the squared distances from each point to the regression line
SST (Total Sum of Squares): total variation in y.
SST = sum((yi - y_bar)^2), the squared distances from each point to the average y_bar
The explained variation is SST - SSE, and R^2 = 1 - SSE/SST
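A quick NumPy sketch of these quantities, with made-up observed and fitted values:

```python
# Sketch: SSE, SST, and R^2 from fitted values.
import numpy as np

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])       # observed
y_hat = np.array([3.0, 4.2, 5.0, 6.2, 7.1])   # fitted by some regression

sse = np.sum((y - y_hat) ** 2)     # unexplained variation
sst = np.sum((y - y.mean()) ** 2)  # total variation
r2 = 1 - sse / sst
print(sse, sst, r2)
```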
How do we detect and address multicollinearity in multiple regression?
Detect via correlation matrix or Variance Inflation Factor (VIF).
VIF_i = 1/(1 - Ri^2), where Ri^2 comes from a regression with predictor i as the response and all the other predictors as explanatory variables
VIF = 1 (R^2 = 0): no correlation
VIF < 5 (R^2 < 0.8): moderate correlation
VIF > 5 (R^2 > 0.8): highly correlated
VIF > 10 (R^2 >= 0.9): significant multicollinearity, so we need to correct it, possibly by grouping two predictors together
Solutions: Remove/merge correlated predictors, use regularization (Ridge), or collect more data.
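A NumPy-only sketch that computes each VIF directly from its definition; the predictors are simulated, with x2 deliberately collinear with x1:

```python
# Sketch: VIF for each predictor by regressing it on the others.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # deliberately collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])            # predictors only, no intercept column

def vif(X, i):
    """VIF_i = 1 / (1 - Ri^2), regressing column i on the remaining columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])   # add intercept
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

print([round(vif(X, i), 1) for i in range(X.shape[1])])  # x1 and x2 come out large
```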
What is forward selection in model building?
Start with no predictors.
Test each available predictor individually, add the one that gives the greatest improvement (e.g., in R^2).
Repeat until adding further predictors fails to significantly improve the model.
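A greedy forward-selection sketch scored by R^2; the helper r2_of and the min_gain threshold are illustrative choices, not a standard API:

```python
# Sketch: forward selection, adding the predictor that most improves R^2.
import numpy as np

def r2_of(X_cols, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + X_cols)
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, min_gain=0.01):
    chosen, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        # score each remaining candidate added to the current set
        scores = {j: r2_of([X[:, k] for k in chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_r2 < min_gain:   # stop when the gain is negligible
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        best_r2 = scores[j_best]
    return chosen, best_r2

# Demo on simulated data where only columns 0 and 2 matter
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)
print(forward_select(X, y))   # expected: columns 0 and 2 chosen
```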
What is backward elimination in model building?
Start with all predictors.
Remove the least significant predictor (highest p-value or minimal improvement in R^2).
Repeat until all remaining predictors are significant or further removal degrades the model.
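A matching backward-elimination sketch, here scored by p-values using statsmodels (library and simulated data assumed for illustration):

```python
# Sketch: backward elimination, dropping the least significant predictor.
import numpy as np
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Drop the highest-p-value predictor until all p-values are below alpha.
    X: 2-D array of predictors (no intercept column)."""
    cols = list(range(X.shape[1]))
    while cols:
        Z = sm.add_constant(X[:, cols])            # intercept + current predictors
        pvals = sm.OLS(y, Z).fit().pvalues[1:]     # skip the intercept's p-value
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:                  # everything significant: stop
            break
        cols.pop(worst)                            # remove the weakest predictor
    return cols

# Demo on simulated data where only columns 0 and 2 matter
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)
print(backward_eliminate(X, y))   # expected: [0, 2] survive
```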
Why are forward selection and backward elimination important?
They’re stepwise approaches to reduce a large set of predictors to a more parsimonious model.
They prevent overfitting by removing variables that add little predictive power.
What is the logistic regression model formula?
ln(p/(1-p)) = B0 + B1x1 + … + Bpxp, with p being the probability of the "success" class
We use this for two-level categorical response variables
Inverting logit(pi) = ln(pi/(1-pi)) above lets us solve for pi = e^B / (1 + e^B)
with B being B0 + B1x1 + … + Bpxp
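A small NumPy sketch of inverting the logit; the coefficients and inputs are made up:

```python
# Sketch: turning a logistic model's linear predictor into a probability.
import numpy as np

b = np.array([-1.0, 0.8, 0.5])   # B0, B1, B2 (hypothetical)
x = np.array([1.0, 2.0, -1.0])   # [1, x1, x2]; the leading 1 pairs with B0

eta = b @ x                          # B = B0 + B1*x1 + B2*x2 (the log-odds)
p = np.exp(eta) / (1 + np.exp(eta))  # pi = e^B / (1 + e^B)
print(eta, p)
```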
How does multi-var inference work?
1) Start with the predictors' coefficients B1…Bp and create an H0 for each of them
2) Set up the p-value thresholds
3) Run the regression to see whether each predictor variable is important on its own in the presence of the other predictors
What is multicollinearity?
A strong correspondence (correlation) between two or more predictor variables
How is multicollinearity explored?
1) Explore the observations: as each predictor variable increases, what does the response variable do in turn?
2) See if there are equations that can link some of the PVs together (each PV on its own could have a low p-value, but when all are combined in one multiple regression they can be highly correlated with one another, resulting in a high degree of collinearity, so the individual interpretations cannot hold)
How is categorical data regressed upon?
Logistic regression, which holds by transformation(pi) = B0 + B1x1 + … + Bpxp
where after the transformation is inverted we can solve for pi, the probability of the "success" event occurring
What do the link functions for logistic-style (generalized linear) regression do?
They connect different categorical/count response types to the linear predictor
logit: ln(pi/(1-pi)), which gives the log-odds (and hence odds ratios)
log: ln(lambda), used for the Poisson distribution, where lambda is the mean number of counts
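A statsmodels sketch of both links in action (the library and the simulated data are assumptions for illustration):

```python
# Sketch: fitting GLMs with the logit and log links in statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(200, 1)))

# Logit link (Binomial family default) for a binary response
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.5, 1.2]))))
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# Log link (Poisson family default) for a count response
y_cnt = rng.poisson(np.exp(X @ [0.2, 0.7]))
pois_fit = sm.GLM(y_cnt, X, family=sm.families.Poisson()).fit()

print(logit_fit.params, pois_fit.params)
```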
What is pi in logistic regression?
The probability of an event occurring, such that P(Yi = 1) = pi given our x's, i.e., the probability of "success"
Why do we use logistic regression instead of linear regression for binary outcomes?
Binary data often violate linear regression assumptions.
Logistic regression constrains predictions between 0 and 1.
The log-odds transformation (logit) is compatible with a wide range of distributions and yields interpretable odds ratios.
How are logistic regression parameters estimated?
By using maximum likelihood: we find the B's that maximize the likelihood of the observed outcomes (in the simplest intercept-only case this is just the sample proportion: # of "successes" / total observations)
The likelihood itself doesn't produce the B's in closed form; they are optimized numerically, and we then plug these optimized B's (the MLE estimates) into the logit to solve for the probability
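A sketch of that numerical optimization, minimizing the negative log-likelihood with SciPy on simulated data:

```python
# Sketch: logistic MLE via scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [0.3, 1.5]))))

def neg_log_lik(b):
    p = 1 / (1 + np.exp(-(X @ b)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

b_hat = minimize(neg_log_lik, x0=np.zeros(2)).x   # MLE estimates of B0, B1
print(b_hat)
```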
Interpret the slope Bj in logistic regression
Bj is the change in the log-odds of success for a 1-unit increase in xj, holding the other predictors constant
The odds ratio for a 1-unit increase in xj is exp(Bj) = e^Bj
What is k-fold cross-validation, and why do we use it in regression?
In k-fold CV, the dataset is split into k folds. We then train on k-1 folds and validate on the left-out fold
We then average the performance, using RMSE or R^2, across the folds
This approach provides a better estimate of the model's performance and helps to detect overfitting
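A scikit-learn sketch of 5-fold CV scored by R^2 (library and simulated data assumed for illustration):

```python
# Sketch: 5-fold cross-validation of a linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ [1.0, -2.0, 0.5] + rng.normal(scale=0.5, size=100)

# Score each fold by R^2 (the default for regressors), then average
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())
```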
How do residual plots help diagnose issues in linear regression?
A random scatter of points suggests that linear assumptions and homoscedasticity might hold.
A pattern (e.g., curved shape, funnels) indicates possible non-linearity or heteroskedasticity.
Systematic patterns can also suggest outliers or missing predictors.
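A matplotlib sketch of a residual-vs-fitted plot; the fitted values and residuals below are simulated so the scatter looks healthy:

```python
# Sketch: residual-vs-fitted diagnostic plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
y_hat = rng.uniform(0, 10, size=100)       # fitted values
resid = rng.normal(scale=0.5, size=100)    # residuals; random scatter = assumptions plausible

plt.scatter(y_hat, resid)
plt.axhline(0, color="red", linestyle="--")  # reference line at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted: look for curves or funnels")
plt.show()
```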