2nd set Flashcards
(33 cards)
What does “Ordinary Least Squares” (OLS) accomplish in linear regression?
OLS finds coefficient estimates (B0, B1, …) that minimize the sum of squared residuals. Concretely, it solves: minimize over (B0, B1) the quantity sum_i (y_i - B0 - B1*x_i)^2,
resulting in a “best fit” line or hyperplane in higher dimensions.
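A minimal sketch of this fit on made-up data, using numpy's least-squares solver (the data and variable names are illustrative, not from the cards):

```python
import numpy as np

# Illustrative data (made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column, so we solve for (B0, B1)
X = np.column_stack([np.ones_like(x), x])

# OLS: minimize the sum of squared residuals ||y - X b||^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
B0, B1 = b
print(B0, B1)  # intercept and slope of the best-fit line
```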
Why do we square the residuals in OLS rather than take absolute values or another measure?
Squaring residuals:
Penalizes larger errors more heavily,
Makes the function differentiable, facilitating a closed-form solution in linear regression,
Aligns with the assumption of normally distributed errors in classical linear models.
How are the OLS estimates B1 and B0 derived in simple linear regression?
Slope: B1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
Intercept: B0 = y_bar - B1*x_bar
These estimates ensure the regression line passes through (x_bar, y_bar).
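The same closed-form estimates can be computed directly; a quick sketch on the same illustrative data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up response

x_bar, y_bar = x.mean(), y.mean()

# Slope: B1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
B1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: B0 = y_bar - B1 * x_bar, so the line passes through (x_bar, y_bar)
B0 = y_bar - B1 * x_bar
print(B0, B1)
```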
What assumptions underlie the classical linear regression model? (LINE)
L = Linearity: the model is linear in the parameters B0 and B1.
I = Independent errors: the errors should be independent of one another, so there is no correlation among the residuals.
N = Normality: the errors are normally distributed with mean zero ("centered around zero"), which underlies the usual inference for B0 and B1.
E = Equal variance: the error variance is constant across all values of x (homoscedasticity).
Describe the basic procedure for bootstrapping regression coefficients.
Sample with replacement from the original dataset to create a “bootstrap” sample of the same size.
Fit the regression to this bootstrap sample and record the estimated coefficients (e.g., B1).
Repeat many times (B = 1000+).
Use the distribution of the B1 values to estimate standard errors and form confidence intervals.
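A minimal sketch of this bootstrap loop in Python (synthetic data; B and the dataset are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset for illustration
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=50)

def fit_slope(x, y):
    """Closed-form OLS slope for simple linear regression."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

B = 1000                      # number of bootstrap samples
n = len(x)
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # sample rows with replacement
    boot_slopes[b] = fit_slope(x[idx], y[idx])

se_B1 = boot_slopes.std(ddof=1)        # bootstrap estimate of SE(B1)
print(se_B1)
```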
How do you form a t-based confidence interval (CI) for a slope using bootstrap estimates?
From your B bootstrap estimates of B1, compute their sample standard deviation; this serves as SE(B1).
Pick a confidence level (e.g., 95%) and find the critical t-value with appropriate degrees of freedom (often n−2 for simple linear regression).
Construct the CI: B1_hat ± t* × SE(B1).
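Continuing the bootstrap sketch above (this reuses x, y, n, fit_slope, and boot_slopes from that block), a t-based 95% CI for the slope:

```python
from scipy import stats

B1_hat = fit_slope(x, y)               # slope fitted on the original sample
se_B1 = boot_slopes.std(ddof=1)        # bootstrap SE from the previous sketch

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # n - 2 df for simple regression

ci = (B1_hat - t_crit * se_B1, B1_hat + t_crit * se_B1)
print(ci)
```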
Why might we prefer bootstrapping methods over standard (theoretical) methods in some regression analyses?
Bootstrapping relaxes some distributional assumptions (normal errors).
It’s more robust when sample sizes are small or when errors are not strictly normal.
It provides a data-driven way to approximate the sampling distribution of the estimator.
What are degrees of freedom (DoF) in a regression context?
In simple linear regression, df = n - 2;
in multiple regression with p predictors plus an intercept, df = n - p - 1.
Conceptually, DoF reflects the number of independent pieces of information remaining after estimating parameters.
How do you find a critical t-value from a t-table for a regression slope hypothesis test?
Determine the degrees of freedom: df = n - p - 1.
Choose a significance level (e.g., alpha = 0.05).
For a two-tailed test, look up alpha/2 in the t-table row for the corresponding df.
The table gives a critical t-value; compare the computed t-statistic against it and reject H0 if abs(t) exceeds the critical value.
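In practice the table lookup can be replaced by scipy; a quick sketch (n and p here are placeholder values):

```python
from scipy import stats

n, p = 50, 3                 # hypothetical sample size and number of predictors
df = n - p - 1
alpha = 0.05

# Two-tailed critical value: put alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(t_crit)                # compare |t-statistic| against this value
```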
What is the likelihood function L(θ), and how do we use it in MLE?
The likelihood function measures how likely the observed data is for a given parameter θ.
MLE maximizes L(θ) (or equivalently, the log-likelihood) to find the best-fitting θ
A common example is the binomial MLE, where p_hat = k/n.
Explain the role of log-likelihood in MLE.
The log-likelihood is ℓ(θ)=lnL(θ).
It transforms products into sums, making it easier to differentiate and find the maximum.
The value that maximizes ℓ(θ) also maximizes L(θ).
What are the main steps in a maximum likelihood estimation procedure?
Specify the likelihood function L(θ) based on the assumed distribution of data.
Take the natural log: ℓ(θ)=lnL(θ)
Differentiate ℓ(θ) w.r.t. θ and set = 0 to find critical points.
Solve for θ and check that it maximizes (rather than minimizes) the function.
Interpret θ in context; compute standard errors, confidence intervals, etc.
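A small sketch of these steps for the binomial example (k successes out of n trials; the numbers are made up), maximizing the log-likelihood numerically and comparing with the closed-form p_hat = k/n:

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 37, 100   # hypothetical: 37 successes in 100 trials

def neg_log_likelihood(p):
    # Binomial log-likelihood up to an additive constant: k*ln(p) + (n-k)*ln(1-p)
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)   # numerical maximizer vs. the closed-form MLE p_hat = k/n
```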
What is the hypothesis test for a single predictor’s coefficient in a multiple regression model?
Null hypothesis (H0): Bj = 0, i.e., predictor xj has no effect.
Alternative hypothesis (HA): Bj != 0.
Test statistic: t = Bj_hat / SE(Bj_hat).
If abs(t) exceeds the two-tailed critical t-value, reject H0.
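A sketch with statsmodels on made-up data; the per-coefficient t-statistics and two-tailed p-values come straight from the fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # made-up predictors x1, x2, x3
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

X_design = sm.add_constant(X)                # adds the intercept column
model = sm.OLS(y, X_design).fit()

print(model.tvalues)     # t = Bj_hat / SE(Bj_hat) for each coefficient
print(model.pvalues)     # two-tailed p-values; small p => reject H0: Bj = 0
```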
How is the F-test used in MULTI-LINEAR regression for overall significance?
Null Hypothesis (H0): All slopes = 0
Alternative (HA): At least one slope != 0
The F-stat compares the model with predictors to a baseline model (just an intercept)
A large F value (and small p-value) indicates the model explains significantly more variance than the baseline.
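A self-contained sketch of reading off the overall F-test from a statsmodels fit (synthetic data for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                 # made-up predictors
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.fvalue)      # F-statistic: full model vs. intercept-only baseline
print(model.f_pvalue)    # small p-value => at least one slope differs from 0
```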
Why does inference rely on assumptions such as normality of residuals?
Many test statistics (t, F) are derived under the assumption that errors follow a normal distribution with mean 0 and constant variance.
This allows us to use known distributions (t and F) to calculate p-values.
If normality is violated, alternative methods (like bootstrapping) or transformations may be needed.
What are SSE, SST, and SSR in linear regression?
SSE (Sum of Squared Errors): unexplained variation.
sum(yi-yi_hat)^2
SST (Total Sum of Squares): total variation in y.
sum(yi-y_bar)^2
SSR (Regression Sum of Squares): explained variation.
sum(yi_hat-y_bar)^2
They relate by: SST=SSR+SSE.
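A quick numeric check of these quantities and the identity SST = SSR + SSE (illustrative data, simple regression via numpy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # made-up data
y = np.array([1.9, 4.2, 5.8, 8.1, 9.7, 12.3])

B1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
B0 = y.mean() - B1 * x.mean()
y_hat = B0 + B1 * x

SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
SST = np.sum((y - y.mean()) ** 2)      # total variation
SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation
print(SST, SSR + SSE)                  # the two should match (up to rounding)
```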
Define R^2 and adjusted R^2. Why might Adjusted R^2 be preferred in multiple regression?
R^2 = 1 - (SSE/SST): the proportion of variance explained.
Adjusted R^2 = 1 - [(SSE/(n-p-1)) / (SST/(n-1))].
Adjusted R^2 penalizes additional predictors, discouraging overfitting and giving a fairer measure of fit in multiple regression.
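A short sketch of both formulas; SSE, SST, n, and p are placeholder values standing in for quantities from a fitted model:

```python
# Illustrative values; in practice take SSE and SST from your own fit
SSE, SST = 2.4, 75.0
n, p = 6, 1                      # sample size and number of predictors

R2 = 1 - SSE / SST
adj_R2 = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
print(R2, adj_R2)                # adjusted R^2 <= R^2; the gap grows with more predictors
```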
How do we detect and address multicollinearity in multiple regression?
Detect via correlation matrix or Variance Inflation Factor (VIF).
VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing predictor x_i on all the other predictors.
VIF = 1 (R_i^2 = 0): no correlation with the other predictors.
VIF < 5 (R_i^2 < 0.8): moderate correlation.
VIF > 5 (R_i^2 > 0.8): high correlation.
VIF > 10 (R_i^2 >= 0.9): serious multicollinearity; correct it, e.g., by combining or dropping predictors.
Solutions: Remove/merge correlated predictors, use regularization (Ridge), or collect more data.
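A sketch of the VIF check using statsmodels' variance_inflation_factor on made-up predictors (x3 is deliberately built to be nearly collinear with x1):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):               # skip the constant column
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.2f}")
```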
What is forward selection in model building?
Start with no predictors.
Test each available predictor individually, add the one that gives the greatest improvement (e.g., in R^2).
Repeat until adding further predictors fails to significantly improve the model.
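A rough sketch of forward selection using adjusted R^2 as the improvement criterion (made-up data; real implementations often use AIC/BIC or p-values instead):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))                          # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=150)

def adj_r2(cols):
    """Adjusted R^2 of an OLS fit using the given predictor columns."""
    design = sm.add_constant(X[:, cols]) if cols else np.ones((len(y), 1))
    return sm.OLS(y, design).fit().rsquared_adj

selected, remaining = [], list(range(X.shape[1]))
current = adj_r2(selected)
while remaining:
    # Try each remaining predictor and keep the best single addition
    scores = {j: adj_r2(selected + [j]) for j in remaining}
    best_j = max(scores, key=scores.get)
    if scores[best_j] <= current:       # no improvement -> stop
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current = scores[best_j]

print(selected)   # indices of the chosen predictors
```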
What is backward elimination in model building?
Start with all predictors.
Remove the least significant predictor (highest p-value or smallest contribution to R^2).
Repeat until all remaining predictors are significant or further removal degrades the model.
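And a matching sketch of backward elimination: drop the predictor with the largest p-value until every remaining p-value is below a threshold (synthetic data; the 0.05 cutoff is a common but arbitrary choice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 5))                          # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=150)

cols = list(range(X.shape[1]))                         # start with all predictors
while cols:
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = fit.pvalues[1:]                            # skip the intercept's p-value
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:                            # everything significant -> stop
        break
    del cols[worst]                                    # drop the least significant predictor

print(cols)   # indices of the predictors that survive
```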
Why are forward selection and backward elimination important?
They’re stepwise approaches to reduce a large set of predictors to a more parsimonious model.
They prevent overfitting by removing variables that add little predictive power.
What is the logistic regression model formula?
ln(p/(1-p)) = B0 + B1*x1 + … + Bp*xp, where p is the probability of the "success" class.
We use this when the response is a two-level categorical variable.
The logit is logit(p_i) = ln(p_i/(1-p_i)); inverting it gives p_i = e^(B0+B1*x1+…+Bp*xp) / (1 + e^(B0+B1*x1+…+Bp*xp)).
Why do we use logistic regression instead of linear regression for binary outcomes?
Binary data often violate linear regression assumptions.
Logistic regression constrains predictions between 0 and 1.
The log-odds transformation (logit) is compatible with a wide range of distributions and yields interpretable odds ratios.
How are logistic regression parameters estimated?
By maximum likelihood: we write down the likelihood of the observed binary outcomes and find the coefficients B0, …, Bp that maximize it (solved numerically, since there is no closed form).
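A minimal sketch of fitting a logistic regression by maximum likelihood with statsmodels (synthetic 0/1 data; Logit's fit() performs the numerical maximization internally):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=200)
log_odds = -0.5 + 1.5 * x                      # true B0 and B1 for the simulation
p = 1 / (1 + np.exp(-log_odds))
y = rng.binomial(1, p)                         # binary "success"/"failure" outcomes

model = sm.Logit(y, sm.add_constant(x)).fit()  # maximizes the log-likelihood
print(model.params)                            # estimated B0, B1 (log-odds scale)
print(np.exp(model.params[1]))                 # odds ratio for a one-unit change in x
```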