2nd set Flashcards
(33 cards)
What does “Ordinary Least Squares” (OLS) accomplish in linear regression?
OLS finds coefficient estimates (B0, B1, …) that minimize the sum of squared residuals. Concretely, it solves: minimize over (B0, B1) the quantity sum_i (y_i - B0 - B1*x_i)^2,
resulting in a “best fit” line or hyperplane in higher dimensions.
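A minimal sketch of this fit on made-up data, using numpy's least-squares solver (the data and variable names are illustrative, not from the cards):

```python
import numpy as np

# Illustrative data (made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column, so we solve for (B0, B1)
X = np.column_stack([np.ones_like(x), x])

# OLS: minimize the sum of squared residuals ||y - X b||^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
B0, B1 = b
print(B0, B1)  # intercept and slope of the best-fit line
```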
Why do we square the residuals in OLS rather than take absolute values or another measure?
Squaring residuals:
Penalizes larger errors more heavily,
Makes the function differentiable, facilitating a closed-form solution in linear regression,
Aligns with the assumption of normally distributed errors in classical linear models.
How are the OLS estimates B1 and B0 derived in simple linear regression?
Slope: B1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
Intercept: B0 = y_bar - B1*x_bar
These estimates ensure the regression line passes through (x_bar, y_bar).
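The same closed-form estimates can be computed directly; a quick sketch on the same illustrative data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up response

x_bar, y_bar = x.mean(), y.mean()

# Slope: B1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
B1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: B0 = y_bar - B1 * x_bar, so the line passes through (x_bar, y_bar)
B0 = y_bar - B1 * x_bar
print(B0, B1)
```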
What assumptions underlie the classical linear regression model? (LINE)
L = Linearity: the model is linear in the parameters B0 and B1.
I = Independent errors: the errors should be independent of one another, so there is no correlation among the residuals.
N = Normality: the errors are normally distributed with mean zero ("centered around zero"), which underlies the usual inference for B0 and B1.
E = Equal variance: the error variance is constant across all values of x (homoscedasticity).
Describe the basic procedure for bootstrapping regression coefficients.
Sample with replacement from the original dataset to create a “bootstrap” sample of the same size.
Fit the regression to this bootstrap sample and record the estimated coefficients (e.g., B1).
Repeat many times (B = 1000+).
Use the distribution of the B1 values to estimate standard errors and form confidence intervals.
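A minimal sketch of this bootstrap loop in Python (synthetic data; B and the dataset are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset for illustration
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=50)

def fit_slope(x, y):
    """Closed-form OLS slope for simple linear regression."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

B = 1000                      # number of bootstrap samples
n = len(x)
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # sample rows with replacement
    boot_slopes[b] = fit_slope(x[idx], y[idx])

se_B1 = boot_slopes.std(ddof=1)        # bootstrap estimate of SE(B1)
print(se_B1)
```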
How do you form a t-based confidence interval (CI) for a slope using bootstrap estimates?
From your B bootstrap estimates of B1, compute their sample standard deviation; this serves as SE(B1).
Pick a confidence level (e.g., 95%) and find the critical t-value with appropriate degrees of freedom (often n−2 for simple linear regression).
Construct the CI: B1_hat ± t* × SE(B1).
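Continuing the bootstrap sketch above (this reuses x, y, n, fit_slope, and boot_slopes from that block), a t-based 95% CI for the slope:

```python
from scipy import stats

B1_hat = fit_slope(x, y)               # slope fitted on the original sample
se_B1 = boot_slopes.std(ddof=1)        # bootstrap SE from the previous sketch

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # n - 2 df for simple regression

ci = (B1_hat - t_crit * se_B1, B1_hat + t_crit * se_B1)
print(ci)
```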
Why might we prefer bootstrapping methods over standard (theoretical) methods in some regression analyses?
Bootstrapping relaxes some distributional assumptions (normal errors).
It’s more robust when sample sizes are small or when errors are not strictly normal.
It provides a data-driven way to approximate the sampling distribution of the estimator.
What are degrees of freedom (DoF) in a regression context?
In simple linear regression, df = n - 2;
in multiple regression with p predictors plus an intercept, df = n - p - 1.
Conceptually, DoF reflects the number of independent pieces of information remaining after estimating parameters.
How do you find a critical t-value from a t-table for a regression slope hypothesis test?
Determine the degrees of freedom: df = n - p - 1.
Choose a significance level (e.g., alpha = 0.05).
For a two-tailed test, look up alpha/2 in the t-table row for the corresponding df.
The table gives a critical t-value; compare the computed t-statistic against it and reject H0 if abs(t) exceeds the critical value.
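In practice the table lookup can be replaced by scipy; a quick sketch (n and p here are placeholder values):

```python
from scipy import stats

n, p = 50, 3                 # hypothetical sample size and number of predictors
df = n - p - 1
alpha = 0.05

# Two-tailed critical value: put alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(t_crit)                # compare |t-statistic| against this value
```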
What is the likelihood function L(θ), and how do we use it in MLE?
The likelihood function measures how likely the observed data is for a given parameter θ.
MLE maximizes L(θ) (or equivalently, the log-likelihood) to find the best-fitting θ
A common example is the binomial MLE, where p_hat = k/n.
Explain the role of log-likelihood in MLE.
The log-likelihood is ℓ(θ)=lnL(θ).
It transforms products into sums, making it easier to differentiate and find the maximum.
The value that maximizes ℓ(θ) also maximizes L(θ).
What are the main steps in a maximum likelihood estimation procedure?
Specify the likelihood function L(θ) based on the assumed distribution of data.
Take the natural log: ℓ(θ)=lnL(θ)
Differentiate ℓ(θ) w.r.t. θ and set = 0 to find critical points.
Solve for θ and check that it maximizes (rather than minimizes) the function.
Interpret θ in context; compute standard errors, confidence intervals, etc.
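A small sketch of these steps for the binomial example (k successes out of n trials; the numbers are made up), maximizing the log-likelihood numerically and comparing with the closed-form p_hat = k/n:

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 37, 100   # hypothetical: 37 successes in 100 trials

def neg_log_likelihood(p):
    # Binomial log-likelihood up to an additive constant: k*ln(p) + (n-k)*ln(1-p)
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)   # numerical maximizer vs. the closed-form MLE p_hat = k/n
```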
What is the hypothesis test for a single predictor’s coefficient in a multiple regression model?
Null hypothesis (H0): Bj = 0, i.e., predictor xj has no effect.
Alternative hypothesis (HA): Bj != 0.
Test statistic: t = Bj_hat / SE(Bj_hat).
If abs(t) exceeds the two-tailed critical t-value, reject H0.
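A sketch with statsmodels on made-up data; the per-coefficient t-statistics and two-tailed p-values come straight from the fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # made-up predictors x1, x2, x3
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

X_design = sm.add_constant(X)                # adds the intercept column
model = sm.OLS(y, X_design).fit()

print(model.tvalues)     # t = Bj_hat / SE(Bj_hat) for each coefficient
print(model.pvalues)     # two-tailed p-values; small p => reject H0: Bj = 0
```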
How is the F-test used in MULTI-LINEAR regression for overall significance?
Null Hypothesis (H0): All slopes = 0
Alternative (HA): At least one slope != 0
The F-stat compares the model with predictors to a baseline model (just an intercept)
A large F value (and small p-value) indicates the model explains significantly more variance than the baseline.
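A self-contained sketch of reading off the overall F-test from a statsmodels fit (synthetic data for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                 # made-up predictors
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.fvalue)      # F-statistic: full model vs. intercept-only baseline
print(model.f_pvalue)    # small p-value => at least one slope differs from 0
```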
Why does inference rely on assumptions such as normality of residuals?
Many test statistics (t, F) are derived under the assumption that errors follow a normal distribution with mean 0 and constant variance.
This allows us to use known distributions (t and F) to calculate p-values.
If normality is violated, alternative methods (like bootstrapping) or transformations may be needed.
What are SSE, SST, and SSR in linear regression?
SSE (Sum of Squared Errors): unexplained variation.
sum(yi-yi_hat)^2
SST (Total Sum of Squares): total variation in y.
sum(yi-y_bar)^2
SSR (Regression Sum of Squares): explained variation.
sum(yi_hat-y_bar)^2
They relate by: SST=SSR+SSE.
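A quick numeric check of these quantities and the identity SST = SSR + SSE (illustrative data, simple regression via numpy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # made-up data
y = np.array([1.9, 4.2, 5.8, 8.1, 9.7, 12.3])

B1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
B0 = y.mean() - B1 * x.mean()
y_hat = B0 + B1 * x

SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
SST = np.sum((y - y.mean()) ** 2)      # total variation
SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation
print(SST, SSR + SSE)                  # the two should match (up to rounding)
```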
Define R^2 and adjusted R^2. Why might Adjusted R^2 be preferred in multiple regression?
R^2 = 1 - (SSE/SST): the proportion of variance explained.
Adjusted R^2 = 1 - [(SSE/(n-p-1)) / (SST/(n-1))].
Adjusted R^2 penalizes additional predictors, discouraging overfitting and giving a fairer measure of fit in multiple regression.
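A short sketch of both formulas; SSE, SST, n, and p are placeholder values standing in for quantities from a fitted model:

```python
# Illustrative values; in practice take SSE and SST from your own fit
SSE, SST = 2.4, 75.0
n, p = 6, 1                      # sample size and number of predictors

R2 = 1 - SSE / SST
adj_R2 = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))
print(R2, adj_R2)                # adjusted R^2 <= R^2; the gap grows with more predictors
```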
How do we detect and address multicollinearity in multiple regression?
Detect via correlation matrix or Variance Inflation Factor (VIF).
VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing predictor x_i on all the other predictors.
VIF = 1 (R_i^2 = 0): no correlation with the other predictors.
VIF < 5 (R_i^2 < 0.8): moderate correlation.
VIF > 5 (R_i^2 > 0.8): high correlation.
VIF > 10 (R_i^2 >= 0.9): serious multicollinearity; correct it, e.g., by combining or dropping predictors.
Solutions: Remove/merge correlated predictors, use regularization (Ridge), or collect more data.
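A sketch of the VIF check using statsmodels' variance_inflation_factor on made-up predictors (x3 is deliberately built to be nearly collinear with x1):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):               # skip the constant column
    print(f"x{i}: VIF = {variance_inflation_factor(X, i):.2f}")
```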
What is forward selection in model building?
Start with no predictors.
Test each available predictor individually, add the one that gives the greatest improvement (e.g., in R^2).
Repeat until adding further predictors fails to significantly improve the model.
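A rough sketch of forward selection using adjusted R^2 as the improvement criterion (made-up data; real implementations often use AIC/BIC or p-values instead):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))                          # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=150)

def adj_r2(cols):
    """Adjusted R^2 of an OLS fit using the given predictor columns."""
    design = sm.add_constant(X[:, cols]) if cols else np.ones((len(y), 1))
    return sm.OLS(y, design).fit().rsquared_adj

selected, remaining = [], list(range(X.shape[1]))
current = adj_r2(selected)
while remaining:
    # Try each remaining predictor and keep the best single addition
    scores = {j: adj_r2(selected + [j]) for j in remaining}
    best_j = max(scores, key=scores.get)
    if scores[best_j] <= current:       # no improvement -> stop
        break
    selected.append(best_j)
    remaining.remove(best_j)
    current = scores[best_j]

print(selected)   # indices of the chosen predictors
```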
What is backward elimination in model building?
Start with all predictors.
Remove the least significant predictor (highest p-value or smallest contribution to R^2).
Repeat until all remaining predictors are significant or further removal degrades the model.
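And a matching sketch of backward elimination: drop the predictor with the largest p-value until every remaining p-value is below a threshold (synthetic data; the 0.05 cutoff is a common but arbitrary choice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 5))                          # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=150)

cols = list(range(X.shape[1]))                         # start with all predictors
while cols:
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = fit.pvalues[1:]                            # skip the intercept's p-value
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:                            # everything significant -> stop
        break
    del cols[worst]                                    # drop the least significant predictor

print(cols)   # indices of the predictors that survive
```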
Why are forward selection and backward elimination important?
They’re stepwise approaches to reduce a large set of predictors to a more parsimonious model.
They prevent overfitting by removing variables that add little predictive power.
What is the logistic regression model formula?
ln(p/(1-p)) = B0 + B1*x1 + … + Bp*xp, where p is the probability of the "success" class.
We use this when the response is a two-level categorical variable.
The logit is logit(p_i) = ln(p_i/(1-p_i)); inverting it gives p_i = e^(B0+B1*x1+…+Bp*xp) / (1 + e^(B0+B1*x1+…+Bp*xp)).
Why do we use logistic regression instead of linear regression for binary outcomes?
Binary data often violate linear regression assumptions.
Logistic regression constrains predictions between 0 and 1.
The log-odds transformation (logit) is compatible with a wide range of distributions and yields interpretable odds ratios.
How are logistic regression parameters estimated?
By maximum likelihood: we write down the likelihood of the observed binary outcomes and find the coefficients B0, …, Bp that maximize it (solved numerically, since there is no closed form).
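A minimal sketch of fitting a logistic regression by maximum likelihood with statsmodels (synthetic 0/1 data; Logit's fit() performs the numerical maximization internally):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=200)
log_odds = -0.5 + 1.5 * x                      # true B0 and B1 for the simulation
p = 1 / (1 + np.exp(-log_odds))
y = rng.binomial(1, p)                         # binary "success"/"failure" outcomes

model = sm.Logit(y, sm.add_constant(x)).fit()  # maximizes the log-likelihood
print(model.params)                            # estimated B0, B1 (log-odds scale)
print(np.exp(model.params[1]))                 # odds ratio for a one-unit change in x
```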