Flashcard 7 (35 cards)

1
Q

What is a statistical model, and why are regression models commonly used?

A

A statistical model is a simplified representation of reality that characterizes the association structure in data. Regression models are common because they explicitly model how one or more predictor (independent) variables relate to an outcome (dependent) variable, allowing quantification of those relationships.

2
Q

How does simple linear regression differ from multiple linear regression?

A

Simple linear regression uses one predictor variable to model the outcome as a linear function, whereas multiple linear regression includes two or more predictors, still assuming a linear relationship between each predictor and the outcome.
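
In standard notation (generic symbols, not tied to any particular textbook), the two model forms can be written as:

```latex
% Simple linear regression: one predictor
Y = \alpha + \beta X + \varepsilon

% Multiple linear regression: k predictors
Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon
```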

3
Q

What key assumption underlies the use of linear regression regarding the relationship between variables?

A

The primary assumption is that the expected value of the outcome (Y) is a linear function of the predictor(s) (X), meaning that each one-unit change in X corresponds, on average, to the same constant change in Y (the slope).

4
Q

In linear regression, what do the intercept (α) and slope (β) represent?

A

The intercept (α) represents the expected value of the outcome when all predictors equal zero. Each slope (β) represents the expected change in the outcome for a one-unit increase in the corresponding predictor, holding other predictors constant.

5
Q

Why is it important that the error terms (ε) in linear regression are independent and identically distributed with mean zero?

A

Independence ensures that errors for different observations do not influence each other, and identically distributed with mean zero implies that deviations around the fitted line are random, unbiased, and have constant variance; violating these can lead to incorrect inferences.

6
Q

What does homoscedasticity mean in the context of linear regression?

A

Homoscedasticity means that the variance of the error terms (residuals) remains constant across all levels of the predictor variables. In other words, the spread of residuals does not systematically increase or decrease as fitted values change.

7
Q

Why must predictor variables in a multiple linear regression not be highly correlated with one another?

A

High correlation among predictors (multicollinearity) inflates the variance of coefficient estimates, making them unstable and difficult to interpret, and can obscure the unique contribution of each predictor.
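
As an illustration, multicollinearity is often quantified with the variance inflation factor (VIF); below is a minimal numpy sketch of the definition (the predictor matrix X is hypothetical input, not tied to any particular library routine):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the predictor matrix X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing X[:, j]
    on the remaining predictors (with an intercept). Values well above
    roughly 5-10 are commonly read as a sign of problematic collinearity."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares fit
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)
```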

8
Q

How is the coefficient of determination (R²) interpreted in linear regression?

A

R² represents the proportion of total variability in the outcome that is explained by the fitted regression model.

For example, R² = 0.60 means 60% of the outcome’s variance is accounted for by the predictors.
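
In formula form (the standard definition):

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
```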

9
Q

What purpose does the adjusted R² serve compared to the ordinary R²?

A

Adjusted R² penalizes R² for adding predictors that do not meaningfully improve model fit. It adjusts for the number of predictors relative to sample size, helping to prevent overfitting by reflecting model complexity.
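
The usual adjustment, with n observations and p predictors, is:

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```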

10
Q

Why can linear regression be seen as a generalization of the t-test and ANOVA?

A

A t-test compares the means of two groups, and ANOVA compares means across multiple groups. In linear regression, categorical group membership can be encoded as indicator variables, so a regression model with a group indicator tests the same mean differences.

11
Q

What is the role of the standard error (SE) of a regression coefficient?

A

The standard error measures how precisely the regression coefficient is estimated, reflecting the variability of the estimate across hypothetical repeated samples. Smaller SE indicates more precise estimation.

12
Q

How do you use a t-test within linear regression to assess whether a coefficient is statistically significant?

A

You compute a t-statistic as (β / SE(β)) and compare it to the appropriate t-distribution. A large magnitude of t (and corresponding low p-value) indicates the coefficient differs significantly from zero, suggesting a meaningful association.
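
A minimal sketch of the calculation (the coefficient, standard error, and degrees of freedom below are invented for illustration; in practice they come from the fitted model):

```python
from scipy import stats

beta_hat = 0.42   # estimated coefficient (hypothetical)
se_beta = 0.15    # its standard error (hypothetical)
df = 48           # residual degrees of freedom, n - p - 1 (hypothetical)

t_stat = beta_hat / se_beta                  # test statistic for H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value
print(t_stat, p_value)                       # t = 2.8, p ≈ 0.007
```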

13
Q

Describe how to interpret a 95% confidence interval (CI) for a regression coefficient.

A

A 95% CI is constructed so that, under repeated sampling, 95% of such intervals would contain the true coefficient. If the CI does not include zero, it indicates a statistically significant association at the 5% level.

14
Q

Why is the value 1.96 used when constructing a 95% CI for a normally distributed estimate?

A

In a standard normal distribution, approximately 95% of values lie within ±1.96 standard deviations from the mean. Thus, multiplying SE by 1.96 produces the bounds for a 95% CI.
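
This value can be checked numerically (a quick sketch using scipy):

```python
from scipy.stats import norm

z = norm.ppf(0.975)   # upper 2.5% point of the standard normal distribution
print(z)              # 1.959964..., conventionally rounded to 1.96
```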

15
Q

How can a standard curve (calibration curve) in analytical chemistry rely on linear regression?

A

A standard curve is formed by measuring a known analyte’s signal (e.g., absorbance) at various concentrations. Linear regression fits a straight line relating concentration to signal. Unknown samples’ signals are then placed on the fitted line to estimate their concentrations.
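
A minimal sketch of fitting and inverting a standard curve with numpy (the concentration and absorbance values are made up for illustration):

```python
import numpy as np

# Known standards (hypothetical values)
conc = np.array([0.0, 2.0, 4.0, 6.0, 8.0])          # concentrations
signal = np.array([0.02, 0.21, 0.39, 0.60, 0.81])   # measured absorbances

# Fit signal = intercept + slope * conc
slope, intercept = np.polyfit(conc, signal, deg=1)

# Invert the fitted line to estimate the concentration of an unknown sample
unknown_signal = 0.50
estimated_conc = (unknown_signal - intercept) / slope
print(estimated_conc)
```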

16
Q

What distinguishes logistic regression from linear regression in terms of the outcome variable?

A

Logistic regression models a binary or categorical outcome (e.g., disease yes/no), whereas linear regression models a continuous outcome. Logistic regression uses the logit (log-odds) link to ensure predicted probabilities range between 0 and 1.

17
Q

In logistic regression, what does the logit function (log-odds) accomplish?

A

The logit function transforms a probability p∈(0,1) to the entire real line (–∞, +∞) via log(p / (1–p)). This linearizes the relationship between predictors and log-odds, allowing fitting with standard linear model techniques.
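
A small sketch of the logit and its inverse (the logistic function):

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

print(logit(0.8))        # ≈ 1.386
print(inv_logit(1.386))  # ≈ 0.8
```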

18
Q

How is an odds ratio (OR) derived from a logistic regression coefficient (β)?

A

The odds ratio for a one-unit increase in a predictor is exp(β). If β = 0.5, then OR = exp(0.5) ≈ 1.65, meaning the odds of the outcome are multiplied by about 1.65 for each one-unit increase in that predictor.
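
A quick check of the arithmetic, including the usual way a CI for the OR is obtained by exponentiating the CI for β (the standard error here is invented for illustration):

```python
import math

beta = 0.5
se = 0.2   # hypothetical standard error of beta

or_point = math.exp(beta)              # ≈ 1.65
ci_low = math.exp(beta - 1.96 * se)    # ≈ 1.11
ci_high = math.exp(beta + 1.96 * se)   # ≈ 2.44
print(or_point, ci_low, ci_high)
```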

19
Q

When is an odds ratio (OR) considered statistically significant based on its 95% CI?

A

If the 95% CI for the OR does not include 1, the association is statistically significant at the 5% level. Inclusion of 1 implies no significant difference in odds between comparison groups.

20
Q

Why are odds ratios used in retrospective (case–control) studies instead of risk ratios?

A

In case–control studies, participants are selected based on outcome status, so the number at risk in each exposure group (and hence absolute risk) cannot be determined. However, the odds of exposure among cases and among controls are observable, allowing the OR to be computed.

21
Q

Under what conditions can risk ratios (RR) be used instead of odds ratios (OR)?

A

Risk ratios can be calculated in prospective (cohort) studies where both the number of events and total individuals at risk in each exposure group are known, allowing direct computation of incidence proportions.

22
Q

How do odds ratios and risk ratios relate when the event of interest is rare?

A

When the event is rare (i.e., incidence is very low), the numerical values of OR and RR are very close because odds and probabilities converge when probabilities are small.
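
A worked example with invented counts from a hypothetical cohort, showing how the two measures converge when the event is rare:

```python
# Hypothetical 2x2 table: a rare event in exposed vs. unexposed groups
exposed_events, exposed_total = 20, 10_000
unexposed_events, unexposed_total = 10, 10_000

risk_exposed = exposed_events / exposed_total         # 0.002
risk_unexposed = unexposed_events / unexposed_total   # 0.001
rr = risk_exposed / risk_unexposed                    # 2.00

odds_exposed = exposed_events / (exposed_total - exposed_events)
odds_unexposed = unexposed_events / (unexposed_total - unexposed_events)
odds_ratio = odds_exposed / odds_unexposed            # ≈ 2.002, nearly identical to the RR
print(rr, odds_ratio)
```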

23
Q

What key assumptions are shared by both linear and logistic regression models?

A

Both models assume that observations are independent, that predictors are measured without error, and that the specified link function (identity for linear, logit for logistic) correctly describes how predictors relate to the transformed outcome.

24
Q

Why is it incorrect to infer causation solely from an observed association in regression or correlation analyses?

A

Regression and correlation reveal statistical association but do not control for unmeasured confounders or establish temporal precedence. Establishing causation requires controlled experimental design, such as randomized controlled trials, to rule out alternative explanations.

25
Q

How does multicollinearity affect the interpretation of logistic regression coefficients?

A

Multicollinearity (high correlation between predictors) inflates the standard errors of logistic regression coefficients, making OR estimates unstable and reducing confidence in interpreting any single predictor’s effect.

26
Q

Explain why logistic regression uses maximum likelihood estimation rather than ordinary least squares.

A

Because the outcome is binary, residuals are not normally distributed around a continuous response. Maximum likelihood estimation finds the parameter values that maximize the probability of observing the given binary outcomes under the logistic model.
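
Concretely, the fit maximizes the (log-)likelihood of the observed 0/1 outcomes; in standard notation:

```latex
\ell(\alpha, \beta) = \sum_{i=1}^{n} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big],
\qquad p_i = \frac{1}{1 + e^{-(\alpha + \beta x_i)}}
```
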
27
Q

What is the practical difference between a probability and an odds when interpreting logistic regression?

A

Probability is the chance of an event happening ([0,1] range). Odds are the ratio of the probability of the event to the probability of no event (p / (1–p)). Odds greater than 1 indicate p > 0.5, and vice versa.

28
Q

How do confidence intervals around odds ratios inform the reliability of estimated associations?

A

The CI width reflects uncertainty: a narrow CI indicates a precise OR estimate, while a wide CI indicates less precision. If the entire CI lies above (or below) 1, it indicates a reliably increased (or decreased) odds of the outcome.

29
Q

Why might logistic regression include continuous and categorical predictors simultaneously?

A

Logistic regression can model a binary outcome as a function of any mix of predictor types, allowing adjustment for confounding variables, capturing nuanced relationships, and improving predictive accuracy by using all relevant information.

30
Q

What does an R²-like measure (e.g., pseudo-R²) indicate in logistic regression?

A

Pseudo-R² measures (such as McFadden’s R²) quantify how well the logistic model explains outcome variability in a way analogous to linear R². Although they do not have exactly the same interpretation, higher values indicate better model fit.
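
For example, McFadden’s pseudo-R² compares the fitted model’s log-likelihood with that of an intercept-only (null) model:

```latex
R^2_{\text{McFadden}} = 1 - \frac{\log L_{\text{model}}}{\log L_{\text{null}}}
```
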
31
Q

How does one interpret a regression coefficient of zero in both linear and logistic regression?

A

In linear regression, a coefficient of zero implies no change in the continuous outcome per unit change in predictor (no association). In logistic regression, a coefficient of zero implies an OR of 1 (no change in odds of the outcome), similarly indicating no association.

32
Q

When constructing a standard curve via linear regression, why must one check that the relationship between concentration and signal is approximately linear?

A

Because linear regression assumes a linear relationship; if the calibration points do not follow a straight line (e.g., when the signal saturates at very high concentrations), the fitted line will not accurately predict unknown concentrations, leading to biased estimates.

33
Q

Why is the concept “association does not imply causation” emphasized when teaching regression and correlation?

A

Because it reminds practitioners that observed statistical relationships might arise from confounding variables, reverse causality, or chance, and that only controlled experimental designs can rigorously establish a causal link.

34
Q

How can one use the fitted logistic regression model to predict an individual’s probability of an outcome?

A

First calculate the linear predictor (η = α + β₁ X₁ + … + βₖ Xₖ), then transform via the logistic function: probability = 1 / [1 + exp(–η)]. This yields a value between 0 and 1 representing the predicted likelihood of the event.
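
A minimal sketch of the calculation (the coefficients and covariate values are invented for illustration):

```python
import math

# Hypothetical fitted coefficients and one individual's predictor values
alpha = -4.0
betas = [0.05, 0.80]
x = [55, 1]

eta = alpha + sum(b * xi for b, xi in zip(betas, x))   # linear predictor
prob = 1 / (1 + math.exp(-eta))                        # logistic transform
print(prob)                                            # predicted probability of the event
```
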
35
Q

Explain why one would examine confidence intervals of regression coefficients rather than rely solely on p-values.

A

Confidence intervals convey both statistical significance (via inclusion/exclusion of a null value) and the magnitude and precision of effect estimates. P-values alone do not show how large or precise an effect is, only whether it is unlikely to be zero by chance.