flashcard 7
What is a statistical model, and why are regression models commonly used?
A statistical model is a simplified representation of reality that characterizes the association structure in data. Regression models are common because they explicitly model how one or more predictor (independent) variables relate to an outcome (dependent) variable, allowing quantification of those relationships.
How does simple linear regression differ from multiple linear regression?
Simple linear regression uses one predictor variable to model the outcome as a linear function, whereas multiple linear regression includes two or more predictors, still assuming a linear relationship between each predictor and the outcome.
What key assumption underlies the use of linear regression regarding the relationship between variables?
The primary assumption is that the expected value of the outcome (Y) is a linear function of the predictor(s) (X), meaning that a one-unit change in X corresponds to a constant average change in Y.
In linear regression, what do the intercept (α) and slope (β) represent?
The intercept (α) represents the expected value of the outcome when all predictors equal zero. Each slope (β) represents the expected change in the outcome for a one-unit increase in the corresponding predictor, holding other predictors constant.
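An illustrative sketch (not part of the card itself; the simulated data and the choice of statsmodels are assumptions) of how the intercept and slope are read off a fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate y = 2 + 3*x + noise, so the fit should recover alpha ≈ 2 and beta ≈ 3.
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(0, 1, size=200)

X = sm.add_constant(x)          # adds the intercept column
fit = sm.OLS(y, X).fit()

alpha_hat, beta_hat = fit.params
print(f"intercept (alpha): {alpha_hat:.2f}")  # expected y when x = 0
print(f"slope (beta):      {beta_hat:.2f}")   # average change in y per one-unit increase in x
```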
Why is it important that the error terms (ε) in linear regression are independent and identically distributed with mean zero?
Independence ensures that errors for different observations do not influence each other, and the identical-distribution and zero-mean conditions imply that deviations around the fitted line are random, unbiased, and have constant variance; violating these assumptions can lead to incorrect inferences (e.g., misleading standard errors and p-values).
What does homoscedasticity mean in the context of linear regression?
Homoscedasticity means that the variance of the error terms (residuals) remains constant across all levels of the predictor variables. In other words, the spread of residuals does not systematically increase or decrease as fitted values change.
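A common informal check is a residuals-versus-fitted plot; a minimal sketch, assuming simulated data and matplotlib/statsmodels as the tools:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Under homoscedasticity the residuals form a roughly constant horizontal band;
# a funnel shape (spread widening with fitted values) suggests heteroscedasticity.
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```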
Why must predictor variables in a multiple linear regression not be highly correlated with one another?
High correlation among predictors (multicollinearity) inflates the variance of coefficient estimates, making them unstable and difficult to interpret, and can obscure the unique contribution of each predictor.
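A standard diagnostic for multicollinearity is the variance inflation factor (VIF); a minimal sketch, assuming statsmodels and simulated predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1 -> strong collinearity
x3 = rng.normal(size=300)                   # independent predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
# x1 and x2 show very large VIFs (a common rule of thumb flags VIF > 5-10),
# while x3 stays near 1.
```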
How is the coefficient of determination (R²) interpreted in linear regression?
R² represents the proportion of total variability in the outcome that is explained by the fitted regression model.
For example, R² = 0.60 means 60% of the outcome’s variance is accounted for by the predictors.
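As a worked definition (the helper name and example numbers are my own):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total: the share of outcome variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)         # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total variation around the mean
    return 1 - ss_res / ss_tot

# Predictions that track y closely give an R^2 near 1.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
print(r_squared(y, y_hat))   # ≈ 0.99
```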
What purpose does the adjusted R² serve compared to the ordinary R²?
Adjusted R² penalizes R² for adding predictors that do not meaningfully improve model fit. It adjusts for the number of predictors relative to sample size, helping to prevent overfitting by reflecting model complexity.
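The usual formula, with n observations and p predictors (a worked sketch; the example numbers are hypothetical):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the sample size and p the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R^2 of 0.60 is penalized more when many predictors are used
# relative to the sample size:
print(adjusted_r_squared(0.60, n=100, p=2))    # ≈ 0.592
print(adjusted_r_squared(0.60, n=100, p=20))   # ≈ 0.499
```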
Why can linear regression be seen as a generalization of the t-test and ANOVA?
A t-test compares the means of two groups, and ANOVA compares means across multiple groups. In linear regression, categorical group membership can be encoded as indicator variables, so a regression model with a group indicator tests the same mean differences.
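A small sketch showing the two-group equivalence (simulated data; scipy and statsmodels are my tool choices):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(10, 2, size=50)
group_b = rng.normal(12, 2, size=50)

# Classic two-sample (pooled-variance) t-test...
t_stat, p_val = stats.ttest_ind(group_a, group_b)

# ...and the same comparison as a regression on a 0/1 group indicator.
y = np.concatenate([group_a, group_b])
indicator = np.concatenate([np.zeros(50), np.ones(50)])
fit = sm.OLS(y, sm.add_constant(indicator)).fit()

print(t_stat, p_val)
print(fit.tvalues[1], fit.pvalues[1])   # matches the t-test (up to sign)
```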
What is the role of the standard error (SE) of a regression coefficient?
The standard error measures how precisely the regression coefficient is estimated, reflecting the variability of the estimate across hypothetical repeated samples. Smaller SE indicates more precise estimation.
How do you use a t-test within linear regression to assess whether a coefficient is statistically significant?
You compute a t-statistic as t = β̂ / SE(β̂) and compare it to a t-distribution with the model's residual degrees of freedom (n − p − 1 for p predictors plus an intercept). A large magnitude of t (and correspondingly small p-value) indicates the coefficient differs significantly from zero, suggesting a meaningful association.
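A minimal worked example with hypothetical numbers (the estimate, standard error, and degrees of freedom are made up for illustration):

```python
from scipy import stats

beta_hat = 0.8   # hypothetical estimated coefficient
se_beta = 0.3    # hypothetical standard error
df = 97          # hypothetical residual degrees of freedom (n - p - 1)

t = beta_hat / se_beta                 # ≈ 2.67
p_value = 2 * stats.t.sf(abs(t), df)   # two-sided p-value ≈ 0.009
print(t, p_value)
```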
Describe how to interpret a 95% confidence interval (CI) for a regression coefficient.
A 95% CI is constructed so that, under repeated sampling, 95% of such intervals would contain the true coefficient. If the CI does not include zero, it indicates a statistically significant association at the 5% level.
Why is the value 1.96 used when constructing a 95% CI for a normally distributed estimate?
In a standard normal distribution, approximately 95% of values lie within ±1.96 standard deviations from the mean. Thus, multiplying SE by 1.96 produces the bounds for a 95% CI.
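A worked sketch combining the two previous cards, using hypothetical estimates:

```python
beta_hat = 0.8   # hypothetical coefficient estimate
se_beta = 0.3    # hypothetical standard error

lower = beta_hat - 1.96 * se_beta   # ≈ 0.21
upper = beta_hat + 1.96 * se_beta   # ≈ 1.39
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
# The interval excludes zero, so the coefficient is significant at the 5% level.
```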
How can a standard curve (calibration curve) in analytical chemistry rely on linear regression?
A standard curve is formed by measuring the signal (e.g., absorbance) of standards prepared at known analyte concentrations. Linear regression fits a straight line relating concentration to signal. Unknown samples’ signals are then placed on the fitted line to estimate their concentrations.
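A minimal sketch of a standard-curve fit and back-calculation (the calibration numbers are hypothetical):

```python
import numpy as np

# Hypothetical calibration data: known concentrations and measured absorbances.
conc = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])          # e.g., mg/L
signal = np.array([0.02, 0.21, 0.39, 0.61, 0.80, 1.01])   # e.g., absorbance

# Fit signal = intercept + slope * concentration.
slope, intercept = np.polyfit(conc, signal, deg=1)

# Invert the fitted line to estimate the concentration of an unknown sample.
unknown_signal = 0.52
estimated_conc = (unknown_signal - intercept) / slope
print(f"estimated concentration: {estimated_conc:.2f} mg/L")
```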
What distinguishes logistic regression from linear regression in terms of the outcome variable?
Logistic regression models a binary or categorical outcome (e.g., disease yes/no), whereas linear regression models a continuous outcome. Logistic regression uses the logit (log-odds) link to ensure predicted probabilities range between 0 and 1.
In logistic regression, what does the logit function (log-odds) accomplish?
The logit function transforms a probability p∈(0,1) to the entire real line (–∞, +∞) via log(p / (1–p)). This linearizes the relationship between predictors and log-odds, allowing fitting with standard linear model techniques.
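A small sketch of the logit and its inverse (the function names are my own):

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to the whole real line: log(p / (1 - p))."""
    return np.log(p / (1 - p))

def inverse_logit(x):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

print(logit(0.5))            # 0.0
print(logit(0.9))            # ≈ 2.197
print(inverse_logit(2.197))  # ≈ 0.9
```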
How is an odds ratio (OR) derived from a logistic regression coefficient (β)?
The odds ratio for a one-unit increase in a predictor is exp(β). If β = 0.5, then OR = exp(0.5) ≈ 1.65, meaning the odds of the outcome are 1.65 times higher per unit increase in that predictor.
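The corresponding arithmetic, with a hypothetical standard error added to show the CI on the OR scale:

```python
import numpy as np

beta = 0.5                  # hypothetical logistic regression coefficient
odds_ratio = np.exp(beta)   # ≈ 1.65
print(odds_ratio)

# A 95% CI for the OR is obtained by exponentiating the CI of beta
# (the standard error here is hypothetical):
se = 0.2
print(np.exp(beta - 1.96 * se), np.exp(beta + 1.96 * se))   # ≈ (1.11, 2.44)
```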
When is an odds ratio (OR) considered statistically significant based on its 95% CI?
If the 95% CI for the OR does not include 1, the association is statistically significant at the 5% level. Inclusion of 1 implies no significant difference in odds between comparison groups.
Why are odds ratios used in retrospective (case–control) studies instead of risk ratios?
In case–control studies, the total number of exposed individuals in the population is unknown, so absolute risks cannot be calculated. However, odds within cases and controls are observable, allowing OR computation.
Under what conditions can risk ratios (RR) be used instead of odds ratios (OR)?
Risk ratios can be calculated in prospective (cohort) studies where both the number of events and total individuals at risk in each exposure group are known, allowing direct computation of incidence proportions.
How do odds ratios and risk ratios relate when the event of interest is rare?
When the event is rare (i.e., incidence is very low), the numerical values of OR and RR are very close because odds and probabilities converge when probabilities are small.
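A worked 2×2 example with hypothetical counts showing the convergence:

```python
# Hypothetical 2x2 counts; the event is rare in both exposure groups.
a, b = 20, 9980   # exposed:   20 events, 9,980 non-events
c, d = 10, 9990   # unexposed: 10 events, 9,990 non-events

risk_exposed = a / (a + b)           # 0.0020
risk_unexposed = c / (c + d)         # 0.0010
rr = risk_exposed / risk_unexposed   # 2.000

odds_exposed = a / b
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed   # ≈ 2.002

print(rr, odds_ratio)   # nearly identical because the event is rare
```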
What key assumptions are shared by both linear and logistic regression models?
Both models assume that observations are independent, that predictors are measured without error, and that the specified link function (identity for linear, logit for logistic) correctly describes how predictors relate to the transformed outcome.
Why is it incorrect to infer causation solely from an observed association in regression or correlation analyses?
Regression and correlation reveal statistical association but do not control for unmeasured confounders or establish temporal precedence. Establishing causation requires controlled experimental design, such as randomized controlled trials, to rule out alternative explanations.