Probabilistic Linear Regression Flashcards
Chapter 2
- In the probabilistic approach to linear regression, we say the true model is t = wᵀx + ε. How does assuming a Gaussian distribution for ε help us quantify predictions and parameter uncertainty?
By modeling the noise ε as Gaussian with mean 0 and variance σ², we can derive likelihood expressions and use them to find parameter estimates that maximize the probability of the observed data. This also yields formulas for variances and covariances of those estimates, enabling confidence intervals and uncertainty quantification for predictions.
- Given a dataset of pairs (xₙ, tₙ) and a linear model tₙ = wᵀxₙ + εₙ with Gaussian noise N(0, σ²), how do we write down the likelihood for the entire dataset?
We assume the data points are independent. The total likelihood is the product of individual Gaussian likelihoods for each point: L = ∏ₙ p(tₙ|w, σ²) = ∏ₙ N(tₙ | wᵀxₙ, σ²). Typically, we maximize the log-likelihood, log L = ∑ₙ log N(tₙ | wᵀxₙ, σ²).
- Show an example problem: Suppose we have 5 data points {(xₙ, tₙ)} in one dimension (x ∈ ℝ) with t = w₀ + w₁x + ε. How would you set up and solve for w₀, w₁, and σ² using maximum likelihood?
(1) Form the design matrix X with a column of 1s and a column of x-values. (2) Write down the log-likelihood assuming p(t|X,w,σ²) = N(Xw, σ²I). (3) Differentiate wrt w to get ŵ = (XᵀX)⁻¹ Xᵀ t. (4) Plug ŵ back into the expression for σ² to find σ̂² = (1/N) (t - Xŵ)ᵀ (t - Xŵ).
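The four steps above can be sketched in NumPy; the five data points below are made up purely for illustration:

```python
import numpy as np

# Hypothetical 5-point dataset (values invented for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# (1) Design matrix: a column of 1s and a column of x-values
X = np.column_stack([np.ones_like(x), x])

# (3) Normal equations: w_hat = (X^T X)^{-1} X^T t
w_hat = np.linalg.solve(X.T @ X, X.T @ t)

# (4) ML variance estimate: (1/N) * residual sum of squares
residuals = t - X @ w_hat
sigma2_hat = residuals @ residuals / len(t)

print(w_hat)       # [w0_hat, w1_hat]
print(sigma2_hat)
```

Using `np.linalg.solve` rather than explicitly inverting XᵀX is the numerically preferred way to solve the normal equations.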
- How does the probabilistic view differ from the classical least-squares view in linear regression, and why do both end up yielding the same formula for ŵ?
The probabilistic view interprets the noise as Gaussian and maximizes the likelihood of observing the data; the classical least-squares view minimizes the sum of squared residuals. Both yield the same normal equations because maximizing exp(-residual²/(2σ²)) is equivalent to minimizing residual², leading to ŵ = (XᵀX)⁻¹ Xᵀ t.
- Why is modeling the errors εₙ explicitly so important if all we end up doing is the same normal equation solution?
Because it lets us quantify uncertainty. Merely minimizing sums of squares finds ŵ, but the probabilistic view also gives us estimates for σ², confidence intervals for ŵ, and predictive distributions for new t-values. This is crucial for risk assessment and understanding the reliability of our predictions.
- Consider predicting the 2012 men’s 100m Olympic winning time with a linear model. The data has random noise around a trend. How does including a Gaussian noise term ε with variance σ² inform our confidence in that 2012 prediction? (What is var{t_new}?)
By estimating σ² from historical data, we can compute var{t_new} = σ² x_newᵀ (XᵀX)⁻¹ x_new, the contribution of parameter uncertainty to the prediction (adding σ² on top accounts for the noise on the new observation itself). This formula shows how uncertainty in the parameters propagates to uncertainty in the new prediction. When x_new (here, the year 2012) lies far from the bulk of the training data, the uncertainty grows.
- What does it mean that the maximum-likelihood estimate ŵ is ‘unbiased,’ and how do we formally show it in the probabilistic regression model?
‘Unbiased’ means E[ŵ] = w, i.e., on average across many datasets generated from the true w, the estimate ŵ recovers that true w. Formally, E[ŵ] = E[(XᵀX)⁻¹Xᵀ t] = (XᵀX)⁻¹Xᵀ E[t], and E[t] = Xw since the noise has zero mean. Hence E[ŵ] = (XᵀX)⁻¹XᵀXw = w.
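A quick simulation illustrates unbiasedness empirically; the true w, noise level, and inputs below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameters (hypothetical values for this sketch)
w_true = np.array([1.0, 2.0])
sigma = 0.5
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])

# Refit w_hat on many datasets drawn from the same true model
estimates = []
for _ in range(5000):
    t = X @ w_true + rng.normal(0, sigma, size=len(x))
    estimates.append(np.linalg.solve(X.T @ X, X.T @ t))

# The average of w_hat across datasets should approach w_true
print(np.mean(estimates, axis=0))
```

Each individual ŵ scatters around w_true, but the average over many simulated datasets converges to it, which is exactly what E[ŵ] = w asserts.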
- Provide an example problem showing why σ̂² = (1/N)(t - Xŵ)ᵀ(t - Xŵ) is biased for small samples. What is the exact expectation of σ̂² under the true model?
If we simulate tₙ = wᵀxₙ + εₙ with εₙ ~ N(0, σ²) for a small dataset, we typically find σ̂² < σ². Formally, E[σ̂²] = σ²(1 − D/N), so σ̂² is systematically lower than σ² for any D > 0, and markedly so when N is small relative to D. That is because ŵ itself is fit to the same data, artificially reducing the residual sum of squares.
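A minimal simulation of this bias, assuming a small hypothetical dataset with N = 10 and D = 2 (intercept plus slope):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small sample: N = 10 points, D = 2 parameters
N, D = 10, 2
sigma2_true = 1.0
x = np.linspace(0, 1, N)
X = np.column_stack([np.ones(N), x])
w_true = np.array([0.5, -1.0])  # arbitrary true parameters

sigma2_hats = []
for _ in range(20000):
    t = X @ w_true + rng.normal(0, np.sqrt(sigma2_true), size=N)
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    r = t - X @ w_hat
    sigma2_hats.append(r @ r / N)   # biased ML estimator

mean_hat = np.mean(sigma2_hats)
print(mean_hat)                     # close to sigma2_true * (1 - D/N) = 0.8
print(mean_hat * N / (N - D))       # corrected estimate: close to 1.0
```

The second printed value applies the N/(N − D) correction discussed in the next card, recovering the true variance on average.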
- If our dataset has N points and D parameters, what is the intuitive explanation for why σ̂² underestimates the true variance σ² by the factor (1 - D/N)?
When we solve for ŵ, the regression line fits part of the data ‘too well,’ because ŵ is chosen to minimize residuals. We use the same data to estimate σ², so the apparent residual variance is reduced. We lose D degrees of freedom in matching w, thus we correct by the factor (1 - D/N).
- Show a short practical problem: If we have D=2 parameters and N=10 data points, how would you adjust the biased estimator σ̂² to get an unbiased estimate of σ²?
Since E[σ̂²] = σ²(1 - D/N) = σ²(1 - 2/10) = σ²(0.8), you can multiply σ̂² by 1/(1 - 2/10) = 1/0.8 = 1.25 to get an unbiased estimate of σ². In other words, use σ̃² = (N/(N - D)) σ̂².
- How do we calculate the covariance of the estimated parameters, cov{ŵ}, and what does it tell us about ŵ?
Under the Gaussian noise assumption, cov{ŵ} = σ² (XᵀX)⁻¹. The diagonal elements show how much each parameter can vary; large diagonal entries mean low precision (high uncertainty). Off-diagonal elements indicate correlation between parameters (how they move together to maintain a good fit).
- Provide an example problem where x-values are very close together, and show how it affects (XᵀX) and thus cov{ŵ}.
If all x-values in a dataset are nearly the same, XᵀX becomes nearly singular, making (XᵀX)⁻¹ huge. Numerically, suppose x = [1.0, 1.1, 1.05,…], then the design matrix columns are almost linearly dependent. The result is a very large cov{ŵ}, meaning the parameters are not well identifiable from the data.
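A sketch comparing well-spread inputs with inputs clustered near 1.0 (all numbers here are illustrative):

```python
import numpy as np

def cov_w(x, sigma2=1.0):
    """cov{w_hat} = sigma^2 (X^T X)^{-1} for a 1-D model with intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return sigma2 * np.linalg.inv(X.T @ X)

# Well-spread inputs vs. inputs clustered around 1.0
spread = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
clustered = np.array([1.00, 1.10, 1.05, 0.95, 1.02])

print(np.diag(cov_w(spread)))     # modest parameter variances
print(np.diag(cov_w(clustered)))  # variances larger by orders of magnitude
```

With clustered x-values the two columns of X are nearly linearly dependent, XᵀX is nearly singular, and the diagonal of cov{ŵ} blows up: the intercept and slope are barely identifiable.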
- How do we form a predictive distribution for a new input x_new based on ŵ and σ̂², and what is the variance of that prediction? (What is the variance of a new datapoint prediction)
We have t_new = wᵀx_new + ε, and w is estimated by ŵ. The predictive variance is var{t_new} = σ̂² + x_newᵀ cov{ŵ} x_new = σ̂² + σ̂² x_newᵀ (XᵀX)⁻¹ x_new = σ̂²[1 + x_newᵀ (XᵀX)⁻¹ x_new]. The first term reflects noise in the outcome; the second term is uncertainty in ŵ.
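The predictive-variance formula can be sketched as follows, reusing a made-up one-dimensional dataset:

```python
import numpy as np

# Fit on hypothetical data, then compute predictive variance at x_new
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
r = t - X @ w_hat
sigma2_hat = r @ r / len(t)
XtX_inv = np.linalg.inv(X.T @ X)

def predict(x_new):
    phi = np.array([1.0, x_new])
    mean = phi @ w_hat
    # sigma2_hat * [1 + phi^T (X^T X)^{-1} phi]: noise + parameter uncertainty
    var = sigma2_hat * (1.0 + phi @ XtX_inv @ phi)
    return mean, var

print(predict(2.0))   # near the data: smaller variance
print(predict(10.0))  # extrapolation: larger variance
```

Comparing the two calls shows the second term at work: the noise floor σ̂² is the same, but the parameter-uncertainty term grows as x_new moves away from the training inputs.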
- Give a mini-challenge: Suppose you train a polynomial model of degree 3 and a polynomial model of degree 8 on the same data. Both produce a best-fit line, but the degree-8 model has huge cov{ŵ}. Why might the simpler model sometimes yield tighter predictions far from the training set?
High-degree polynomials can ‘bend’ to fit noise, leading to poor identifiability of coefficients (large cov{ŵ}). When extrapolating far from the data, small changes in high-degree coefficients can swing predictions wildly. By contrast, the simpler degree-3 model might maintain stable coefficients and hence smaller predictive variance far from training points.
- How does cross-validation still remain a reliable method for model selection even when we switch from a purely least-squares perspective to a probabilistic (Gaussian) perspective?
Cross-validation directly tests predictive performance on held-out data. While a probabilistic approach gives parameter uncertainties and likelihoods, over-complex models can inflate the training likelihood. CV bypasses that by empirically assessing out-of-sample error, helping you choose a balance between complexity and generalization.
- Show a short real-world scenario: You want to forecast next year’s sales based on advertising spend, price, and competitor data. Explain how the probabilistic approach with linear regression would inform your confidence in the forecast.
1) Collect (xₙ, tₙ). 2) Fit ŵ and σ̂² by maximizing Gaussian likelihood. 3) Use cov{ŵ} = σ̂² (XᵀX)⁻¹ to measure parameter uncertainty. 4) Predict next year’s sales with t_new = ŵᵀ x_new, but also compute var{t_new} = x_newᵀ cov{ŵ} x_new + σ̂². This gives a predictive range and expresses how certain or uncertain the model is about next year’s sales.
- Why does the predictive variance var{t_new} typically increase the further x_new is from the bulk of the training data?
Because x_newᵀ (XᵀX)⁻¹ x_new grows when x_new lies farther from where the design matrix X provides strong coverage. This term inflates the total predictive variance. With fewer nearby data to anchor the fit, small changes in ŵ are magnified, increasing overall uncertainty.
- Suppose you have a linear model t = wᵀx + ε, and you sample many parameter vectors q from N(ŵ, cov{ŵ}). How does examining the distribution of t_new = qᵀ x_new help?
By taking many q samples, you see how t_new varies across plausible parameter values consistent with your data. This produces a distribution over t_new, illustrating all likely outcomes rather than a single point estimate. It’s a form of approximate Bayesian model averaging within the maximum-likelihood framework.
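A sketch of this sampling procedure, using NumPy’s multivariate normal sampler on a made-up dataset:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fit, then sample plausible parameter vectors q ~ N(w_hat, cov{w_hat})
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

w_hat = np.linalg.solve(X.T @ X, X.T @ t)
r = t - X @ w_hat
sigma2_hat = r @ r / len(t)
cov_w = sigma2_hat * np.linalg.inv(X.T @ X)

x_new = np.array([1.0, 6.0])          # new input (with the bias term 1)
q = rng.multivariate_normal(w_hat, cov_w, size=10000)
t_new_samples = q @ x_new             # a distribution over predictions

print(t_new_samples.mean())           # close to w_hat @ x_new
print(t_new_samples.std())            # spread due to parameter uncertainty
```

The sample mean matches the point prediction ŵᵀx_new, while the sample spread matches x_newᵀ cov{ŵ} x_new, i.e., the parameter-uncertainty term of the predictive variance.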
- How would you apply the probabilistic linear regression framework to the men’s 100m sprint times if you suspected multiple features (year, track condition, temperature)? Provide an example approach.
(1) Collect historical data: each instance xₙ = (1, yearₙ, track_conditionₙ, temperatureₙ), and tₙ = winning_timeₙ. (2) Construct X and solve ŵ = (XᵀX)⁻¹ Xᵀ t. (3) Estimate σ̂² from residuals. (4) For any new scenario (year, track condition, temperature), predict time as ŵᵀ x_new and compute the predictive variance. This handles multiple input dimensions in the same Gaussian-likelihood framework.
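A sketch of steps (1)–(4) on synthetic data; all feature values and coefficients below are invented, and the year is offset from 1980 for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical features per race: (1, year offset, track condition, temperature)
year = np.arange(0.0, 32.0, 4.0)               # years since 1980: 8 races
track = rng.uniform(0.0, 1.0, size=len(year))  # made-up 0-1 condition score
temp = rng.uniform(15.0, 30.0, size=len(year)) # made-up degrees Celsius
X = np.column_stack([np.ones(len(year)), year, track, temp])

# Synthetic winning times: an assumed downward trend plus Gaussian noise
t = 10.2 - 0.005 * year + 0.1 * track - 0.002 * temp \
    + rng.normal(0.0, 0.05, size=len(year))

# (2) solve the normal equations, (3) estimate sigma^2 from residuals
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
r = t - X @ w_hat
sigma2_hat = r @ r / len(t)

# (4) predict a new scenario (year 2012 -> offset 32) with its variance
x_new = np.array([1.0, 32.0, 0.8, 22.0])
mean = x_new @ w_hat
var = sigma2_hat * (1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new)
print(mean, var)
```

Nothing changes structurally compared with the one-dimensional case: extra features simply become extra columns of X, and the same Gaussian-likelihood formulas apply.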
- Summarize in your own words the key benefits of taking a probabilistic approach to linear regression and how it improves on plain least squares when dealing with real data with uncertainty.
It provides (1) a natural way to estimate and interpret noise variance, (2) a principled derivation of parameter estimates as maximum likelihood, (3) formulas for parameter uncertainty (cov{ŵ}), and (4) predictive distributions (with means and variances) for new inputs. These enhancements are crucial for risk management, confidence intervals, and any scenario where understanding uncertainty is as important as the prediction itself.
- What is the formula for a linear model?
tₙ = w₀ + w₁xₙ,₁ + w₂xₙ,₂ + w₃xₙ,₃ + … + w_D xₙ,D
Represents the prediction of the response variable tₙ as a weighted sum of the inputs.
- What does the vector xₙ represent in a linear model?
xₙ = [1, xₙ,₁, xₙ,₂, …, xₙ,D]ᵀ
It includes a constant term (multiplying the intercept w₀) and the input features.
- What is the matrix X in the context of a linear regression model?
X is the N × (D+1) design matrix whose n-th row is xₙᵀ = [1, xₙ,₁, xₙ,₂, …, xₙ,D]
It stacks the input features (plus the constant 1) for each of the N observations.
- What does the vector t represent in linear regression?
t = [t₁, t₂, …, t_N]ᵀ
It contains the actual response values for each observation.