Complexity/Selection/Regularization Flashcards
(35 cards)
What is the bias-variance decomposition?
E[(Y - f̂_D(x))^2] = Bias^2 + Variance + σ^2, where Bias = E_D[f̂_D(x)] - f(x), Variance = E_D[(f̂_D(x) - E_D[f̂_D(x)])^2], and σ^2 is the irreducible noise.
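A minimal Monte Carlo sketch of the decomposition (the sine target, noise level, and polynomial degree are illustrative assumptions, not from the deck):

```python
# Monte Carlo check of the decomposition at a single point x0 (illustrative
# setup: true f(x) = sin x, Gaussian noise, a degree-1 polynomial fit).
import numpy as np

rng = np.random.default_rng(0)
f, sigma, x0 = np.sin, 0.3, 1.0
n, reps = 30, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, np.pi, n)
    y = f(x) + rng.normal(0, sigma, n)      # fresh data set D
    coef = np.polyfit(x, y, deg=1)          # fit f̂_D on D
    preds[r] = np.polyval(coef, x0)         # f̂_D(x0)

bias2 = (preds.mean() - f(x0))**2
var = preds.var()
print(bias2, var, bias2 + var + sigma**2)   # ≈ E[(Y - f̂_D(x0))^2]
```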
What does high bias in a model indicate?
The model systematically misses the true function (underfitting). This happens when the model's inductive bias, i.e., its preference for a particular kind of solution (e.g., a linear model), is too restrictive for the data.
What does high variance in a model indicate?
The estimated model changes significantly when trained on different data sets, indicating overfitting.
What is the VC dimension?
The maximum number of points that the set of classifiers can shatter: for every possible labeling of those points, at least one classifier in the set classifies all of them correctly.
What is the VC dimension of a linear classifier on R^p?
VC = p + 1
How are Degrees of Freedom (DF) defined for an estimate ŷ = f̂(X)?
df(ŷ) = (1/σ^2) Σ cov(ŷ_i, y_i) = (1/σ^2) tr(cov(ŷ, y))
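A quick numerical sketch, assuming an ordinary least-squares fit so that ŷ = Hy with H = X(XᵀX)⁻¹Xᵀ, in which case the covariance definition reduces to df = tr(H) = p (the random design X is arbitrary):

```python
# Numerical check: for OLS, df = (1/σ²) tr(cov(ŷ, y)) = tr(H) = p.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: ŷ = H y
print(np.trace(H))                     # ≈ p, the number of fitted parameters
```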
What is an intuition/interpretation of the degrees of freedom of an estimate ŷ = f̂(X)?
Degrees of freedom represent the effective number of independent parameters the model can adjust to fit the data, and hence its flexibility; for ordinary least squares with p predictors, df = p.
Relationship between complexity, bias, variance and total error
The higher the model complexity, the lower the bias and the higher the variance; the total error is convex in complexity, first decreasing and then increasing, with its minimum at an intermediate complexity.
What does a good bias require?
Domain knowledge.
What is the relationship between degrees of freedom, number of samples, number of features, and λ in ridge regression?
In ridge regression, df(λ) = Σ_j d_j^2 / (d_j^2 + λ), where the d_j are the singular values of X. At λ = 0, df = min(n, p); as λ increases, the degrees of freedom shrink toward 0, reducing model complexity and helping to prevent overfitting, especially when the number of samples is small relative to the number of features.
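A small numpy sketch of this singular-value formula (X and the λ values are arbitrary):

```python
# Ridge degrees of freedom from the singular values d_j of X:
# df(λ) = Σ d_j^2 / (d_j^2 + λ)
import numpy as np

def ridge_df(X, lam):
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(2).normal(size=(50, 10))
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_df(X, lam))   # df falls from min(n, p) toward 0
```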
What is the PRESS statistic?
Predicted Residual Error Sum of Squares: PRESS = Σ(y_i - ŷ_-i)^2, where ŷ_-i is the prediction for the i-th sample when the model is estimated on all but the i-th sample.
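For linear regression, PRESS can be computed without refitting n times, using the standard leave-one-out identity y_i - ŷ_-i = (y_i - ŷ_i) / (1 - h_ii); a sketch assuming a full-rank design matrix X:

```python
# PRESS without n refits, via the leave-one-out leverage identity.
import numpy as np

def press(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    resid = y - H @ y                      # ordinary residuals
    h = np.diag(H)                         # leverages h_ii
    return np.sum((resid / (1 - h))**2)
```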
Higher model complexity effect on performance? What to do?
Higher complexity always improves the fit on training data, but not necessarily on test data. Therefore we need to select a model by estimating its performance on held-out validation/test data, not just the training data.
What is a method for validation?
Cross validation: estimate generalization error on different train/test splits.
What is Leave-one-out Cross-Validation (LOO-CV)?
A method where the model is trained on all but one sample and tested on the left-out sample, repeated for all samples. The average prediction error is reported.
What is k-fold Cross-Validation?
A method where data is split into k subsets, the model is trained on k-1 subsets and tested on the remaining one, repeated k times. Often k=5 or k=10 is used in practice.
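A minimal numpy-only sketch of k-fold CV, assuming a least-squares model (the function name and the MSE metric are placeholder choices):

```python
# k-fold CV: train on k-1 folds, test on the held-out fold, average errors.
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):            # held-out fold
        train = np.setdiff1d(idx, fold)            # remaining k-1 folds
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[fold] - X[fold] @ beta)**2))
    return np.mean(errs)   # average held-out error over the k folds
```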
What is the relationship between expected prediction error and expected training error?
expected pred error = expected train error + (2σ^2/n) * DF(model). The positive constant is 2σ^2/n, so the optimism of the training error grows with the model's degrees of freedom.
What is a better selection criterion?
Training error + complexity penalty. The larger the complexity (which tends to decrease training error but hurt generalization), the larger the penalty to keep it down!
If two models fit the data equally well, which should you select?
The one with less complexity (Occam's razor)!
What is the problem with cross validation?
If we test too many models, the selection itself overfits the validation data, and the chosen model's estimated error becomes optimistically biased.
What is the Bayes factor in model selection?
The ratio of marginal likelihoods: pr(x|m_i) / pr(x|m_j)
What is the Bayes Information Criterion (BIC)?
BIC(x;m) = -2 log pr(x|θ̂,m) + p log(n), where θ̂ is the maximum likelihood estimate, p is the number of parameters, and n is the number of samples.
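A sketch for the Gaussian linear-model case (the helper name bic_linear is hypothetical; the MLEs are β̂ from least squares and σ̂² = RSS/n, with σ² counted among the free parameters):

```python
# BIC = -2 log pr(x|θ̂,m) + (#params) log(n) for a Gaussian linear model.
import numpy as np

def bic_linear(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta)**2)
    sigma2 = rss / n                                       # MLE of σ²
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # log pr(x|θ̂,m)
    return -2 * loglik + (p + 1) * np.log(n)               # +1 counts σ²
```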
What is the Fisher Information Approximation (FIA)?
FIA(x;m) = -log pr(x|θ̂,m) + (p/2)log(n/2π) + log C_m, where C_m is the geometric complexity of the model.
What is the objective of l_k-penalized regression?
Minimize ω(θ) + λ||θ||_k^k, where ω(θ) is the loss function and ||θ||_k is the l_k norm of the parameters.
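For the k = 2 case (ridge) with squared-error loss ω(θ) = ||y - Xθ||^2, the minimizer has a closed form; a minimal sketch:

```python
# Ridge closed form: θ̂ = (XᵀX + λI)⁻¹ Xᵀ y
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```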
What is the double descent phenomenon?
As the number of parameters grows, test error first falls, then rises toward a peak near the interpolation threshold, then falls again in the heavily overparameterized regime, forming a double descent curve.