Model Selection Flashcards

(9 cards)

1
Q

What is the No Free Lunch theorem? What is its implication?

A

The theorem states that, when averaged over all possible learning problems, every learning algorithm achieves the same generalization accuracy. The implication is that no learner is universally superior.

2
Q

How can we decompose the prediction error? Explain each part.

A

The expected squared error on a new sample can be decomposed into three components:
- Irreducible error: error inherent to the data itself. No matter how perfect the model is, we cannot predict this noise.
- Bias: the error introduced by approximating the true underlying function with our model's expected prediction. High bias indicates that the model is too simple to capture the true patterns in the data; bias is low when the model is flexible enough to approximate the underlying function closely.
- Variance: measures how much the model's predictions vary across different training sets. High variance indicates that the model is very sensitive to small fluctuations in the training data, which often happens with highly complex models. Low variance, on the other hand, indicates that the model is relatively stable across different training sets.
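One standard way to write this decomposition for squared loss (a sketch, assuming y = f(x) + ε with zero-mean noise of variance σ² and a model f̂ fitted on a random training set D; not spelled out on the card):

```latex
\mathbb{E}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible error}}
  + \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
```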

3
Q

What is the trade-off between bias and variance?

A

More complex models typically reduce bias but increase variance, while simpler models reduce variance at the cost of higher bias.
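A minimal sketch of this trade-off (assuming numpy and scikit-learn are available; the dataset and polynomial degrees are illustrative, not from the card): low-degree polynomials underfit a noisy sine curve (high bias), very high degrees overfit it (high variance).

```python
# Hypothetical illustration: fit polynomials of increasing degree to noisy sine data
# and compare training vs. test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```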

4
Q

What is the curse of dimensionality?

A

In high dimensional spaces, the volume of the space increases exponentially with the number of dimensions. As a result, data points become increasingly sparse relative to the volume of the space. This sparsity means that even large datasets may not adequately cover the space, making it difficult to identify reliable patterns or statistical relationships.

In high dimensions, the relative difference between the distances to the nearest and farthest neighbors becomes negligible.
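A minimal sketch of this distance-concentration effect (assuming only numpy; the point counts and dimensions are illustrative):

```python
# Hypothetical illustration: in the unit hypercube, the relative gap between the
# nearest and farthest neighbor of a query point shrinks as the dimension grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))   # 500 random points in d dimensions
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    relative_gap = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  (max - min) / min = {relative_gap:.3f}")
```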

5
Q

What are the methods for model selection?

A

They are:
- feature selection: identifying a subset of relevant features.
- regularization: techniques such as lasso and ridge shrink the coefficients towards zero to reduce variance.
- dimension reduction: methods like PCA transform the original features into a lower-dimensional subspace that captures most of the variance.
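A minimal sketch of all three approaches (assuming scikit-learn; the synthetic dataset and hyperparameters are illustrative):

```python
# Hypothetical illustration of feature selection, regularization, and dimension reduction.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.decomposition import PCA

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Feature selection: keep the 5 features most associated with the target.
X_selected = SelectKBest(f_regression, k=5).fit_transform(X, y)

# Regularization: lasso drives many coefficients exactly to zero, ridge only shrinks them.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))

# Dimension reduction: project onto the directions explaining 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print("PCA components kept:", X_reduced.shape[1])
```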

6
Q

What are ensemble methods? Which ensemble methods exist? Explain each of them. In which circumstances is each more suitable? How does each method achieve its goal?

A

Ensemble methods aim to improve prediction accuracy by combining multiple models, thereby reducing variance or bias.

Ensemble methods:
- bagging: designed to reduce variance without increasing bias. It is particularly useful for models that are highly sensitive to the training data (models with high variance).
- boosting: designed to reduce bias. It works by sequentially training models, where each model corrects the errors of its predecessor.
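A minimal usage sketch (assuming scikit-learn; the estimator choices are illustrative): bagging is typically paired with high-variance base learners such as deep trees, boosting with high-bias learners such as decision stumps.

```python
# Hypothetical illustration contrasting the two ensemble families on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging of deep trees: averages many high-variance models.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting of decision stumps: sequentially improves a high-bias model.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("bagging accuracy: ", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```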

7
Q

Which ensemble method can be parallelized? Why can the other not?

A

The bagging ensemble method can be parallelized.

Boosting has to be executed in sequence, since each update of the weights for the next round of training depends on the result of the previous training.

8
Q

How does bagging work? Why does it reduce variance?

A
  1. Bootstrap sampling: generate multiple training datasets by sampling with replacement from the original dataset
  2. Train multiple models: train a separate model on each bootstrap sample
  3. Aggregate the predictions: for classification use majority vote, for regression use the average

By averaging multiple predictions, the overall variance of the ensemble is lower than that of the individual models.
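A minimal from-scratch sketch of these three steps for regression (assuming numpy and scikit-learn decision trees; names like `bagging_fit` are hypothetical, not from the card):

```python
# Hypothetical illustration of bagging: bootstrap, train, aggregate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=25, seed=0):
    """X, y are numpy arrays; returns one tree per bootstrap sample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        # 1. Bootstrap sampling: draw len(X) indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # 2. Train a separate model on each bootstrap sample.
        models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # 3. Aggregate: average the predictions (use majority vote for classification).
    return np.mean([m.predict(X) for m in models], axis=0)
```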

9
Q

How does boosting work? What is a key limitation of boosting?

A
  1. Train a weak model
    - Start with a simple model
    - Evaluate its errors on the training data
  2. Focus on hard-to-classify examples
    - Increase the weights of misclassified points so that the next model pays more attention to them
  3. Train a new model on the updated data: repeat the process, training a series of models where each corrects mistakes made by the previous ones
  4. Combine the models: weighted majority vote (classification) or weighted sum (regression) to make the final prediction

A key limitation is that boosting is more sensitive to noisy data, as it keeps focusing on misclassified points that might be outliers.
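A minimal from-scratch sketch of these steps in the style of AdaBoost for binary classification with labels in {-1, +1} (assuming numpy and scikit-learn decision stumps; names like `adaboost_fit` are hypothetical, not from the card):

```python
# Hypothetical illustration of boosting: reweight misclassified points each round.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """X is a numpy array, y contains labels in {-1, +1}."""
    n = len(X)
    weights = np.full(n, 1.0 / n)            # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # 1. Train a weak model on the currently weighted data and evaluate its errors.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum() / weights.sum()
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # weight of this weak model
        # 2./3. Focus on hard examples: increase weights of misclassified points.
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # 4. Combine the models with a weighted vote.
    scores = sum(a * s.predict(X) for a, s in zip(stumps, alphas))
    return np.sign(scores)
```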
