Concepts Flashcards
(24 cards)
What are the 3 parts of expected test error?
Squared bias error
Variance error
Irreducible error
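Putting these together for a test point x0 (with y0 = f(x0) + ε), the standard decomposition is:

\[
E\big[(y_0 - \hat{f}(x_0))^2\big] \;=\; \mathrm{Var}\big(\hat{f}(x_0)\big) \;+\; \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 \;+\; \mathrm{Var}(\varepsilon)
\]

Var(ε) is the irreducible error; the first two terms make up the reducible error.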
What is bias in a model?
Bias refers to the error that is introduced by approximating a real-life problem (which may be complicated) with a simpler model. It’s the difference between the model’s average prediction and the true value we are trying to predict.
As model flexibility decreases, bias typically increases.
Causes the type of error frequently thought of as “underfitting” -> a high-bias model might generalize well, but at the cost of not capturing important nuances in the data.
Examples of models with high bias.
Linear Regression, LDA, Logistic Regression
Examples of models with low bias.
Decision Trees, Random Forest, XGBoost, KNN, SVM
What is prediction in ML?
Use the existing data to build an estimator to predict outcomes for new data points. The accuracy of the predictions depends on two sources of error - reducible and irreducible error.
What is inference in ML?
We are interested in understanding the data-generating process, i.e. the association between the response and the predictors. The estimator can’t be treated as a black box, and model interpretability is very important.
Important points to consider -
1. Which subset of predictors is associated with the response?
2. What is the relationship between each predictor and the response? Positive/negative, or possibly dependent on other predictors.
3. Is the relationship between the predictors and the response linear or more complex?
Models - restrictive models like linear regression are often preferred.
What is a parametric learning algorithm?
An algorithm that simplifies the data-generating process to a known form, reducing the problem of estimating f to estimating a set of parameters.
Two steps - 1. make an assumption about the functional form or shape of the estimator
2. estimate the parameters using the training data
The number of parameters is fixed with respect to the amount of training data.
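A minimal sketch of a parametric fit (plain least-squares linear regression on synthetic data); the variable names and data below are made up for illustration:

```python
# Parametric method sketch: (1) assume a linear functional form,
# (2) estimate its fixed set of parameters from training data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 observations, 3 predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=100)

# Step 1: assumed form f(x) = b0 + b1*x1 + b2*x2 + b3*x3
# Step 2: estimate the parameters by least squares
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# 4 parameters, no matter how many training rows we add
print(beta)
```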
Pros and cons of parametric learning algorithms.
Pros
1. Simple, easy to understand and interpret results
2. Speed - limited set of parameters
3. Less data - do not require as much training data and can work well even if the fit to the data is not perfect
Cons
1. Constrained - limited to the chosen functional form
2. Limited Complexity
3. Poor fit - unlikely to match the underlying mapping function
What is a non-parametric learning method?
A method that does not make strong assumptions about the functional form of f. It seeks to fit the training data as closely as possible while maintaining the ability to generalize to unseen data.
Pros/Cons of non-parametric learning methods
Pros - 1. Flexibility - capable of fitting many functional forms. 2. Can result in high performance. 3. Powerful - no assumptions about the underlying function.
Cons - 1. Need a large amount of training data. 2. Slower to train. 3. Might overfit and are less interpretable.
e.g. KNN (see the sketch below)
Trade off between model flexibility and interpretability
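As a concrete non-parametric example, a minimal KNN regression sketch (assuming scikit-learn is available; data is synthetic):

```python
# Non-parametric sketch: no functional form is assumed; a prediction is the
# average of the k nearest training responses.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
print(knn.predict([[0.5]]))   # mean of the 5 nearest y values
```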
Define variance.
Variance refers to the amount by which the estimator would change if we estimated it using a different training set. If a method has high variance, a small change in the training data can result in a large change in f. More flexible methods have higher variance. eg - KNN, Decision Trees
** Causes the type of error frequently thought of as “overfitting” -> the model captures most nuances in the training data, including noise, which reduces its ability to generalize to unseen data.
How does bias/variance change with model flexibility?
As we use more flexible models, variance increases while bias decreases. The relative rate of change between the two decides how the test error changes. As we increase flexibility, bias initially decreases faster than variance increases. But at some point, further increasing flexibility has little impact on bias while significantly increasing the variance. When this happens, the test error increases.
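A rough illustration (synthetic data, polynomial degree as the flexibility knob); exact numbers will vary, but training MSE keeps falling while test MSE eventually rises:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-3, 3, 50)
y_train = np.sin(x_train) + rng.normal(0, 0.3, 50)
x_test = rng.uniform(-3, 3, 200)
y_test = np.sin(x_test) + rng.normal(0, 0.3, 200)

for degree in [1, 3, 10, 15]:                      # increasing flexibility
    coefs = np.polyfit(x_train, y_train, degree)   # fit polynomial of given degree
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```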
Bias/Variance Trade Off.
Test error = squared bias + variance + irreducible error.
A good estimator has low test error - i.e. low variance and low bias.
It is a trade-off because it is easy to find a method with high bias and low variance, or with low bias and high variance; the challenge lies in finding the optimal balance.
Also talk about model flexibility and how it affects bias/variance.
Define resampling method in ML.
Resampling involves repeatedly drawing samples from a training set and refitting a model on each sample in order to obtain information about the fitted model. It can provide additional information that would not be available if the model were fitted only once on the training set. Useful when the training set is small.
Cons - computationally expensive
Two methods - cross-validation, bootstrapping
Cons of using one validation set.
- The validation estimate of the test error rate is highly variable, depending on precisely which observations are used to train the model and which are used for validation.
- Only a subset of the observations is used to fit the model. Since ML models tend to perform worse when trained on fewer observations, the validation error rate may overestimate the test error rate of the model fit on the entire dataset.
Explain LOOCV.
Involves splitting the observations into two parts - the model is fit on n-1 data points and the single remaining point is used for prediction and for computing the test error. This is repeated n times, with each data point used as the validation set exactly once and the remaining n-1 observations used for training. This gives n test errors, which are then averaged.
Equivalent to k-fold CV with k = n.
Gives an approximately unbiased estimate of the test error. Performing LOOCV multiple times gives the same result (there is no randomness in the splits). Computationally expensive.
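A minimal LOOCV sketch, assuming scikit-learn (synthetic data for illustration):

```python
# LOOCV: n models, each trained on n-1 points and scored on the held-out point;
# the n per-point errors are averaged into one estimate of the test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=30)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(len(scores), -scores.mean())   # 30 errors, averaged
```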
Explain k-fold CV
The training data is divided into k folds of approximately equal size. The model is trained on k-1 folds and the held-out fold is used for predictions. This is repeated k times, resulting in k estimates of the test error. The final error is the average of the k errors.
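A minimal k-fold CV sketch with k = 5, again assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(scale=0.5, size=100)

# Train on 4 folds, score on the 5th, repeat 5 times, average the errors.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
print(-scores.mean())
```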
Compare LOOCV and k-fold cross-validation (bias/variance trade-off in CV).
- LOOCV is computationally expensive since n models are trained.
- LOOCV gives an approximately unbiased estimate of the error (since n-1 observations are used for training); k-fold CV gives an estimate with intermediate bias.
- The LOOCV estimate has higher variance.
Why is the variance of the estimated test error higher in LOOCV than in k-fold CV?
For LOOCV, we average the outputs of n fitted models, each trained on an almost identical dataset; therefore these outputs are highly correlated. In contrast, for k-fold CV we average the outputs of k models that are somewhat less correlated, since the overlap between their training sets is smaller. The mean of highly correlated quantities has higher variance than the mean of less correlated quantities, hence the variance of the estimated test error is higher with LOOCV.
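A quick way to see this (assuming, for illustration, that the n fold-level errors e_i each have variance σ² and common pairwise correlation ρ):

\[
\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right) \;=\; \frac{\sigma^2}{n} \;+\; \frac{n-1}{n}\,\rho\,\sigma^2
\]

The second term grows with ρ, and LOOCV's nearly identical training sets push ρ up.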
Explain the bootstrap sampling method.
The bootstrap can be used to quantify the uncertainty associated with an estimate. From a dataset of n points, we randomly draw n points with replacement to produce one bootstrap sample, i.e. each point can appear more than once in a bootstrap sample. The standard error of the estimates computed across these bootstrap samples serves as an estimate of the standard error of the quantity in the original dataset.
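A minimal bootstrap SE sketch in numpy (the median is used as the example statistic; data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # the original sample, n = 100

B = 1000
boot_medians = np.empty(B)
for b in range(B):
    # draw n points with replacement -> one bootstrap sample
    sample = rng.choice(data, size=len(data), replace=True)
    boot_medians[b] = np.median(sample)

# SD of the bootstrap estimates ~= SE of the median on the original data
print(boot_medians.std(ddof=1))
```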
Variable importance in bagged trees.
The total amount of reduction in RSS due to splits over a given predictor, averaged over all trees.
Bagged trees vs random forest
Random forest - as in bagging, we build a number of trees on bootstrapped samples. However, when splitting a node during tree construction, only a random subset of m predictors (out of the full set of p) is considered as split candidates, with a fresh subset drawn for each split.
Advantage - if there is one strong predictor, all bagged trees will use it for the top split and will therefore be highly correlated, resulting in high variance of the averaged prediction.
In a random forest, on average (p-m)/p of the splits will not even consider the strong predictor -> decorrelated trees, hence more reliable predictions.
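A rough comparison using scikit-learn's RandomForestRegressor: max_features=None makes every split consider all p predictors (plain bagging), while max_features='sqrt' considers only m ≈ √p predictors per split (random forest). Data is synthetic for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

bagged = RandomForestRegressor(n_estimators=200, max_features=None, random_state=0)    # bagging
forest = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)  # random forest
bagged.fit(X, y)
forest.fit(X, y)

# Importances show how the random forest spreads splits across more predictors.
print(bagged.feature_importances_.round(2))
print(forest.feature_importances_.round(2))
```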
Boosted Trees
Trees are grown sequentially, each one learning from information in the previously grown trees. Given the current model, the next tree is fit to the residuals of the current model. The new tree is then added to the current model and the residuals are updated.
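A minimal boosting-for-regression sketch (squared-error loss, small trees as weak learners, scikit-learn assumed); this only illustrates the fit-to-residuals loop, not a full implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from the zero model
residuals = y.copy()
trees = []

for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2)        # a small ("weak") tree
    tree.fit(X, residuals)                           # fit to the current residuals
    prediction += learning_rate * tree.predict(X)    # add the shrunken tree
    residuals = y - prediction                       # update the residuals
    trees.append(tree)

print(np.mean(residuals ** 2))   # training MSE after boosting
```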