Machine Learning Flashcards
What two different types of problems exist in ML?
Regression and classification.
Regression is when the output we look for is quantitative, for instance a temperature.
Classification is when the output is qualitative, for instance a label from {Healthy, Sick}.
What is Ridge Regression?
Aka L2-regularized regression. A statistical regularization technique to prevent overfitting: it penalizes large weights, keeping them from growing too big.
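A minimal sketch of the idea, using the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy on a hypothetical toy dataset (the data and λ values here are made up for illustration):

```python
import numpy as np

# Hypothetical toy regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # 20 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=20)

# Closed-form ridge solution: w = (X^T X + lambda * I)^(-1) X^T y
lam = 1.0                              # regularization strength (lambda)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# A larger lambda shrinks the weights further toward zero
w_strong = np.linalg.solve(X.T @ X + 100.0 * np.eye(3), X.T @ y)
print(np.linalg.norm(w_strong) < np.linalg.norm(w_ridge))  # True
```

The comparison at the end illustrates the card's point: increasing λ limits how big the weights can get.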
What is the Confusion Matrix?
It is a way to visualize the results of binary classification in supervised ML. It shows the counts of predictions broken down by predicted label versus true label.
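A minimal sketch of building the 2×2 matrix by counting, on hypothetical predictions (rows = true label, columns = predicted label):

```python
import numpy as np

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count each (true, predicted) pair into a 2x2 matrix
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# The diagonal holds the correct predictions (TN and TP);
# the off-diagonal entries are the two error types (FP and FN).
```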
What is Logistic regression?
An algorithm that finds the weights for an (often binary) classification problem by modelling the class probability p(y|x) with the logistic (sigmoid) function.
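A minimal sketch of fitting those weights by gradient descent on the log-loss, using a hypothetical 1-D toy problem (class 1 for positive x):

```python
import numpy as np

# Hypothetical toy data: label is 1 exactly when x > 0
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])   # add an intercept column

w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid gives P(y=1 | x)
    w -= 0.1 * X.T @ (p - y) / len(y)       # gradient step on the log-loss

pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(float)
print((pred == y).mean())                   # accuracy on the training data
```

The learned decision boundary (where p = 0.5) ends up near x = 0, which is what separates the two classes in this toy set.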
What is the difference between parametric and non-parametric models?
A parametric model uses the training data to fit a function with a fixed number of parameters; once trained, predictions come from that function rather than from the data itself. Examples of parametric models are linear and logistic regression. So here you use training data to train a function (model), whilst in non-parametric models you use the data directly to map new test inputs, not a fitted function.
Non-parametric models directly relate the input data to the output data, e.g. kNN. These models grow with the data, but can easily be overfitted and are computationally heavy.
What is k-NN?
k-Nearest Neighbors is a simple non-parametric classifier where the classification is determined by the majority vote of the k nearest training points. k controls the bias-variance trade-off.
Needs a lot of data to represent the real world.
Not scalable to high dimensions: it relies on distances, which suffer from the curse of dimensionality.
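A minimal sketch of the majority-vote idea with Euclidean distance, on a hypothetical 2-D toy set:

```python
import numpy as np

# Hypothetical training set: two small clusters, classes 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = y_train[np.argsort(d)[:k]]      # labels of the k nearest points
    return np.bincount(nearest).argmax()      # majority vote

print(knn_predict(np.array([0.2, 0.1])))  # near the class-0 cluster
print(knn_predict(np.array([0.8, 0.9])))  # near the class-1 cluster
```

Changing k changes the trade-off on the card: k = 1 follows the data very closely (low bias, high variance), a large k smooths the decision (higher bias, lower variance).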
What is the difference between a discriminative model and a generative model?
A discriminative model describes how the output y is generated given the input, i.e. it models p(y|x) directly.
A generative model describes how both the inputs and the outputs are generated, i.e. it models the joint distribution p(x, y).
What is k-fold cross validation and what benefits does the method have?
It is a way to evaluate the model during training. Separate the training data into k parts (folds) and run the training k times, where in each run one of the k folds is excluded and used as the validation set. Then take the average of the k validation scores to evaluate the model. This gives a more accurate evaluation, since multiple "new" validation sets are used, and you also utilize all the data for training. Note that there might still be a need to keep a completely separate test set.
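A minimal sketch of building the k train/validation splits; the actual model training is left as a hypothetical step, and the sizes here are toy values:

```python
import numpy as np

n, k = 12, 4
idx = np.random.default_rng(2).permutation(n)   # shuffle the sample indices once
folds = np.array_split(idx, k)                  # k roughly equal folds

for i in range(k):
    val_idx = folds[i]                                    # fold i validates
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ...train on train_idx, evaluate on val_idx, collect the score...
    assert len(train_idx) + len(val_idx) == n             # all data used each run
    print(i, sorted(val_idx))
```

The k collected scores are then averaged into the final evaluation, as described on the card.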
What does the model complexity mean?
The model's ability to adapt to patterns in the data. High complexity means it can fit the data more closely than a lower-complexity model can.
When the complexity increases, the generalization gap (the difference between the error on training data and the error on test data) tends to increase; this is because the increased complexity of the model allows more overfitting.
What is bias and variance in Machine Learning?
The expected error on new data (E) can be decomposed into the bias (squared) plus the variance (plus irreducible noise).
The bias is the part of E that is due to the model not being able to represent the true function; high bias corresponds to underfitting, i.e. the model is too simple to show the true relation. Biased, misleading data can also contribute.
The other part of E is the variance, which is due to the variability of the trained model across different training datasets; high variance corresponds to overfitting.
What does CART stand for?
Classification And Regression Trees.
The depth of the tree determines the flexibility/complexity of the tree. A deep tree has been split many times -> low bias but large variance.
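A minimal sketch of how one CART split is chosen in a regression tree: pick the threshold that minimizes the total squared error of the two resulting regions. The 1-D toy data here is hypothetical:

```python
import numpy as np

# Hypothetical toy data: two well-separated clusters in x
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.9])

def split_cost(t):
    # Sum of squared errors around each region's mean after splitting at t
    left, right = y[x <= t], y[x > t]
    return ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()

candidates = (x[:-1] + x[1:]) / 2      # midpoints between sorted x values
best = min(candidates, key=split_cost)
print(best)   # the split lands in the gap between the two clusters
```

A full tree repeats this greedy search recursively in each region; the deeper the recursion, the lower the bias and the higher the variance.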
What is Bagging and why is it better than CART?
The idea is to train N decision trees and then average their results to obtain the final model; this is called Bagging (bootstrap aggregating).
The problem is that we rarely have N different datasets, so we use bootstrapping. Bootstrapping is, in statistics, the action of resampling with replacement from a dataset to create new datasets that still represent the same distribution.
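A minimal sketch of the bootstrap step: each of the N "datasets" is a resample, with replacement, of the original training indices (the sizes here are toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_trees = 10, 5

# One index resample (with replacement) per tree to be trained
bootstrap_sets = [rng.integers(0, n_samples, size=n_samples)
                  for _ in range(n_trees)]

# Each resample has the same size as the original data, but typically
# contains repeats and leaves roughly 1/e (about 37%) of the points out.
for b in bootstrap_sets:
    print(len(b), len(np.unique(b)))
```

Each tree is then trained on its own resample, and the trees' predictions are averaged.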
What is Random Forest and how does it compare to CART and Bagging?
RF is another algorithm built on bagging. In Bagging the bootstrapped datasets are correlated, which causes the variance reduction from averaging the trees to diminish.
In order to de-correlate the trees, RF randomly perturbs them: at each split in a tree, only a random subset of the input variables is considered as splitting variables. This further de-correlates the trees compared to bagging and results in a more generalizable solution.
What are the computational advantages of Random Forest?
Fewer variables are considered at each split, which reduces the computational cost, and the trees are easy to train in parallel (though that is also true for bagging).
What is Boosting?
This method trains multiple “weaker” models to combine into one “strong” model. The idea is that the weaker models all can capture some small part of the relationship between input and output, thus combining them should capture most of the relationship.
The procedure:
1. Each weak model is trained on weighted data; after one tree is trained, the weights of the misclassified data points are increased. Thus, for the next tree, those data points will be of higher priority to classify correctly.
2. Combine all the weak trees into one model using a weighted majority vote (for regression, a weighted average).
Boosting can be used for most supervised learning algorithms
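A minimal sketch of the AdaBoost-style weight update in step 1, on a hypothetical weak model's output (labels in {-1, +1}):

```python
import numpy as np

# Hypothetical true labels and one weak model's predictions
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])        # point 2 is misclassified
w = np.full(len(y_true), 1 / len(y_true))    # start with uniform weights

err = np.sum(w[y_true != y_pred])            # weighted error of the weak model
alpha = 0.5 * np.log((1 - err) / err)        # this model's vote weight
w = w * np.exp(-alpha * y_true * y_pred)     # up-weight the mistakes
w /= w.sum()                                 # renormalize to a distribution

print(w)   # the misclassified point now carries more weight
```

The next weak model is trained against these new weights, and the per-model alphas become the weights in the final majority vote.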