Machine Learning Flashcards by Unknown Unknown

What two different types of problems exist in ML?

Regression and classification.
Regression is when the output we want to predict is quantitative, for instance a temperature.
Classification is when the output is qualitative, for instance a label from {Healthy, Sick}.

What is Ridge Regression?

Also known as L2-regularized regression. A regularization technique to prevent overfitting. It adds a penalty on the squared size of the weights to the loss, which keeps the weights from becoming too big.
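
The shrinking effect can be seen in a minimal sketch. It assumes a single input feature and no intercept, so the penalized least-squares weight has a simple closed form. The function name `fit_1d` and the penalty strength `lam` are illustrative assumptions, not part of the card.

```python
def fit_1d(xs, ys, lam=0.0):
    """Weight w minimizing sum((y - w*x)**2) + lam * w**2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_ols = fit_1d(xs, ys)            # lam = 0: ordinary least squares
w_ridge = fit_1d(xs, ys, 14.0)    # lam > 0: the weight is pulled toward 0
print(w_ols, w_ridge)
```

A larger `lam` gives a smaller weight; with `lam = 0` the fit is plain least squares.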

What is the Confusion Matrix?

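It is a way to visualize the results from classification in supervised ML, most often shown for the binary case. It is a table of the predicted labels against the true labels, where each cell counts one combination: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).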
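
As a small illustration, the four cells of a binary confusion matrix can be counted like this. The function name `confusion_counts` and the toy labels are made up for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]   # toy true labels
y_pred = [1, 0, 0, 1, 1, 0]   # toy predicted labels
cm = confusion_counts(y_true, y_pred)
print(cm)
```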

What is Logistic regression?

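A model for (often binary) classification. It passes a weighted sum of the inputs through the logistic (sigmoid) function to get a class probability, and the weights are typically found by maximizing the likelihood of the training labels.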

What is the difference between parametric and non-parametric models?

A parametric model has a fixed set of parameters that are estimated from the training data. Once trained, the function (model) defined by those parameters is all that is needed to predict on new data, so the training data is no longer needed. Examples of parametric models are linear regression and logistic regression.

Non-parametric models use the training data directly to relate new inputs to outputs instead of compressing it into a fixed set of parameters, e.g. k-NN. These models grow with the data, but they can easily overfit and be computationally heavy.

What is k-NN?

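k-Nearest Neighbors is a simple non-parametric classifier where the class of a new point is determined by a majority vote among its k nearest training points. k controls the bias-variance trade-off: a small k gives low bias but high variance, a large k the opposite.

It needs a lot of data to represent the real world.

It does not scale well to high dimensions, since it relies on distances and these become less informative as the dimension grows (the curse of dimensionality).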
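
A minimal sketch of the majority vote, assuming Euclidean distance and a tiny made-up dataset. The function name `knn_predict` is illustrative.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort all training points by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

Every prediction scans the whole training set, which is why the method gets heavy as the data grows.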

What is the difference between a discriminative model and a generative model?

A discriminative model describes the output y given the input x directly, i.e. p(y|x).

A generative model describes how both the inputs and the outputs are generated, i.e. p(x,y).

What is k-fold cross validation and what benefits are there with the method?

It is a way to evaluate the model during training. Separate the training data into k parts (folds) and run the training k times, where in each run one of the k folds is held out and used as the validation set. Then take the average of all k validation results to evaluate the model. This gives a more reliable evaluation since several different validation sets are used, and all the data is used for training at some point. Note that there might still be a need to keep a completely separate test set.
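
The splitting and averaging can be sketched as follows. The helper names `k_fold_indices` and `cross_validate` are illustrative, and `fit_and_score` stands for any user-supplied function that trains on the training indices and returns a validation score. In practice the data is usually shuffled before splitting, which this sketch leaves out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); every sample is in a validation fold exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score over the k runs."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```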

What does the model complexity mean?

The model's ability to adapt to patterns in the data. High complexity means it can fit the training data more closely than a model with lower complexity.
When the complexity increases, the generalization gap (the difference between the error on the training data and the error on test data) tends to increase, because the more flexible model can overfit more.

What is bias and variance in Machine Learning?

The expected error on new data (E) can be split into the squared bias plus the variance, plus irreducible noise in the data.

The bias is the part of E that is due to the model not being able to represent the true function. This is often due to underfitting (i.e. the model is too simple to show the true relation) or the data being biased, thus misleading.

The other part of E is the variance, which is due to the model's sensitivity to the variability in the training dataset. A model with high variance changes a lot when trained on a different sample, which is typical of overfitting.

What does CART stand for?

Classification And Regression Trees.
The depth of the tree determines the flexibility/complexity of the tree. A deep tree has split the input space into many regions -> low bias but large variance.

What is Bagging and why is it better than CART?

The idea here is to train N decision trees and then average their results to get the final model; this is called Bagging (bootstrap aggregating). Averaging many high-variance trees reduces the variance compared to a single CART tree.

The problem is that we rarely have N different datasets, thus we use bootstrapping. In statistics, bootstrapping refers to resampling with replacement from a set to create new datasets (usually of the same size as the original) that still represent the same distribution.
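
A rough sketch of the resample-then-average idea. To stay short it uses a trivial stand-in learner (`fit_mean`, which just predicts the mean of its sample) instead of a decision tree. All function names here are illustrative assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Resample with replacement; same size as the original dataset."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, fit, n_models, seed=0):
    """Train one model per bootstrapped dataset."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Average the predictions of all models."""
    return sum(m(x) for m in models) / len(models)


def fit_mean(sample):
    # Stand-in for a real learner such as a tree: always predicts the sample mean.
    mean = sum(sample) / len(sample)
    return lambda x: mean


data = [1.0, 2.0, 3.0, 4.0, 5.0]
models = bagging_fit(data, fit_mean, n_models=10)
print(bagging_predict(models, x=None))
```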

What is Random Forest and how does it compare to CART and Bagging?

RF is another algorithm that is based upon bagging. In Bagging the trees are correlated, since their bootstrapped datasets overlap, and this correlation diminishes the decrease in variance gained from averaging the trees.

In order to de-correlate the trees, RF randomly perturbs each tree: for each split in a tree, only a random subset of the inputs is considered as splitting variables. This de-correlates the trees further than bagging alone and results in a more generalizable solution.

What are the computational advantages of Random Forest?

Fewer variables are handled at each split during training, which reduces the computational cost, and it is also easy to run in parallel (though this is also true for bagging).

What is Boosting?

This method trains multiple “weaker” models and combines them into one “strong” model. The idea is that each weak model can capture some small part of the relationship between input and output, thus combining them should capture most of the relationship.

The procedure:
1. Each weak model is trained on weighted data, and after one model (e.g. a tree) is trained the weights of the misclassified data points are increased. Thus for the next model those data points will be of higher priority to classify correctly.
2. Combine all the weak models into one model using a weighted majority vote (for regression, a weighted combination of the predictions).

Boosting can be used with most supervised learning algorithms.

What are the differences between Bagging and Boosting?

Bagging learns models in parallel, each on its own bootstrapped dataset.
Boosting trains models sequentially, and each model tries to improve on the mistakes of the previous model.

The nature of bagging (random sampling) reduces overfitting, and the resulting prediction is based on the average of all models; thus adding more models does not necessarily improve the result.

Boosting, on the other hand, trains on all the data and tries to improve each model on that data, thus it overfits easily. But adding more models can really improve the final result.

If the data is imbalanced, there is a bigger chance that a boosting algorithm picks up on this, since it tries to improve on the errors of previous models, whilst the imbalance can easily be missed in bagging.

Bagging methods are typically used on base learners that exhibit high variance and low bias, whereas boosting methods are used when low variance and high bias are observed.

What are Ensemble methods?

The umbrella term for methods that average or combine multiple models, for instance bagging or boosting methods.

What is Deep Learning?

A class of ML algorithms and models that stack multiple layers (neural networks, NN), each of which is a nonlinear transformation.

When and why is the output from a neural network passed through an activation function?

For classification a final activation function, often softmax, is used to convert the output to probabilities of the input belonging to each class.
For the binary case the logistic function is often used.
The target labels are often one-hot encoded, where a single 1 marks the class, for instance [0, 1, 0] means the input belongs to class 2 if the number of classes is 3.
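
A small sketch of the three pieces mentioned above, written with the standard library only. Note that `one_hot` takes a 0-based index, so class 2 out of 3 is index 1.

```python
import math


def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function for the binary case."""
    return 1.0 / (1.0 + math.exp(-z))


def one_hot(class_index, num_classes):
    """One-hot vector with a 1 at the 0-based class index."""
    return [1 if i == class_index else 0 for i in range(num_classes)]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(sigmoid(0.0))
print(one_hot(1, 3))
```

For two classes, softmax over the logits [z, 0] gives the same probability for the first class as the logistic function applied to z.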

What is Precision? What is Recall? What is the trade-off between these two?

A way to evaluate classification models.

Precision = TP/(TP+FP)

This is the ratio of true positives over all positive classifications. It indicates how many “unnecessary” positives are being classified, i.e. FP.

Recall = TP/(TP+FN)

Recall gives us the ratio of TP over all Positives in the data set. This number indicates how many positives are missed. Recall is also known as Sensitivity or True Positive Rate (TPR)

Recall and Precision are in a trade-off with each other: if we maximize Recall we will classify all P as P, which could lead to a lot of FP. This would reduce the Precision, and vice versa.
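
The trade-off can be illustrated by moving the decision threshold on some made-up scores. The function name `precision_recall` and the toy data are assumptions for the example.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are classified as positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

strict = precision_recall(labels, scores, 0.7)   # few positives predicted
loose = precision_recall(labels, scores, 0.2)    # many positives predicted
print(strict, loose)
```

With the strict threshold the precision is perfect but one positive is missed; with the loose threshold every positive is found at the cost of two false positives.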

Why is accuracy rarely a good choice of evaluation metric?

Accuracy = All Correct Classifications/All Classifications

i.e.:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

This gives an intuitive idea of how well a model is working. But it can be very misleading and might not capture everything we are looking for. For instance, if a data set is very unbalanced and there are many more Positives than Negatives, we get a high accuracy if the model classifies all Positives correctly and none of the Negatives correctly. This mistake would be reflected in the True Negative Rate, TN/(TN+FP), which would be 0.

Depending on the problem, it can also be more important to not mistakenly classify a Positive as a Negative; in such a situation Recall is a better evaluation metric since it takes FN into account.
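
A worked example of the imbalanced case, using made-up counts: 95 Positives, 5 Negatives, and a model that predicts Positive for everything.

```python
# Hypothetical counts for a model that classifies every point as Positive.
tp, fn = 95, 0    # all Positives are found
fp, tn = 5, 0     # every Negative is wrongly classified as Positive

accuracy = (tp + tn) / (tp + tn + fp + fn)
true_negative_rate = tn / (tn + fp)

print(accuracy)            # looks good although the model is useless on Negatives
print(true_negative_rate)  # exposes that no Negative is classified correctly
```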

What is the F1-score?

If both Recall and Precision are important one can use the F1-score to evaluate a model. The F1-score is the harmonic mean of the Precision and Recall. The harmonic mean is only high when both values are high, so maximizing the F1-score favours models that do well on Recall and Precision at the same time. This removes the difficulty of trying to balance the Recall and Precision by combining them into just one score for the model.

F1-Score = 2 * (Precision * Recall)/(Precision + Recall)

For imbalanced classes in the data there is the weighted F1-score, where you calculate the F1-score per class and then sum all scores weighted by each class's share of the samples.
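
A short sketch of both scores. The function names and the example numbers are illustrative, and `support` is assumed to hold the number of samples in each class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(per_class_f1, support):
    """Per-class F1 scores averaged with weights given by the class sizes."""
    total = sum(support)
    return sum(f * n for f, n in zip(per_class_f1, support)) / total


print(f1_score(1.0, 0.5))                  # pulled toward the lower value
print(weighted_f1([0.9, 0.5], [90, 10]))   # the large class dominates
```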

What is the ROC curve?

The Receiver Operating Characteristic curve shows the relationship between the True Positive Rate (Recall) and the False Positive Rate as the classification threshold is varied.

FPR = FP/(TN+FP)

The diagonal dotted line indicates a random classifier. We draw the curve by setting a threshold on the predicted probability: the model classifies a point as positive if its probability of being positive is greater than the current threshold. The threshold is 1.0 at the origin and goes to 0 at (1, 1). Near the origin the Recall is at its lowest, since the model classifies almost no points as Positive and FN is very large. At the same time the Precision tends to be high, since almost no points are classified as P and the number of FP is very close to 0.
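
The threshold sweep can be sketched like this, assuming both classes are present in the labels. The function name `roc_points` and the toy data are made up. The sketch uses `>=` for the threshold comparison so that each observed score can serve as a threshold.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points as the threshold is lowered from above the top score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]    # threshold above every score: nothing is positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(round(fpr, 2), round(tpr, 2))
```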

What is AUC?

Area Under Curve, a measurement of the model's performance. Calculate the area under the ROC curve: a value close to 1 indicates a model that classifies well, while 0.5 corresponds to a random classifier.
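
One way to compute the area is the trapezoidal rule over the (FPR, TPR) points of the curve. The function name `auc` and the two hand-written curves are illustrative.

```python
def auc(points):
    """Area under a curve given as (x, y) points sorted by x (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


random_classifier = [(0.0, 0.0), (1.0, 1.0)]               # the diagonal
perfect_classifier = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # straight up, then right

print(auc(random_classifier))
print(auc(perfect_classifier))
```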

What is Gradient Boost?

A sequential boosting algorithm where each weak model is fitted to the residual errors of the models before it. It can be used for classification and regression. It can benefit from regularization techniques to avoid overfitting.
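
A minimal regression sketch of the residual-fitting loop. It assumes squared error, 1-D inputs and single-split trees ("stumps") as the weak models. The names `fit_stump`, `gradient_boost`, `n_rounds` and `learning_rate` are choices made for this example and are not taken from the card.

```python
def fit_stump(xs, residuals):
    """Best single-split stump on 1-D inputs (needs at least 2 distinct x values)."""
    best = None
    for thr in sorted(set(xs))[:-1]:      # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

The learning rate below 1 is itself a simple form of regularization: each stump only corrects part of the remaining error.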

What is Online and Offline training? How does this affect the choices we as developers make?

Online training is when the model is continuously training while deployed. Offline is when we first train a model and then deploy it. For online training we must limit both the computational power and the computational time used, so there is a bigger need for efficiency. E.g. in games the FPS and performance could be greatly reduced by online training, thus offline is a better choice. We need to consider how fast the model responds during inference as well.

What is AdaBoost?

It's a boosting algorithm, Adaptive Boosting. As in boosting in general, all data points get a weight assigned; in the beginning they all have the same weight. After each model is trained the weights are updated s.t. the misclassified points are more important to get right for the next model. It is sequential since the weights for model N depend on the result from model N-1.
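
The weight update of one round can be sketched as below. The card gives no formulas, so this assumes the common discrete AdaBoost formulation with labels -1/+1; the function name `adaboost_round` is illustrative.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round: return the model's vote weight alpha and the updated point weights."""
    # Weighted error rate of the current weak model.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    err = min(max(err, 1e-10), 1 - 1e-10)     # avoid division by zero / log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for correct points (weight shrinks) and -1 for mistakes (weight grows).
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]


weights = [0.25, 0.25, 0.25, 0.25]    # all points start with the same weight
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]                # the last point is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified point carries half of the total weight, so the next model is pushed to get it right.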

What different regularization techniques are there in deep learning?

DropOut - Randomly "turn off"/deactivate units in the layers of the network during training. This should make the network generalize better and be more robust.
KL-penalty - Penalize too large changes to the model during an update.
L2-regularization - Penalize large weights in the loss function (aka Ridge regularization).

What different ways are there to do data augmentation?

Gaussian Noise - Add it to the data.
Goal-Based BC - Add a goal; this goal is randomly calculated from the input data.

Machine Learning Flashcards

(29 cards)