The machine learning process Flashcards
What is feature engineering?
The process of crafting features from existing columns using domain knowledge and intuition.
Give an example of where feature engineering might improve a machine learning model
E.g. prediction of life satisfaction using GDP. Intuitively, there are diminishing returns, so the relationship is more likely logarithmic: a doubling of wealth leads to a fixed increase in happiness. So use log(GDP) as the feature.
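A minimal sketch of this transform (synthetic GDP values, numpy assumed):

```python
import numpy as np

# Hypothetical GDP-per-capita values (synthetic, for illustration)
gdp = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)

# Engineered feature: log(GDP). Each doubling of GDP adds a constant
# log(2) to the feature, matching the diminishing-returns intuition.
log_gdp = np.log(gdp)
print(np.diff(log_gdp))  # every gap is log(2) ≈ 0.693
```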
Define polynomial regression.
The use of polynomial features: e.g. x^2, x^3 etc. Draws from Taylor's theorem: any analytic function has a convergent power-series expansion.
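A sketch using scikit-learn's PolynomialFeatures (assuming sklearn is available):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])  # one original feature

# Expand each sample to [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(x)
print(X_poly)
```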
What is linear truncation and how is it implemented?
Introduce a kink point where the trend in the data appears to change, then fit a linear regression with two slope weights: one for the data before the kink and one for the data after.
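One common implementation (a sketch, assuming the kink point k is known in advance) adds a hinge feature max(0, x - k), so OLS can fit a different slope after the kink:

```python
import numpy as np

def kink_features(x, k):
    """Design matrix for a kinked (piecewise-linear) fit: intercept,
    baseline slope, and a hinge term that adds a second slope after
    x passes the kink point k."""
    return np.column_stack([np.ones_like(x),          # intercept
                            x,                        # slope before k
                            np.maximum(0.0, x - k)])  # extra slope after k

# Synthetic data with a genuine kink at x = 5 (no noise)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 3.0 * np.maximum(0.0, x - 5.0)

coef, *_ = np.linalg.lstsq(kink_features(x, k=5.0), y, rcond=None)
print(coef)  # ≈ [0, 2, 3]: slope 2 before the kink, 2 + 3 = 5 after
```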
How can you deal with multiple categorical variables?
One-hot encoding: a binary indicator column for each category. I.e. IsDog, IsCat and IsMouse.
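A sketch with pandas get_dummies on hypothetical pet data:

```python
import pandas as pd

# Hypothetical pet data
df = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog"]})

# One binary indicator column per category
onehot = pd.get_dummies(df["animal"], prefix="Is")
print(onehot.columns.tolist())  # ['Is_cat', 'Is_dog', 'Is_mouse']
```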
Explain feature selection
The process of selecting a good subset of features to use in a model.
How might you manually select features?
If features are standardized, OLS coefficients might indicate how important each feature is.
We could look for features that are correlated with the outcome.
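A sketch (synthetic data, hypothetical column names) of screening features by their correlation with the outcome:

```python
import numpy as np
import pandas as pd

# Synthetic data: "signal" drives the outcome, "noise" does not
rng = np.random.default_rng(0)
df = pd.DataFrame({"signal": rng.normal(size=100),
                   "noise": rng.normal(size=100)})
df["outcome"] = 2 * df["signal"] + rng.normal(scale=0.1, size=100)

# Correlation of each feature with the outcome
corr = df.corr()["outcome"].drop("outcome")
print(corr)
```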
Explain best subset selection.
Find the subset of features such that the generalisation error is minimised.
What are the steps of the forward subset selection algorithm?
Start with a constant model. At each step, add the single feature whose inclusion gives the greatest risk reduction, and store the resulting model. This yields a sequence of candidate models; select the one with the lowest generalisation error.
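The steps above can be sketched as follows (greedy forward selection on synthetic data, with cross-validated R^2 standing in for the risk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features):
    """Greedy forward selection: at each step add the feature whose
    inclusion gives the best cross-validated score, recording the
    chosen subset after each addition."""
    selected, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(n_features):
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path  # sequence of candidate models to compare

# Synthetic data where only features 0 and 2 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=100)

path = forward_select(X, y, n_features=2)
print(path)
```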
In machine learning it is common to split your data into three, what are each of these subsets called and what are they used for?
Training: this set is used to fit the model by minimising the risk given the labels.
Validation: this set is unseen in the training stage, but is used to tune hyperparameters (e.g. number of features used).
Test: this set is used to measure model metrics such as accuracy, loss etc.
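A sketch of a 60/20/20 split using two calls to scikit-learn's train_test_split (the ratios are an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Carve off the test set first, then split the remainder into
# training and validation sets (60/20/20 overall)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```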
What is a hyperparameter?
A hyperparameter is a model parameter that is set before training rather than learned from the data. For example: learning rate, tree depth etc. It should be tuned for optimal generalisation performance.
Explain the idea of k-fold cross-validation
Useful for hyperparameter tuning when the data set is small. Divide the data into k partitions of similar size. For each partition, train the model on all data not in it, then evaluate the model on the held-out partition. Report statistics for the metrics over the k partitions.
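A sketch of 5-fold cross-validation by hand with scikit-learn's KFold (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=30)

# Each of the 5 partitions is held out once while training on the rest
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Report summary statistics of the metric over the partitions
print(f"R^2: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```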
Why is it important to consider the sampling for splitting the data?
Training data should be representative of the population. Sampling might be necessary to avoid over/under representation among classes.
What is stratified sampling?
Data is grouped into classes, and points are sampled from each class and recombined into training and test sets.
How can this be implemented on multiple columns?
Create a new column from combinations of the stratifying features, then stratify on that.
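A sketch (hypothetical columns): concatenate the values into a single key and stratify on it:

```python
import pandas as pd

# Hypothetical columns to stratify on jointly
df = pd.DataFrame({"sex": ["M", "F", "M", "F"],
                   "region": ["N", "N", "S", "S"]})

# Combine them into a single stratification key
df["stratum"] = df["sex"] + "_" + df["region"]
print(df["stratum"].tolist())  # ['M_N', 'F_N', 'M_S', 'F_S']
```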
How can it be implemented on a continuous variable?
The data can be grouped by quantiles and stratified.
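A sketch using pandas qcut to bin a continuous variable into quartiles before a stratified split (synthetic data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=200))

# Bin the continuous variable into quartiles, then stratify on the bins
bins = pd.qcut(income, q=4, labels=False)
train, test = train_test_split(income, test_size=0.25,
                               stratify=bins, random_state=0)

# Each quartile ends up (almost) equally represented in the test set
counts = bins.loc[test.index].value_counts()
print(counts)
```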
What is the R-squared and for what type of model can it be used as an evaluation metric?
1 - (residual sum of squares over total sum of squares). This can be used on regression models.
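A direct implementation of the formula (toy numbers):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - rss / tss

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y_true, y_true))            # perfect fit: 1.0
print(round(r_squared(y_true, y_pred), 3))  # 0.98
```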
What are the disadvantages to using average log loss and accuracy score to evaluate classification models?
These errors do not illuminate how the model is misclassifying, which could be important information depending on the business problem.
What does the confusion matrix show?
The total true negatives, false negatives, false positives and true positives
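A sketch with scikit-learn's confusion_matrix (toy labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 3
```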
Which parts of the confusion matrix might be more important in spam classification? What about cancer screening?
In spam classification, we care more about minimising FP (real mail marked as spam) than FN, because losing a genuine email is worse than seeing the occasional spam in your inbox.
Cancer screening: we minimise FN over FP, because being sent for further tests without actually having cancer is much preferable to missing a sign of cancer.
What are the true positive/true negative rates?
TPR = TP / (all positives) = TP / (TP + FN)
TNR = TN / (all negatives) = TN / (TN + FP)
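These rates drop straight out of the confusion-matrix counts (hypothetical counts):

```python
def tpr_tnr(tn, fp, fn, tp):
    """True positive rate (sensitivity) and true negative rate (specificity)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical confusion-matrix counts
tpr, tnr = tpr_tnr(tn=50, fp=10, fn=5, tp=35)
print(tpr, tnr)  # 0.875, ~0.833
```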
For classification prediction functions, what is the cutoff hyperparameter?
For binary classification, a prediction is a probability of membership of each class. The cut-off is the probability above which a point is assigned a positive result.
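A minimal sketch (hypothetical probabilities):

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
probs = np.array([0.2, 0.55, 0.7, 0.4])

cutoff = 0.5  # the hyperparameter
predictions = (probs > cutoff).astype(int)
print(predictions)  # [0 1 1 0]
```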
What is an ROC curve?
This is a curve in TPR-TNR space (equivalently the conventional TPR-FPR space, since FPR = 1 - TNR), parameterised by the cut-off, for a given classification model.
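A sketch with scikit-learn's roc_curve, which reports FPR = 1 - TNR (toy scores):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# roc_curve reports FPR; TNR = 1 - FPR
fpr, tpr, thresholds = roc_curve(y_true, scores)
tnr = 1 - fpr
print(np.column_stack([tpr, tnr]))  # one point per cut-off
```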
How does the ROC change under rescaling of the scores/probabilities?
The curve is unchanged under any monotonic rescaling: the parameterisation by the cut-off changes, but the curve occupies the same points in TPR-TNR space.