Model Concepts Flashcards

(43 cards)

1
Q

What is model complexity?

A

A measure of how well the model can capture underlying patterns in the data

2
Q

Vector/linear regression models often measure model complexity as…

A

The polynomial degree

3
Q

Machine learning models often measure model complexity as…

A

The number of parameters in the model

4
Q

A good time to stop increasing model complexity is when…

A

Cross validation error starts to increase

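A minimal sketch of this idea in Python (the toy data and scikit-learn pipeline below are illustrative assumptions, not from the deck): fit models of increasing polynomial degree and keep the degree at which cross-validation error stops improving.

```python
# Sketch: pick the polynomial degree at which cross-validation error stops improving.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy toy data

cv_errors = []
for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cross_val_score returns negative MSE by convention, so flip the sign to get an error
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    cv_errors.append(mse)

best_degree = int(np.argmin(cv_errors)) + 1  # the complexity just before CV error starts to rise
print(best_degree, cv_errors)
```
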
5
Q

What is bias?

A

The tendency to miss or be inaccurate

6
Q

What is variance?

A

The tendency to be inconsistent

7
Q

If we have high bias in the model, it will fail to…

A

Accurately capture the relationship between the features and the outcome variable; it will, however, be wrong consistently

8
Q

If we have high variance in the model, but low bias, the model will…

A

Properly identify the relationship between the features and the outcome variable, but it will also fit random noise in the data

9
Q

What is irreducible error?

A

Random noise in the data points that no model can capture, regardless of its complexity; it is typically present in real-world data

10
Q

What is the bias-variance tradeoff?

A

Model adjustments that decrease bias often increase variance, and vice versa; this tradeoff is therefore closely tied to the complexity tradeoff

11
Q

Lower degrees of complexity cause [bias/variance], while higher degrees cause [bias/variance].

A

Bias, variance

12
Q

What is shrinkage/regularisation?

A

Adding a small, adjustable regularisation parameter to the cost function that contributes a penalty proportional to the size of the model parameters, thereby penalising more complex models

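As a rough illustration (plain NumPy, with hypothetical names), a regularised cost function adds a penalty scaled by an adjustable parameter alpha to the ordinary squared-error loss:

```python
# Sketch of a regularised cost function: squared-error loss plus a penalty
# proportional to the size of the weights, scaled by an adjustable parameter alpha.
import numpy as np

def regularised_cost(w, X, y, alpha):
    residuals = X @ w - y
    data_loss = np.mean(residuals ** 2)   # ordinary least-squares loss
    penalty = alpha * np.sum(w ** 2)      # L2 (ridge-style) penalty on the weights
    return data_loss + penalty
```
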
13
Q

What issue does regularisation solve, and why?

A

The bias-variance tradeoff: a higher regularisation strength produces a simpler model, thereby adding bias, while weaker regularisation leaves the model more complex, adding variance

14
Q

What is ridge regression (or L2 regularisation)?

A

The penalty is applied proportionally to squared coefficient values

15
Q

How can we find the best regularisation parameter?

A

Cross-validation: evaluate each candidate regularisation parameter across the folds and choose the one with the lowest validation error

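One possible way to do this with scikit-learn is RidgeCV, which scores each candidate regularisation strength by cross-validation; the alpha grid and toy data below are illustrative.

```python
# Sketch: choose the regularisation strength by cross-validation over a grid of candidates.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)
model.fit(X, y)
print(model.alpha_)  # the regularisation parameter with the lowest cross-validation error
```
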
16
Q

What is LASSO (or L1 regularisation)?

A

The penalty is applied proportionally to absolute coefficient values

17
Q

What is the difference between L1 and L2 regularisation?

A

Both shrink coefficients towards zero, but in different ways - L2 shrinks them smoothly and rarely makes them exactly zero, while L1 can drive some coefficients exactly to zero, producing sparse models

18
Q

Regularisation can perform feature selection by…

A

Shrinking some features’ contributions to zero
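
A small sketch of that effect, using scikit-learn's Lasso on illustrative toy data: some fitted coefficients come out exactly zero, effectively removing those features.

```python
# Sketch: L1 (LASSO) regularisation shrinking some coefficients exactly to zero,
# which effectively removes those features from the model.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Only 3 of the 10 features are actually informative in this toy data.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of features with non-zero coefficients
print(model.coef_)
print(selected)
```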

19
Q

What is feature selection?

A

Selecting only the most important features from your data, deleting the rest

20
Q

How can we perform efficient feature selection via cross-validation?

A

Removing features one at a time and measuring cross-validated predictive performance - if removing a feature improves the results or leaves them unchanged, that feature can be dropped (see the sketch below)
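
A simplified sketch of this procedure (a greedy backward elimination; the toy data and helper below are illustrative assumptions):

```python
# Sketch: drop one feature at a time and keep the drop if cross-validated error
# does not get worse. A greedy, simplified form of backward feature elimination.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=8, n_informative=4, noise=5.0, random_state=0)

def cv_error(X, y):
    return -cross_val_score(LinearRegression(), X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

kept = list(range(X.shape[1]))
baseline = cv_error(X[:, kept], y)
for feature in list(kept):
    if len(kept) == 1:
        break                        # always keep at least one feature
    trial = [f for f in kept if f != feature]
    error = cv_error(X[:, trial], y)
    if error <= baseline:            # removing the feature didn't hurt (or helped)
        kept, baseline = trial, error
print(kept, baseline)
```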

21
Q

What is gradient descent?

A

An iterative approach to fitting a machine learning model by repeatedly adjusting its weights based on the loss calculated by the loss function

22
Q

Why is gradient descent better than grid search and random sampling for finding optimal parameters?

A

Random sampling and grid search evaluate many parameter settings blindly and scale poorly as the number of parameters grows (grid search simply covers the space more uniformly), whereas gradient descent uses the gradient to move directly towards lower loss

23
Q

How does gradient descent minimise the loss of a model?

A

We start at a random point in parameter space and calculate the error. We then adjust the parameters by stepping in the direction of the negative gradient of the error with respect to the parameters
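
A minimal NumPy sketch of that procedure for linear regression (the toy data and learning rate are illustrative):

```python
# Sketch: gradient descent on a linear regression loss. Start from random weights,
# then repeatedly step in the direction of the negative gradient of the error
# with respect to the parameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = rng.normal(size=3)                        # random starting point in parameter space
learning_rate = 0.1
for _ in range(500):
    residuals = X @ w - y
    gradient = 2 * X.T @ residuals / len(y)   # d(MSE)/dw
    w -= learning_rate * gradient
print(w)  # should end up close to true_w
```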

24
Q

What is L1 and L2 norm?

A

Two methods of calculating the size of an error vector: L1 is the sum of absolute errors, and L2 (also known as the Euclidean distance) is the square root of the sum of squared errors.

25
Q

What is meant by Lp norm?

A

The Lp norm generalises L1 and L2: it is the p-th root of the sum of the absolute errors raised to the p-th power, so L3 uses cubed errors, L4 uses fourth-power errors, and so on

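A short NumPy illustration of the L1, L2 and general Lp norms of an error vector (the values are illustrative):

```python
# Sketch: L1, L2 and general Lp norms of an error vector.
import numpy as np

errors = np.array([1.0, -2.0, 3.0])

l1 = np.sum(np.abs(errors))                    # sum of absolute errors
l2 = np.sqrt(np.sum(errors ** 2))              # Euclidean: root of the sum of squared errors
p = 3
lp = np.sum(np.abs(errors) ** p) ** (1 / p)    # general Lp: p-th root of summed p-th powers

# The same values via numpy's built-in norm:
assert np.isclose(l1, np.linalg.norm(errors, 1))
assert np.isclose(l2, np.linalg.norm(errors, 2))
assert np.isclose(lp, np.linalg.norm(errors, 3))
```
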
26
Q

The minimum error of a gradient descent graph is the point where...

A

The partial derivatives in all dimensions are exactly zero

27
Q

What is a confusion matrix?

A

A diagram used to visualise the accuracy of a classifier

28
Q

How is a confusion matrix constructed?

A

Predicted values are placed along the y axis and true values along the x axis; the cells count the true and false positives and negatives

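A small sketch of counting the four cells for a binary classifier (labels and predictions are illustrative; note that which axis holds predictions vs. truth varies by convention):

```python
# Sketch: counting the four cells of a 2x2 confusion matrix for a binary classifier.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # ground truth (illustrative)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model predictions (illustrative)

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

confusion = np.array([[tn, fp],
                      [fn, tp]])              # rows: true class, columns: predicted class
print(confusion)
```
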
29
Q

What is precision?

A

The probability that, given a positive prediction, the example is actually positive

30
Q

What is recall?

A

The probability that, given an actually positive example, we will correctly predict positive

31
Q

How is precision calculated?

A

The number of true positives divided by the total number of predicted positives

32
Q

How is recall calculated?

A

The number of true positives divided by the number of positives in the ground truth data

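A tiny sketch of both calculations from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: precision and recall from confusion-matrix counts.
tp, fp, fn = 30, 10, 5       # illustrative counts

precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of all actual positives, how many we found
print(precision, recall)     # 0.75, ~0.857
```
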
33
Q

High recall and low precision implies that...

A

Most actual positives are correctly identified, but many negatives are also wrongly predicted as positive

34
Q

Low recall but high precision implies...

A

We predict negative a lot (missing many actual positives), but when we do predict positive, we are usually correct

35
Q

What is F1 score?

A

A combination of precision and recall, with a high F1 score implying a good balance between the two

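For reference, F1 is the harmonic mean of precision and recall; a tiny illustrative calculation:

```python
# Sketch: F1 as the harmonic mean of precision and recall.
precision, recall = 0.75, 0.857                       # illustrative values
f1 = 2 * precision * recall / (precision + recall)
print(f1)                                             # ~0.80
```
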
36
Q

What is sensitivity?

A

The proportion of actual positives that are correctly identified as positive

37
Q

What is specificity?

A

The proportion of actual negatives that are correctly identified as negative

38
Q

How is specificity calculated?

A

The number of true negatives divided by the number of negatives in the ground truth data (TN / (TN + FP))

39
Q

How is sensitivity calculated?

A

Exactly the same as recall: the number of true positives divided by the number of positives in the ground truth data (TP / (TP + FN))

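A tiny sketch of both measures from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: sensitivity and specificity from confusion-matrix counts.
tp, fn, tn, fp = 30, 5, 50, 10     # illustrative counts

sensitivity = tp / (tp + fn)       # true positive rate; identical to recall
specificity = tn / (tn + fp)       # true negative rate
print(sensitivity, specificity)    # ~0.857, ~0.833
```
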
40
Q

What is a ROC curve?

A

A curve that combines sensitivity and specificity across classification thresholds, with a higher curve implying better performance overall

41
Q

What is the shape of an ideal ROC curve?

A

An exact right angle, implying that the area under the curve is exactly 1

42
Q

How do we construct a ROC curve?

A

At each classification threshold, plot the sensitivity on the y axis against 1 minus the specificity on the x axis

43
Q

How can we compare a ROC curve to another?

A

Calculate the area under each curve (AUC); a ROC curve with an AUC of 1 is optimal
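
A minimal NumPy sketch that builds a ROC curve by sweeping a threshold over predicted scores and then computes the AUC with a trapezoidal sum (the labels and scores are illustrative):

```python
# Sketch: build a ROC curve by sweeping a decision threshold over predicted scores,
# then compare classifiers by the area under the curve (AUC).
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # illustrative labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])   # predicted probabilities

tpr, fpr = [], []
for threshold in np.linspace(1.0, 0.0, 101):                   # from strict to lenient
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr.append(tp / (tp + fn))                                  # sensitivity
    fpr.append(fp / (fp + tn))                                  # 1 - specificity

fpr, tpr = np.array(fpr), np.array(tpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)          # trapezoidal area under the curve
print(auc)                                                      # 1.0 would be optimal
```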