Python ML Principles Flashcards

Learn the main steps and sub-steps of ML.

1
Q

What are the four main steps in ML?

A

Visualisation
Cleaning and Transformation
Construction of ML model
Evaluation of ML model

2
Q

What are the two main sub-steps of the Cleaning and Transformation step?

A

1) Data Preparation & Cleaning

2) Feature Engineering

3
Q

What should you do before starting Preparation & Cleaning?

A

Explore the data to understand the issues that are present.

4
Q

What are the six sub-steps of the Data Preparation & Cleaning step?

A
1. Recode character strings to eliminate unrecognised characters
2. Find and treat missing values
3. Set the correct data type for each column
4. Transform categorical features to increase the number of cases per category
5. Apply transformations to numeric features to improve their distributions
6. Manage duplicates
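
These six sub-steps can be sketched on a toy pandas DataFrame (the column names and values below are hypothetical, chosen only to illustrate each step):

```python
import numpy as np
import pandas as pd

# Toy data illustrating the six cleaning sub-steps (hypothetical columns)
df = pd.DataFrame({
    "city": ["L?ndon", "Paris", "Paris", "Paris", None],
    "price": ["10", "100", "1000", "1000", "55"],
})

# 1. Recode character strings to remove unrecognised characters
df["city"] = df["city"].str.replace("?", "o", regex=False)

# 2. Find and treat missing values
df["city"] = df["city"].fillna("Unknown")

# 3. Set the correct data type for each column
df["price"] = df["price"].astype(float)

# 4. Combine rare categories so each has enough cases
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city"] = df["city"].replace(dict.fromkeys(rare, "Other"))

# 5. Transform skewed numeric features to improve their distribution
df["log_price"] = np.log(df["price"])

# 6. Manage duplicates
df = df.drop_duplicates()
```
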
5
Q

What’s another name for “transformation to improve distribution”?

A

Feature engineering

6
Q

Name a common transformation.

A

Log

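
For example, a log transform pulls in the long right tail of skewed data (toy numbers below):

```python
import numpy as np

# Right-skewed data (e.g. incomes): a few large values stretch the tail
values = np.array([1.0, 2.0, 3.0, 5.0, 10.0, 100.0, 1000.0])

# A log transform compresses the long right tail
logged = np.log(values)

# Skew shrinks: the mean moves much closer to the median
print(np.mean(values), np.median(values))   # mean >> median before
print(np.mean(logged), np.median(logged))   # mean ~ median after
```
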
7
Q

What is the main thing we’re trying to achieve with Feature Engineering?

A

We’re trying to achieve distinct separation of the labelled cases, indicating better prediction.

8
Q

What is a test used to evaluate linear regression accuracy?

A

Sum of squared errors.

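
Sum of squared errors, computed on hypothetical predictions:

```python
import numpy as np

# Actual and predicted values from a hypothetical regression
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# Sum of squared errors: square each residual, then add them up
residuals = y_true - y_pred
sse = np.sum(residuals ** 2)
print(sse)  # 1.0
```
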
9
Q

Why is linear regression sometimes called Least Squares Regression?

A

Because it creates a line that minimises the sum of squared errors (residuals) from the line.

10
Q

With linear regression in scikit-learn, what Python package should you use for your arrays?

A

NumPy

11
Q

What are the four main steps for linear regression with scikit-learn?

A

1) Lay out the data as numpy arrays
2) Scale the features
3) Specify the model object
4) Fit the model

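
A minimal sketch of those four steps on toy data (here `StandardScaler` handles the scaling step):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# 1) Lay out the data as numpy arrays (features must be 2-D: samples x features)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# 2) Scale the features
X_scaled = StandardScaler().fit_transform(X)

# 3) Specify the model object
model = LinearRegression()

# 4) Fit
model.fit(X_scaled, y)
print(model.coef_, model.intercept_)
```
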
12
Q

What visualisation could you use to evaluate residuals of a regression model?

A

A histogram of residuals.

13
Q

What two residuals-histogram patterns indicate an accurate model?

A
1. Clustering of residuals around zero.
2. A normal distribution.

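
Without plotting, both patterns can be checked numerically; here on simulated residuals from a hypothetical well-fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Residuals from a hypothetical well-fitted model: roughly normal, centred on zero
residuals = rng.normal(loc=0.0, scale=1.0, size=1000)

# Bin the residuals as a histogram would
counts, edges = np.histogram(residuals, bins=20)

# Clustering around zero: the fullest bin should sit near the centre
peak_bin = np.argmax(counts)
peak_centre = (edges[peak_bin] + edges[peak_bin + 1]) / 2
print(peak_centre)
```
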
14
Q

If you see a multimodal residuals histogram what should you do about the non-zero modes?

A

Investigate what’s creating them and consider adding these features to your model.

15
Q

In scikit-learn, what is one-hot encoding?

A

Conversion of a categorical feature with multiple options into a numpy array of binary columns, where exactly one column per record shows a “1”.

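
A small sketch with scikit-learn's `OneHotEncoder` (hypothetical colour feature):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical feature with three options
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colours).toarray()

print(encoder.categories_)  # categories sorted: blue, green, red
print(encoded)              # each record: 0s with exactly one 1
```
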
16
Q

In classification ML, what is the space between the two outcomes called?

A

The decision boundary.

The decision function should be zero on the boundary, with the two outcomes (e.g. Yes/No) giving positive and negative values.

17
Q

What is a loss function?

A

A function used when training a classifier that describes how heavily incorrect labels are penalised.

18
Q

Name four ways to check if your model is CRAP.

A

Confusion matrix (TP,FP,TN,FN)
ROC curves
Accuracy/misclassification error
Precision/Recall/F1
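
All four checks are available in `sklearn.metrics`; a sketch on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# Confusion matrix: rows = true class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)

print(accuracy_score(y_true, y_pred))   # fraction classified correctly
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```
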

19
Q

What is on the y-axis and what on the x-axis of an ROC curve?

A
y-axis = TPR
x-axis = FPR
20
Q

Name 4 techniques to address imbalanced data.

A

1) Undersample the majority class
2) Oversample the minority class
3) Case weights
4) Impute (synthesise) new minority cases
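
Technique 3 (case weights) is the easiest to show in scikit-learn, via the `class_weight` parameter (toy imbalanced data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Imbalanced toy data: 190 majority cases, 10 minority cases
X = rng.normal(size=(200, 2))
y = np.zeros(200, dtype=int)
y[:10] = 1
X[:10] += 2.0  # shift the minority class so it is learnable

# class_weight="balanced" upweights minority cases during training
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)

preds = model.predict(X)
print(preds.sum())  # minority predictions are no longer drowned out
```
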

21
Q

Name one method of dimensionality reduction.

A

Principal component analysis.

22
Q

What information does Principal Component Analysis produce about the dataset features?

A

The proportion of the dataset’s variance explained by each principal component.
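
In scikit-learn this is exposed as `explained_variance_ratio_`; a sketch on toy data where one direction dominates:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated features plus one small noise feature
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(scale=0.1, size=(100, 1)),
])

pca = PCA()
pca.fit(X)

# Proportion of the dataset's variance explained by each component
print(pca.explained_variance_ratio_)
```
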

23
Q

What is the goal of regularisation?

A

Prevent overfitting of ML models.

24
Q

What is the tradeoff dilemma in Regularisation?

A

Bias vs Variance

25
Q

Regularisation reduces variance, but can introduce what?

A

Bias

26
Q

A model that has high variance and low bias is what?

A

Overfit

27
Q

The diagonal line on a Q-Q plot of residuals represents what?

A

A perfect Normal Distribution.

28
Q

What’s another name for L2 regularisation?

A

Ridge Regression

29
Q

What does L2 regularisation do to the coefficients?

A

It constrains them by driving coefficients close to (but not exactly) zero.
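
A quick comparison of ordinary least squares against `Ridge` on toy data shows the shrinkage:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Noisy linear data: only the first three coefficients matter
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 2.0, 3.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# L2 shrinks the coefficients towards (but not exactly to) zero
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())
```
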

30
Q

What are two other names for L1 regularisation?

A

Lasso method.

Manhattan norm.
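
By contrast with L2, the Lasso can drive coefficients exactly to zero, effectively selecting features (toy data with irrelevant features):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Only the first two of five features actually matter
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)

# L1 zeroes out the irrelevant coefficients entirely
print(lasso.coef_)
```
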

31
Q

What is k-fold cross-validation?

A

Splitting the data into k subsets (folds), then training k times, each time holding one fold back for testing and training on the rest.
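
In scikit-learn this is a one-liner with `cross_val_score` (toy data; 5 folds assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy linear data with small noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# 5-fold CV: train 5 times, each time testing on the held-out fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)        # one R^2 score per fold
print(scores.mean())
```
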

32
Q

With decision trees, what kind of split creates less entropy?

A

A split where the outcome probabilities in each branch are very uneven.

(Entropy is maximal, = 1, when p = 0.5.)
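
The binary entropy formula makes this concrete:

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits for a class probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# Even split: maximum uncertainty
print(entropy(0.5))   # 1.0

# Very uneven split: much lower entropy
print(entropy(0.95))
```
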