Machine learning Flashcards

(30 cards)

1
Q

What are the variable types? Draw the diagram

A

          Discrete                   Continuous
         /        \                      |
  Categorical    Numerical           Numerical
   /       \         |                   |
Nominal  Ordinal  Interval             Ratio

2
Q

Properties of attribute types

A

                Nominal  Ordinal  Interval  Ratio
Distinct          ✅       ✅        ✅       ✅
Order             ❌       ✅        ✅       ✅
Addition          ❌       ❌        ✅       ✅
Multiplication    ❌       ❌        ❌       ✅

3
Q

What is global standardisation?

A

Feature scaling that rescales every feature to the same mean and standard deviation (SD), so that features with large values do not dominate features with small values
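A minimal sketch of the idea, assuming z-score scaling (mean 0, SD 1) and the population SD. The function name `standardise` and the example columns are made up for illustration.

```python
from statistics import mean, pstdev

def standardise(column):
    """Rescale one feature column to mean 0 and SD 1 (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population SD; the sample SD is an equally valid choice
    if sigma == 0:
        # A constant feature carries no spread; map it to all zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Two features on very different scales (illustrative values).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 30000.0, 40000.0, 50000.0, 60000.0]

z_heights = standardise(heights_cm)
z_incomes = standardise(incomes)
```

After scaling, both columns share the same mean and SD, so neither dominates a distance calculation.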

4
Q

What is dimensionality reduction?

A

A technique that reduces the number of features in a dataset while preserving the relevant information (e.g. PCA)

5
Q

What is feature selection?

A

Selects the subset of original features that are most relevant and removes the rest. Reduces the dimensionality of the dataset without transforming the features that are kept
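One simple way to select features, shown here only as an example, is a variance filter: drop any feature whose values barely change. The function `select_by_variance`, the threshold and the toy data are assumptions, not part of the card.

```python
from statistics import pvariance

def select_by_variance(features, threshold):
    """Keep only the features (name -> list of values) whose variance exceeds threshold."""
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) > threshold
    }

# Toy dataset: "country_code" is constant, so it cannot help a model.
data = {
    "age": [21, 35, 48, 60],
    "country_code": [1, 1, 1, 1],
    "score": [0.2, 0.9, 0.4, 0.7],
}

selected = select_by_variance(data, threshold=0.0)
```

The surviving features are unchanged originals, which is what separates selection from extraction.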

6
Q

What is feature extraction?

A

Builds a reduced set of new, transformed features from the original ones, which lowers the dimensionality of the dataset (e.g. PCA)

7
Q

Benefits of feature selection

A

Simplifies the model, reduces the size of the dataset, can improve model accuracy, makes training more efficient and makes the model easier to interpret

8
Q

Most important things to check for in data cleaning

A

Outliers, missing values and duplicates

9
Q

Benefits of sampling

A

Allows quicker analysis when analysing the whole dataset is not feasible

10
Q

Types of sampling

A

Subsampling, sampling, resampling, random sampling and stratified random sampling

11
Q

What is subsampling?

A

Data reduction by selecting a subset of the original dataset

12
Q

What is sampling?

A

Creation of training and testing datasets

13
Q

What is resampling?

A

Repeatedly drawing samples from the dataset to estimate the characteristics of the whole dataset. Used to reduce bias (e.g. bootstrapping)

14
Q

What is random sampling?

A

Every item has the same chance of being picked. Two variants:
Without replacement (picking balls out of the bag and keeping them out)
With replacement (picking balls but putting them back)
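Both variants can be sketched with Python's standard `random` module. The bag of balls and the seed are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the draw is repeatable
balls = ["red", "green", "blue", "yellow", "white"]

# Without replacement: each ball can be drawn at most once.
without_replacement = rng.sample(balls, k=3)

# With replacement: every draw sees the full bag, so repeats are possible.
# Drawing 8 times from 5 balls guarantees at least one repeat.
with_replacement = rng.choices(balls, k=8)
```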

15
Q

What is stratified random sampling?

A

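The dataset is divided into groups (strata) based on relevant features, and a random sample is taken from each group so the sample keeps the same proportions as the whole dataset. Used to improve model performance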
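A rough sketch, assuming the strata are defined by a single label and the same fraction is drawn from each. `stratified_sample` and the spam/ham records are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, n))
    return sample

# Imbalanced toy data: 20 "spam" records and 80 "ham" records.
records = [("spam", i) for i in range(20)] + [("ham", i) for i in range(80)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
```

The 10% sample keeps the original 20:80 class ratio, which plain random sampling does not guarantee.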

16
Q

What is the holdout method?

A

Splitting the dataset into two parts: one part is held out for testing and the other part is used for training (test and train sets)
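A small sketch of a holdout split. The 30% test fraction and the shuffle before splitting are assumptions, not something the card specifies.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data, then hold out a fraction of it for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = holdout_split(list(range(10)), test_fraction=0.3)
```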

17
Q

What is k fold cross validation?

A

Evaluates model performance by dividing the dataset into k folds. One fold is used for testing and the remaining k-1 folds for training; the process is repeated k times so that each fold is used as the test set exactly once
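The fold rotation can be sketched as an index generator. This version assumes contiguous folds with no shuffling, and `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in exactly one test fold, and the k scores are usually averaged.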

18
Q

What is bootstrapping?

A

Creates many resampled copies of the dataset (generally more than 1,000) by random sampling with replacement. Used to reduce bias and make estimates more robust
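As an illustration, the sketch below bootstraps the mean of a small made-up sample and reads off an approximate 95% percentile interval. The data and the choice of the mean as the statistic are assumptions.

```python
import random
from statistics import mean

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each resample; every resample is drawn with replacement."""
    rng = random.Random(seed)
    return [mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
means = sorted(bootstrap_means(data))

# Approximate 95% percentile interval from the sorted bootstrap means.
ci_low, ci_high = means[25], means[974]
```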

19
Q

Types of model performance evaluation

A

Regression, classification and binary classification

20
Q

How do you do regression performance evaluation?

A

Mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE)
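The three metrics computed from scratch on made-up values, as a quick reference:

```python
from math import sqrt

def regression_errors(actual, predicted):
    """Return MSE, RMSE and MAE for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae}

# Errors here are 1, 0, -2 and 0.
metrics = regression_errors([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 9.0, 9.0])
```

Squaring makes MSE and RMSE punish the single large error more heavily than MAE does.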

21
Q

How do you do classification performance evaluation?

A

Compare the actual class value with the predicted class value, check the confidence of each prediction and build a confusion matrix
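A confusion matrix is just a count of (actual, predicted) pairs. A minimal sketch with invented labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) class pair occurs."""
    return Counter(zip(actual, predicted))

actual = ["cat", "dog", "bird", "cat", "dog", "bird"]
predicted = ["cat", "dog", "cat", "cat", "bird", "bird"]

cm = confusion_matrix(actual, predicted)

# Accuracy is the share of pairs on the diagonal (actual == predicted).
accuracy = sum(n for (a, p), n in cm.items() if a == p) / len(actual)
```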

22
Q

How do you do binary classification performance evaluation?

A

Treat the class of interest as the positive class and all other classes as negative, then count how many instances fall into each group
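A sketch of the positive/negative counting, with precision and recall derived from the counts. The spam/ham labels and the choice of metrics are illustrative.

```python
def binary_counts(actual, predicted, positive):
    """Count true/false positives and negatives for the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

actual = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham", "ham", "ham"]

tp, fp, fn, tn = binary_counts(actual, predicted, positive="spam")
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
```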

23
Q

How to evaluate model performance when there’s a class imbalance for both regression and classification?

A

Kappa statistic
ROC curve

24
Q

What does the kappa statistic do?

A

It adjusts accuracy for the correct predictions that would be expected by chance alone, which partially mitigates the effects of class imbalance
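A sketch assuming Cohen's kappa, (observed - expected) / (1 - expected), where "expected" is the agreement chance alone would give. The imbalanced toy labels are made up.

```python
from collections import Counter

def cohen_kappa(actual, predicted):
    """Accuracy corrected for agreement expected by chance.

    Undefined (division by zero) when the expected agreement is exactly 1.
    """
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# 90% of instances are "neg"; a classifier that always says "neg" scores
# 90% accuracy but has learned nothing.
actual = ["neg"] * 9 + ["pos"]
always_neg = ["neg"] * 10

kappa = cohen_kappa(actual, always_neg)
```

Kappa for the always-negative classifier is 0 despite its 90% accuracy, which is the imbalance correction the card describes.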

25
Q

K nearest neighbour benefits and disadvantages

A

Benefits: simple, easily explained and interpreted; fast in training; handles both classification and regression
Disadvantages: doesn't produce a model; slow in classification; difficult to select k
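A bare-bones k-NN classifier shows why training is fast and classification is slow: there is nothing to fit, and every prediction scans all stored points. Euclidean distance, majority voting and the toy points are assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # No model is built: every prediction scans the whole training set.
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, query=(2, 2))
near_b = knn_predict(points, labels, query=(8.5, 8.5))
```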

26
Q

SVM benefits and disadvantages

A

Benefits: high accuracy in prediction; compact model representation; not prone to overfitting
Disadvantages: difficult to select kernel/parameters; slow to train for large datasets; not easily explained

27
Q

Decision trees benefits and disadvantages

A

Benefits: human readable structural pattern; transparent model for numeric and nominal features; easily explained and interpreted
Disadvantages: inefficient for high dimensional data; tendency to overfit; sensitive to small perturbations

28
Q

Random forests benefits and disadvantages

A

Benefits: efficient for high dimensional data; insensitive to noise/missing data; directly selects important features for both numerical and nominal features
Disadvantages: not easily explained or interpreted; more difficult to tune; biased towards features with many levels

29
Q

Neural networks benefits and disadvantages

A

Benefits: model complex patterns in data; handle both classification and regression; no assumption of data relationships
Disadvantages: not easily interpreted; prone to overfitting in training; computationally intense to train

30
Q

Centre based clustering (k means) benefits and disadvantages

A

Benefits: efficient (linear complexity); automatic and no cut off required
Disadvantages: depends on initial seeds; often trapped in local minima; inefficient for high dimensional data
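A compact sketch of the standard k-means loop (assign each point to its nearest centre, then move each centre to the mean of its points). The starting centres are passed in explicitly, which makes the dependence on initial seeds visible. The points, seeds and iteration count are illustrative.

```python
from math import dist

def k_means(points, centres, n_iter=10):
    """Alternate assignment and centre-update steps for a fixed number of rounds."""
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
    return centres, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = k_means(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

Each round costs one pass over the points per centre, which is where the linear complexity comes from.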