Classification Flashcards

Question 1

Q

impute/imputation

Answer

A

In statistics, imputation is the process of replacing missing data with substituted values.
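
One common strategy is mean imputation. The sketch below is illustrative only: the function name impute_mean and the use of None to mark missing values are assumptions, not part of the card.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Made-up example data: two ages are missing.
ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
print(filled)  # [20, 30.0, 30, 40, 30.0]
```

Other strategies (median, most frequent value, model-based) follow the same pattern of computing a substitute from the observed values.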

Question 2

Q

confusion matrix

Answer

A

A cross-tabulation of our model’s predictions against actual values
A matrix (table) used to measure the performance of a machine learning algorithm
Rows: actual classes (Ci)
Columns: predicted classes (Cj)
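
A minimal sketch of building such a table by hand, with rows as actual classes and columns as predicted classes. The function name and the toy labels are assumptions chosen for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a nested list: rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Made-up labels for a three-class problem.
actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# bird [1, 0, 0]
# cat [0, 1, 1]
# dog [0, 1, 2]
```

The diagonal holds the correct predictions; every off-diagonal cell is a specific kind of mistake.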

Question 3

Q

What are the 4 possible outcomes of a classification task?

Answer

A

True Positive
False Positive
False Negative
True Negative
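
For a binary problem, the four outcomes can be counted by comparing each prediction with the actual label. This is a sketch under the assumption that the positive class is encoded as 1; the function name and data are illustrative.

```python
def count_outcomes(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels and predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
outcomes = count_outcomes(actual, predicted)
print(outcomes)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```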

Question 4

Q

What is the common choice for the baseline model for a classification problem?

Answer

A

a model that simply predicts the most common class every single time
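
A sketch of that baseline, assuming labels are stored in plain lists. The function name and the spam/ham data are made up for illustration.

```python
from collections import Counter


def baseline_predictions(train_labels, n_predictions):
    """Predict the most common training class for every observation."""
    most_common_class, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_class] * n_predictions


train_labels = ["ham", "spam", "ham", "ham", "spam"]
test_labels = ["ham", "spam", "ham", "ham"]

preds = baseline_predictions(train_labels, len(test_labels))
baseline_accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(preds[0], baseline_accuracy)  # ham 0.75
```

Any real model should beat this accuracy to be considered useful.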

Question 5

Q

What are the common evaluation metrics for a classification problem/model?

Answer

A

Accuracy
Precision
Recall
Specificity
F1 score
ROC curve

Question 6

Q

What is accuracy?

Answer

A

The number of times we predicted correctly divided by the total number of observations: (TP + TN) / (TP + TN + FP + FN)

Question 7

Q

What is precision / positive predictive value?

Answer

A

The percentage of the positive predictions we made that are correct: TP / (TP + FP)

Question 8

Q

What is recall / true positive rate / sensitivity?

Answer

A

The percentage of actual positive cases that we correctly predicted as positive: TP / (TP + FN)

Question 9

Q

What is specificity / true negative rate?

Answer

A

The percentage of actual negative cases that we correctly predicted as negative: TN / (TN + FP)
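
Specificity, like accuracy, precision and recall, can be computed directly from the four outcome counts. The sketch below uses invented counts and a hypothetical function name, and it assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four common metrics from the outcome counts.

    Assumes every denominator is non-zero (both classes occur and
    at least one positive prediction was made).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented counts for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.70
# precision: 0.80
# recall: 0.67
# specificity: 0.75
```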

Question 10

Q

logistic regression

Answer

A

Despite the name, an algorithm used for classification
Finds the values of the coefficients that weight each input variable
Assigns observations to a discrete set of classes
Predicts discrete outcomes
Comes in binomial (two classes) and multinomial (more than two classes) forms
The output is a value between 0 and 1 that represents the probability of one class over the other.
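
For the binomial case, the relationship between the coefficients and the output probability can be written as follows. The symbols \beta_0, ..., \beta_n (coefficients) and x_1, ..., x_n (input variables) are notation chosen here for illustration.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```

The weighted sum can be any real number; the logistic function squeezes it into the range between 0 and 1.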

Question 11

Q

regularized least squares

Answer

A

A way of solving least squares regression problems
Places an extra constraint on the solution, which is called regularization
Adds a penalty term to the error
Regularization is also exposed as an argument in LogisticRegression
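
One common instance is an L2 (ridge) penalty, sketched below. The card does not say which penalty is meant, so the choice of L2 and the symbols X (inputs), y (targets), w (coefficients) and \lambda (penalty strength) are assumptions.

```latex
\hat{w} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2,
\qquad
\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y
```

A larger \lambda shrinks the coefficients more strongly; \lambda = 0 recovers ordinary least squares.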

Question 12

Q

What are the components of a decision tree?

Answer

A

root
condition/internal node
branches/edges
decision/leaf

Question 13

Q

What is a classification tree?

Answer

A

A decision tree used to classify a categorical outcome variable

Question 14

Q

What is a regression tree?

Answer

A

A decision tree used to predict continuous values, such as the price of a house

Question 15

Q

CART = ?

Answer

A

classification and regression trees

Question 16

Q

Recursive binary splitting / greedy algorithm

Answer

A

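Consider all the features
Use a cost function to test all the candidate split points
Select the split with the best (lowest) cost
Make the best predictor/classifier the root node
Repeat the procedure on each resulting subset; it is greedy because each split is chosen only for its immediate cost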
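
A sketch of a single greedy step: try every feature and every candidate threshold, and keep the split with the lowest cost. Gini impurity is used as the cost function here, which is an assumption (the card does not name one); the function names and data are illustrative. A full tree would call best_split again on each side of the chosen split.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (cost, feature_index, threshold) of the lowest-cost split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # not a real split
            # Weighted average impurity of the two sides.
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best


# Made-up data: feature 0 separates the classes, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 6.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
cost, feature, threshold = best_split(rows, labels)
print(cost, feature, threshold)  # 0.0 0 3.0
```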

Question 17

Q

What are Decision Trees?

Answer

A

A supervised machine learning method, trained on labeled data.
The training data is used to fit the tree, which finds a decision boundary expressed as a sequence of rules.
The boundary is then used as a decision rule to classify observations into 2 or more classes.

Question 18

Q

What does each node represent?

Answer

A

A splitting point in the decision tree.
Represents a single input variable (x)
and a split point or class of that variable.

Question 19

Q

What are the pros of decision trees?

Answer

A

Simple to understand
Simple to visualize
Simple to explain the output
Requires little data preparation
No need to encode our target variable
Performs well for a broad range of problems

Question 20

Q

What is f1-score?

Answer

A

The harmonic mean of Recall and Precision, giving both metrics equal weight: F1 = 2 * Precision * Recall / (Precision + Recall)
Used when you are looking to optimize for both Recall and Precision.
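
A short sketch of why the harmonic mean is used: it stays low unless both metrics are high. The function name and the input values are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)
lopsided = f1_score(1.0, 0.1)
print(f"{balanced:.2f} {lopsided:.2f}")  # 0.80 0.18
```

The arithmetic mean of 1.0 and 0.1 would be 0.55, which hides the poor recall; the F1 score of about 0.18 does not.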

Question 21

Q

What is support?

Answer

A

The number of occurrences of each class in the true labels (y_true).
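
Support is just a per-class count of the true labels, as in this sketch with made-up data. This is the quantity shown in the support column of scikit-learn's classification_report.

```python
from collections import Counter

# Made-up true labels.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham"]

support = Counter(y_true)
print(support["ham"], support["spam"])  # 4 2
```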

Question 22

Q

What is overfitting?

Answer

A

The model fits the training data too closely and does not generalize well to new, unseen data.

Question 23

Q

How to avoid overfitting in a decision tree?

Answer

A

Mechanisms such as:
Pruning
Setting the minimum number of samples required at a leaf node
Setting the maximum depth

Question 24

Q

How to handle overfitting?

Answer

A

Obtain more training data
Feature engineering

Question 25

Q

What is a Logistic Regression model?

Answer

A

A model that maps any real value into a number between 0 and 1, representing the probability that an observation is in the positive class.
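
A sketch of that mapping for a single observation. The coefficients and intercept below are hand-picked, hypothetical values (not fitted to any data), and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
coefficients = [0.8, -0.5]
intercept = -1.0

p = predict_probability([3.0, 1.0], coefficients, intercept)
print(round(p, 3))  # 0.711
```

Here the weighted sum is -1.0 + 0.8 * 3.0 - 0.5 * 1.0 = 0.9, which the logistic function maps to a probability of about 0.71.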

Question 26

Q

How does the threshold in an LR model affect the metrics?

Answer

A

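1. Decrease the threshold: Recall increases, because more observations are predicted positive.
2. Increase the threshold: Precision generally increases, because only the most confident predictions are labeled positive.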
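
The trade-off can be seen by scoring the same predicted probabilities at several thresholds. The probabilities and labels below are invented for illustration, and the function name is hypothetical.

```python
def precision_recall_at(threshold, probabilities, actual):
    """Precision and recall when predicting positive for p >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented model outputs and true labels.
probabilities = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
actual = [1, 1, 0, 1, 1, 0, 0, 0]

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = precision_recall_at(threshold, probabilities, actual)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```

On real data precision does not always rise monotonically with the threshold, but recall can never increase as the threshold goes up.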

Question 27

Q

What is the ROC curve?

Answer

A

Receiver Operating Characteristic Curve
Summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a predictive model across different probability thresholds
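
A sketch of how the points on the curve are obtained: sweep the threshold over the predicted probabilities and record one (FPR, TPR) pair per threshold. The data is invented, the function name is hypothetical, and the code assumes both classes are present.

```python
def roc_points(probabilities, actual):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest.

    Assumes `actual` contains at least one 1 and at least one 0.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(probabilities), reverse=True):
        tp = sum(1 for y, p in zip(actual, probabilities) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(actual, probabilities) if p >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Invented model outputs and true labels.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
actual = [1, 1, 0, 1, 0, 0]

points = roc_points(probabilities, actual)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the curve; a model that hugs the top-left corner separates the classes well.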

Question 28

Q

How to calculate baseline prediction?

Answer

A

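For a regression problem: the average value of the dependent variable. For a classification problem: the most common class (the mode) of the dependent variable.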

Question 29

Q

What is encoding?

Answer

A

Transform categorical variables to binary or numeric counterparts
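
One-hot encoding is a common way to do this: each category becomes its own 0/1 column. The sketch below uses a hypothetical function name and made-up values.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```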

Question 30

Q

What is the purpose of splitting data?

Answer

A

To evaluate the model on data it has not seen during training, which helps to avoid overfitting the model to one sample
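
A minimal sketch of a random train/test split using only the standard library. The function name, the 25% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 observations
train, test = split_train_test(rows)
print(len(train), len(test))  # 15 5
```

The model is fit on train only; test is held back so that its score reflects performance on unseen data.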
