Classification Flashcards

(30 cards)

1
Q

impute/imputation

A

In statistics, imputation is the process of replacing missing data with substituted values.
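
As a minimal sketch of one common strategy (mean imputation, chosen here only for illustration; the card does not name a strategy), missing entries can be filled with the average of the observed values. The helper name impute_mean and the sample data are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]   # made-up data with two missing entries
print(impute_mean(ages))          # [20, 30, 30, 40, 30]
```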

2
Q

confusion matrix

A
  • A cross-tabulation of our model’s predictions against actual values
  • A matrix (table) used to measure the performance of a machine learning algorithm
  • Rows: actual classes (Ci)
  • Columns: predicted classes (Cj)
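
The cross-tabulation described above can be sketched in plain Python as follows. The function name confusion_matrix and the label data are illustrative assumptions, not taken from any particular library.

```python
def confusion_matrix(actual, predicted, labels):
    """Count outcomes: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up true labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions
cm = confusion_matrix(actual, predicted, labels=[0, 1])
print(cm)  # [[3, 1], [1, 3]] -> [[TN, FP], [FN, TP]] when 1 is the positive class
```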
3
Q

What are the 4 possible outcomes of a classification task?

A

True Positive
False Positive
False Negative
True Negative

4
Q

What is the common choice for the baseline model for a classification problem?

A

a model that simply predicts the most common class every single time
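
A rough sketch of such a baseline, assuming the labels are held in a plain list (the helper name majority_baseline and the data are hypothetical):

```python
from collections import Counter

def majority_baseline(y_train):
    """Return the most common class; the baseline predicts it for every observation."""
    most_common_class, _count = Counter(y_train).most_common(1)[0]
    return most_common_class

y = ["no", "no", "yes", "no", "yes", "no"]   # made-up labels
baseline_class = majority_baseline(y)
# Accuracy of always predicting the majority class (here 4 of 6 labels are "no").
baseline_accuracy = sum(label == baseline_class for label in y) / len(y)
print(baseline_class, baseline_accuracy)
```

Any real model should beat this accuracy to be worth using.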

5
Q

What are the common evaluation metrics for a classification problem/model?

A
  • Accuracy
  • Precision
  • Recall
  • Specificity
  • F1 score
  • ROC curve
6
Q

What is accuracy?

A

The number of times we predicted correctly divided by the total number of observations: (TP + TN) / (TP + TN + FP + FN).

7
Q

What is precision / positive predictive value?

A

The percentage of our positive predictions that are correct: TP / (TP + FP).

8
Q

What is recall / true positive rate / sensitivity?

A

The percentage of actual positive cases that we correctly predicted as positive: TP / (TP + FN).

9
Q

What is specificity / true negative rate?

A

The percentage of actual negative cases that we correctly predicted as negative: TN / (TN + FP).
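
Accuracy, precision, recall, and specificity can all be computed from the four confusion-matrix counts. The sketch below assumes every denominator is nonzero; the function name and the counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute four common metrics from confusion-matrix counts.

    Assumes every denominator is nonzero (no zero-division handling).
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all observations
        "precision": tp / (tp + fp),     # correct positives / predicted positives
        "recall": tp / (tp + fn),        # correct positives / actual positives
        "specificity": tn / (tn + fp),   # correct negatives / actual negatives
    }

m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m["accuracy"], m["precision"], m["specificity"])  # 0.7 0.8 0.75
```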

10
Q

logistic regression

A
  • A regression algorithm used for classification
  • Finds the values of the coefficients that weight each input variable
  • Assigns observations to a discrete set of classes
  • Predicts discrete outcomes
  • Can be binomial (two classes) or multinomial (more than two classes)
  • In the binomial case, the output is a value between 0 and 1 that represents the probability of one class over the other.
11
Q

regularized least squares

A
  • A way of solving least squares regression problems
  • Imposes an extra constraint on the solution, which is called regularization
  • Adds a penalty term to the error
  • Available as an argument in LogisticRegression
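
To illustrate the penalty term, here is a sketch for the simplest possible case: one feature, no intercept, and an L2 (ridge) penalty. The choice of an L2 penalty is an assumption, since the card does not say which penalty is meant. Minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x^2) + lam).

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for one-feature least squares with an L2 penalty lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                    # exactly y = 2x
w_plain = ridge_slope(xs, ys, lam=0.0)  # no penalty: ordinary least squares
w_ridge = ridge_slope(xs, ys, lam=14.0) # penalty shrinks the coefficient toward 0
print(w_plain, w_ridge)                 # 2.0 1.0
```

A larger penalty pulls the coefficient closer to zero, trading some fit on the training data for a simpler model.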
12
Q

What are the components of a decision tree?

A
  • root
  • condition/internal node
  • branches/edges
  • decision/leaf
13
Q

What is a classification tree?

A

A decision tree that predicts a categorical outcome variable (a class label).

14
Q

What is a regression tree?

A

A decision tree that predicts a continuous value, such as the price of a house.

15
Q

CART = ?

A

classification and regression trees

16
Q

Recursive binary splitting / greedy algorithm

A
  1. Consider all the features
  2. Use a cost function to evaluate every candidate split point
  3. Select the split with the lowest cost
  4. Make that best split the root node, then repeat the procedure on each resulting branch
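
The steps above can be sketched for a single level of the tree as follows. Gini impurity is used as the cost function purely as an assumption (the card does not name one), and the function names and data are hypothetical. A full tree would call best_split again on each side of the chosen split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Greedy search: try every feature and every observed value as a threshold."""
    best = None  # (cost, feature_index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put observations on both sides
            # Cost: impurity of each side, weighted by its share of the rows.
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    return best

rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]  # made-up features
labels = ["a", "a", "b", "b"]
cost, feature, threshold = best_split(rows, labels)
print(cost, feature, threshold)  # 0.0 0 3.0
```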
17
Q

What are Decision Trees?

A
  1. A supervised machine learning method, trained on labeled data.
  2. The training data is used to learn a decision boundary, expressed as a sequence of rules.
  3. The boundary serves as a decision rule to classify observations into 2 or more classes.
18
Q

What does each node represent?

A
  • A splitting point in the decision tree.
  • Represents a single input variable (x).
  • Holds a split point on, or a class of, that variable.
19
Q

What are the pros of a decision tree?

A
  1. Simple to understand
  2. Simple to visualize
  3. Simple to explain the output
  4. Requires little data preparation
  5. No need to encode the target variable
  6. Performs well for a broad range of problems
20
Q

What is f1-score?

A
  1. The harmonic mean of recall and precision: F1 = 2 × (precision × recall) / (precision + recall)
    - gives both metrics equal weight.
  2. Useful when you want to optimize for both recall and precision.
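
A short sketch of the harmonic-mean calculation. The function name mirrors the metric but is defined here by hand, and returning 0.0 when both inputs are zero is a convention assumed for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.8, 0.6)
print(round(f1, 4))  # 0.6857, lower than the arithmetic mean of 0.7
```

Because it is a harmonic mean, the score is pulled toward the weaker of the two metrics.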
21
Q

What is support?

A

The number of occurrences of each class in y_true (the actual labels).
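
As a small illustration with made-up labels, support is simply a per-class count of the true labels:

```python
from collections import Counter

y_true = ["cat", "dog", "cat", "bird", "cat", "dog"]  # made-up actual labels
support = Counter(y_true)  # class -> number of occurrences among the true labels
print(support["cat"], support["dog"], support["bird"])  # 3 2 1
```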

22
Q

What is overfitting?

A

The model fits the training data too closely and does not generalize well to new, unseen data.

23
Q

How to avoid overfitting in a decision tree?

A

Mechanisms such as

  1. Pruning.
  2. Set the minimum number of samples required at a leaf node
  3. Set the maximum depth
24
Q

How to handle overfitting?

A
  1. Obtain more training data
  2. Feature engineering
25
Q

What is a Logistic Regression model?

A

Maps any real value into a number between 0 and 1, representing the probability that an observation is in the positive class.
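
The mapping is usually done with the sigmoid (logistic) function applied to a weighted sum of the inputs. The sketch below uses hypothetical weights and feature values, and the plain math.exp call can overflow for very large negative inputs.

```python
import math

def sigmoid(z):
    """Map any real value z to a number strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

print(sigmoid(0.0))                                     # 0.5
print(predict_proba([1.0, 2.0], [0.5, -0.25], bias=0))  # 0.5, since z = 0.5 - 0.5 = 0
```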
26
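Q

How does the threshold in a logistic regression model affect the metrics?

A
  1. Decrease the threshold: recall increases (or stays the same), usually at the cost of precision.
  2. Increase the threshold: precision generally increases, at the cost of recall.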
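
A quick way to see the effect is to score the same predicted probabilities at two thresholds. The data and the helper name precision_recall_at are made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then compute precision and recall."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1]                # made-up actual labels
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]    # made-up predicted probabilities
low = precision_recall_at(y_true, scores, 0.35)   # (0.75, 1.0): high recall
high = precision_recall_at(y_true, scores, 0.65)  # precision 1.0, recall 2/3
print(low, high)
```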
27
Q

What is the ROC curve?

A
  1. Receiver Operating Characteristic curve
  2. Summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a predictive model across different probability thresholds
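
Each point on the curve is an (FPR, TPR) pair at one threshold. The sketch below computes a few such points by hand, assuming the data contains at least one positive and one negative case; the function name, data, and threshold list are illustrative.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold; labels are 0 (negative) or 1 (positive)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 0, 1, 1]                # made-up actual labels
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]    # made-up predicted probabilities
points = roc_points(y_true, scores, thresholds=[1.0, 0.65, 0.35, 0.0])
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1).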
28
Q

How to calculate a baseline prediction?

A

The average value of the dependent variable. For a 0/1 target, this is the proportion of observations in the positive class.
29
Q

What is encoding?

A

Transforming categorical variables into binary or numeric counterparts.
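
One common binary scheme is one-hot encoding, sketched here by hand. It is only one of several encodings, and the helper name one_hot and the data are hypothetical.

```python
def one_hot(values):
    """Turn each categorical value into a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]   # made-up categorical variable
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```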
30
Q

What is the purpose of splitting data?

A

To evaluate the model on data it was not trained on, so that we avoid overfitting the model to one sample.
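
A minimal hand-rolled sketch of a train/test split. The helper split_train_test, the 25% test fraction, and the fixed seed are assumptions for this example and are not a library API.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(8))                # stand-in for 8 observations
train, test = split_train_test(data)
print(len(train), len(test))         # 6 2
```

The model is fit on train only, and test is used once to estimate how well it generalizes.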