Week 6 DSE Flashcards

(34 cards)

1
Q

What do we do before we create a machine learning model?

A

Visualise data

2
Q

How to use box plots to tell which variables are important?

A

Look at the medians of the groups.

They must be far apart (well separated) for the variable to be important.

3
Q

What must we do whenever we have a categorical variable?

A

Convert it to numeric (e.g. via dummy/indicator variables).
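
For example, a minimal R sketch (using the ISLR Default data that appears elsewhere in this deck; not the lecture's own code): make sure the categorical variable is a factor, and glm() will create the 0/1 dummy variables automatically.

Default$student = factor(Default$student)   # "Yes"/"No" becomes a factor
glm_fit = glm(default ~ balance + student, data = Default, family = binomial)
summary(glm_fit)   # coefficient table shows the dummy variable "studentYes"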

4
Q

What is the range of results from the logistic model?

A

A probability, between 0 and 1.
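
A minimal sketch of why the output lies between 0 and 1 (illustrative code, not from the lecture): the logistic (sigmoid) function squashes any real number into (0, 1).

logistic = function(x) 1 / (1 + exp(-x))
logistic(c(-10, 0, 10))   # roughly 0.000045, 0.5, 0.999955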

5
Q

More often than not we care about the ______ and ______ of the slope, b1.

A

sign

relative magnitude

6
Q

When to use t-values vs z-values, and why? (For linear vs logistic models)

A

z-values: glm (logistic regression), because you CAN'T use the least-squares method; logistic regression is fit by maximum likelihood, and the resulting estimates approximately follow a normal distribution.

t-values: lm (not glm), because you use the least-squares method.
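
A minimal sketch of where these appear in R (the lm example is illustrative, not from the lecture): summary() of a logistic glm reports z values, while summary() of a least-squares lm reports t values.

glm_fit = glm(default ~ balance, data = Default, family = binomial)
summary(glm_fit)$coefficients   # columns include "z value"
lm_fit = lm(balance ~ income, data = Default)
summary(lm_fit)$coefficients    # columns include "t value"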

7
Q

What does glm stand for?

A

generalised LINEAR model

8
Q

What must you do in R to specify use of a logistic model?

A

glm(default ~ balance, data = Default, family = binomial)

You NEED to specify family = binomial.

9
Q

How do I represent scaling a certain independent variable in R?

A

Use I() (it represents operations, i.e. lets you do arithmetic inside the formula):

glm(default ~ balance + I(income/1000) + student, data = Default, family = binomial)

10
Q

What is sensitivity?

A

Measures a classifier’s ability to identify positive status.

P(tested POSITIVE | actually positive)

How good we are at identifying positive cases out of all cases that are actually positive.

11
Q

What is specificity?

A

Measures a classifier’s ability to identify negative status.

P(tested NEGATIVE | actually negative)

How good we are at correctly identifying negative patients.

True negative rate.
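
A minimal sketch of computing both from a confusion matrix (glm_prob is the vector of predicted probabilities used later in this deck; the 0.5 threshold and treating "Yes" as the positive class are assumptions):

cm = table(pred = ifelse(glm_prob >= 0.5, "Yes", "No"), truth = Default$default)
cm["Yes", "Yes"] / sum(cm[, "Yes"])   # sensitivity = TP / actual positives
cm["No", "No"] / sum(cm[, "No"])      # specificity = TN / actual negatives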

12
Q

What is the false positive rate?

A

Fraction of cases that are ACTUALLY NEGATIVE that are wrongly classified as POSITIVE.

13
Q

What is the true positive rate?

A

Fraction of cases that are ACTUALLY POSITIVE that are correctly classified as POSITIVE.

(= sensitivity)

14
Q

What happens as decision threshold increases?

A

FPR decreases. TPR decreases.
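
A minimal sketch illustrating this (glm_prob is the vector of predicted probabilities; the thresholds are arbitrary): as the threshold rises, fewer cases are called positive, so both TPR and FPR fall.

for (thr in c(0.2, 0.5, 0.8)) {
  pred = glm_prob >= thr
  tpr = sum(pred & Default$default == "Yes") / sum(Default$default == "Yes")
  fpr = sum(pred & Default$default == "No") / sum(Default$default == "No")
  cat("threshold", thr, ": TPR", round(tpr, 3), "FPR", round(fpr, 3), "\n")
}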

15
Q

What do we use to measure optimal decision threshold?

A

Draw ROC curves at different thresholds.

Find the one with the biggest area under the curve (AUC).
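
A minimal sketch of drawing the ROC curve and getting the AUC in R; this assumes the pROC package, which is not part of the lecture code (glm_prob is the vector of predicted probabilities).

library(pROC)   # assumption: pROC is installed
roc_obj = roc(response = Default$default, predictor = glm_prob)
plot(roc_obj)   # ROC curve (pROC plots sensitivity against specificity)
auc(roc_obj)    # area under the curve, between 0 and 1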

16
Q

What is shown on the axes of ROC curves?

A

Y axis: True positive rate

X axis: False positive rate

17
Q

What is the range of values for AUC?

A

0 to 1

Best one is the one closest to 1

18
Q

What does model validation require?

A

Use a separate test dataset to see whether the model is good:

split the data into a training set and a test set.
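
A minimal sketch of such a split (the 70/30 proportion and the object names Default_train / Default_test are illustrative, chosen to match the training-error card below):

set.seed(1101)
train_idx = sample(nrow(Default), round(0.7 * nrow(Default)))   # 70% of rows for training
Default_train = Default[train_idx, ]
Default_test = Default[-train_idx, ]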

19
Q

How to measure the accuracy of a model?

A

Training error

20
Q

What does test error quantify?

A

It quantifies the predictive ability of the model (how well it predicts new data).

21
Q

How does training error compare to test error?

A

Training error is overly optimistic (it UNDERESTIMATES the test error)

because the model has already seen the data, so it tells us nothing about how well it predicts new data.

22
Q

What are the approaches to model validation?

A

validation-set approach

K-fold cross validation

Leave-one-out cross validation

23
Q

What is the most commonly applied model validation method?

A

k-fold cross validation

24
Q

GENERALLY, what are model validation approaches like?

A

holding out subset(s) of the training observations from the model fitting process, and then applying the classifier to these held out observations.

25

Q

What is a disadvantage of the validation-set approach?

A

There is a randomisation step: the error rate will be different if you divide your data differently.

26

Q

Explain what the validation-set approach is.

A

Randomly divide the full data set into 2: a training set and a validation (test) set.

27

Q

Explain the k-fold cross validation approach.

A

1. Randomly split the full data set into K folds of equal size.
2. Training set: k-1 folds; test set: 1 fold.
3. Iterate the process k times and then calculate the average test error.

28

Q

Explain leave-one-out cross validation.

A

The special case where k = n: for each of the n observations, the test set is that 1 observation and the training set is the remaining n-1 observations.

29

Q

How to predict multiple observations in R?

A

Need to use the c() function:

df_new = data.frame(student = c("Yes", "No"),
                    balance = c(1500, 1500),
                    income = c(40000, 40000))
predict(glm_fit, newdata = df_new, type = "response")

30

Q

What does type = "response" mean?

A

Return predicted probabilities (rather than the log-odds).

31

Q

How to generate a confusion matrix in R?

A

table(glm_prob >= 0.5, Default$default)

where glm_prob is the vector of predicted probabilities (the model's output AFTER PREDICTION).

32

Q

How to write code for k-fold cross validation with k = 5?

A

library(boot)   # cv.glm() comes from the boot package
set.seed(1101)
cv.glm(Default, glm_fit3, K = 5)$delta[1]   # test error

33

Q

How to write leave-one-out cross validation?

A

cv.glm(Default, glm_fit3)$delta[1]   # LOOCV (omitting K defaults to leave-one-out)

34

Q

How to get the training error in R?

A

glm_pred = ifelse(glm_prob_train > 0.5, "Yes", "No")
mean(glm_pred != Default_train$default)   # training error
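
For comparison, a minimal sketch of the matching test error on held-out data (Default_test and glm_prob_test are illustrative names following the same pattern as the training-error card, with Default_test coming from a train/test split):

glm_prob_test = predict(glm_fit, newdata = Default_test, type = "response")
glm_pred_test = ifelse(glm_prob_test > 0.5, "Yes", "No")
mean(glm_pred_test != Default_test$default)   # test error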