# Logistic Regression Flashcards

1
Q

why not use linear regression for a binary (0/1) response?

A

predicted probabilities may be below 0 or above 1
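A minimal sketch of this problem, assuming simulated data rather than the Default data set used later in the deck (the names `x`, `y`, `lm_fit` and `glm_fit` here are illustrative only). It fits a straight line and a logistic curve to the same 0/1 response and compares the range of the fitted values.

```r
# Simulated binary response (an assumption for illustration)
set.seed(1)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(2 * x))

lm_fit  <- lm(y ~ x)                        # linear regression on a 0/1 response
glm_fit <- glm(y ~ x, family = binomial)    # logistic regression

range(fitted(lm_fit))    # extends below 0 and above 1
range(fitted(glm_fit))   # stays within [0, 1]
```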

2
Q

what does logit(p) equal?

A

ln(p/(1-p)) = β0 + β1*x (β1 is the expected increase in log-odds when X increases by one unit)
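The logit and its inverse can be sketched in a few lines of R. The helper names `logit` and `inv_logit` are made up for this example; base R already provides the same functions as `qlogis()` and `plogis()`.

```r
logit     <- function(p) log(p / (1 - p))    # probability -> log-odds
inv_logit <- function(z) 1 / (1 + exp(-z))   # log-odds -> probability

p <- c(0.1, 0.5, 0.9)
logit(p)                         # log-odds; p = 0.5 maps to 0
inv_logit(logit(p))              # recovers p
all.equal(logit(p), qlogis(p))   # same as base R's qlogis()
```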

3
Q

intercept in terms of odds?

A

e^β0 (the odds of Y=1 when X=0)

4
Q

slope in terms of odds?

A

e^β1 (the factor by which the odds are multiplied when X increases by one unit)
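A quick numeric check of the odds interpretation, using hypothetical coefficients (`b0 = -3` and `b1 = 0.5` are made up, not estimates from any data set). The odds ratio for a one-unit increase in X is the same wherever it is evaluated.

```r
b0 <- -3    # hypothetical intercept
b1 <- 0.5   # hypothetical slope

odds <- function(x) exp(b0 + b1 * x)   # odds of Y = 1 at a given x

odds(0)             # equals exp(b0)
odds(3) / odds(2)   # equals exp(b1)
odds(8) / odds(7)   # also exp(b1): the ratio does not depend on x
```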

5
Q

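can the estimated β1 be interpreted as the change in the probability that Y=1 associated with a unit change in X?

A

No. β1 is the change in log-odds; the probability is not linear in X, so the change in probability depends on the value of X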
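To see why, here is a small sketch with hypothetical coefficients (the values of `b0` and `b1` are assumptions chosen for illustration). The same one-unit step in X changes the log-odds by a constant amount but changes the probability by different amounts.

```r
b0 <- -3    # hypothetical intercept
b1 <- 0.5   # hypothetical slope

prob <- function(x) plogis(b0 + b1 * x)   # P(Y = 1) at a given x

change_low <- prob(1) - prob(0)   # one-unit step where the probability is small
change_mid <- prob(7) - prob(6)   # one-unit step near probability 0.5
c(change_low, change_mid)         # the second change is much larger

# The log-odds change is b1 in both cases
qlogis(prob(1)) - qlogis(prob(0))
qlogis(prob(7)) - qlogis(prob(6))
```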

6
Q

sensitivity?

A

TP/P (prioritise it when FN are more costly than FP). Raise sensitivity by classifying more observations as 'Yes', i.e. lowering the probability threshold (fewer FN but more FP, so specificity is reduced)
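The trade-off can be sketched on simulated data (the labels `y`, the scores `prob` and the helper `rates` are assumptions for illustration). Lowering the threshold classifies more observations as positive, so sensitivity can only rise and specificity can only fall.

```r
# Simulated labels and predicted probabilities (illustrative only)
set.seed(4)
y    <- rbinom(1000, 1, 0.2)
prob <- plogis(rnorm(1000, mean = ifelse(y == 1, 0.5, -1.5)))

# Sensitivity and specificity at a given probability threshold
rates <- function(threshold) {
  pred <- prob > threshold
  c(sensitivity = sum(pred & y == 1) / sum(y == 1),
    specificity = sum(!pred & y == 0) / sum(y == 0))
}

rates(0.5)   # default threshold
rates(0.2)   # lower threshold: higher sensitivity, lower specificity
```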

7
Q

true positive rate?

A

TP/P (= sensitivity = 1 - Type II error rate)

8
Q

false positive rate?

A

FP/N (= 1 - specificity = Type I error rate)

9
Q

positive predictive value?

A

TP/P̂ (precision), where P̂ = TP + FP is the number of predicted positives

10
Q

negative predictive value?

A

TN/N̂, where N̂ = TN + FN is the number of predicted negatives
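The four rates above can be computed directly from confusion-matrix counts. This sketch reuses the counts that appear in the "find point on ROC" card later in the deck (TP = 124, FN = 209, TN = 9615, FP = 52); the variable names are illustrative.

```r
TP <- 124; FN <- 209    # actual positives: correctly / incorrectly classified
TN <- 9615; FP <- 52    # actual negatives: correctly / incorrectly classified

P     <- TP + FN        # actual positives
N     <- TN + FP        # actual negatives
P_hat <- TP + FP        # predicted positives
N_hat <- TN + FN        # predicted negatives

sensitivity <- TP / P        # true positive rate
fpr         <- FP / N        # false positive rate = 1 - specificity
ppv         <- TP / P_hat    # positive predictive value (precision)
npv         <- TN / N_hat    # negative predictive value

round(c(sensitivity = sensitivity, fpr = fpr, ppv = ppv, npv = npv), 4)
```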

11
Q

what does the ROC (Receiver Operating Characteristic) curve trace out?

A

the true positive rate (y-axis) against the false positive rate (x-axis) as we vary the probability threshold from 0 to 1

12
Q

AUC is the area under the ROC curve. what does it measure?

A

it measures the overall performance of the classifier across all thresholds (max AUC = 1); the larger the AUC, the better the classifier

13
Q

what is the chance line?

A

the 45-degree diagonal of the ROC plot, which is what random guessing produces (AUC = 0.5). A useful classifier should not fall below this line
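A base-R sketch of how an ROC curve and its AUC arise from sweeping the threshold, assuming simulated scores (the helper `roc_auc` and the data are made up for illustration; the later cards use the ROCR package instead). Informative scores give an AUC well above 0.5, while scores unrelated to the labels land near the chance line.

```r
# Simulated labels and scores (assumptions for illustration only)
set.seed(2)
n     <- 500
y     <- rbinom(n, 1, 0.3)
score <- plogis(rnorm(n, mean = ifelse(y == 1, 1, -1)))   # informative scores in (0, 1)
noise <- runif(n)                                          # scores unrelated to y

# Trapezoidal area under the ROC curve, sweeping the threshold from 0 to 1
roc_auc <- function(score, y, thresholds = seq(0, 1, by = 0.01)) {
  tpr <- sapply(thresholds, function(t) mean(score[y == 1] > t))
  fpr <- sapply(thresholds, function(t) mean(score[y == 0] > t))
  # thresholds increase, so (fpr, tpr) runs from (1, 1) down to (0, 0); reverse it
  fpr <- rev(fpr)
  tpr <- rev(tpr)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

roc_auc(score, y)   # well above 0.5 for informative scores
roc_auc(noise, y)   # close to 0.5, the chance line
```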

14
Q

for cross-validation of a classifier, what is used instead of the MSE?

A

the number (or proportion) of misclassified observations, i.e. the misclassification error rate

15
Q

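convert a factor response to numeric for linear regression (predictions can be negative, so don't use this for classification)

A

library(ISLR)  # assumed source of the Default data set
Default$default_yes = ifelse(Default$default == "Yes", 1, 0)  # code "Yes" as 1, "No" as 0
lm_fit = lm(default_yes ~ balance, data = Default)
summary(lm_fit)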

16
Q

how to tell R to fit a logistic regression?

A

use glm() with family = binomial, e.g.
glm_fit1 = glm(default ~ student, data = Default, family = binomial)
summary(glm_fit1)

17
Q

to make predictions?

A

predict(glm_fit1, newdata = data.frame(variable = c(option1, option2)), type = "response")
e.g. predict(glm_fit1, newdata = data.frame(student = c("Yes", "No")), type = "response")

18
Q

find the predicted probabilities for the first 10 observations?

A

glm_prob = predict(glm_fit2, type = "response")

glm_prob[1:10]

19
Q

confusion matrix at a probability threshold of 0.5?

A

confusion_matrix = table(glm_prob > 0.5, Default$default)
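One caveat with tabulating a logical vector is that the rows are labelled FALSE/TRUE, and a row disappears entirely if no observation crosses the threshold. A hedged sketch of a labelled alternative, on simulated data (the vectors `actual` and `prob` are assumptions, not the Default data):

```r
# Simulated actual classes and predicted probabilities (illustrative only)
set.seed(3)
actual <- factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(0.8, 0.2)),
                 levels = c("No", "Yes"))
prob   <- runif(200)

# Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
predicted <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))

cm <- table(Predicted = predicted, Actual = actual)
cm

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions lie on the diagonal
accuracy
```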

20
Q

to find the AUC (area under the ROC curve)?

A

library(ROCR)  # provides prediction() and performance()
pred = prediction(glm_prob, Default$default)
perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc_perf = performance(pred, measure = "auc")
round(auc_perf@y.values[[1]], 2)

21
Q

plot ROC curve with chance line?

A

plot(perf)
abline(0, 1, lwd = 1, lty = 2)
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))  # add the AUC label to the ROC plot

22
Q

find the maximum accuracy and the cutoff that achieves it?

A

accuracy_perf = performance(pred, measure = "acc")
plot(accuracy_perf, col = "deeppink3", lwd = 2)
ind = which.max(slot(accuracy_perf, "y.values")[[1]])
acc = slot(accuracy_perf, "y.values")[[1]][ind]
cutoff = slot(accuracy_perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))

23
Q

mark the maximum-accuracy point on the accuracy plot?

A

points(cutoff, acc, type = "p")

text(0.6, 0.86, "(0.4299, 0.9740)", cex = 0.85)

24
Q

confusion matrix at the most accurate cutoff?

A

confusion_matrix = table(glm_prob > cutoff, Default$default)

confusion_matrix

25
Q

plot the classifier's operating point on the ROC curve?

A

```r
sensitivity = 124 / (124 + 209)    # TP / (TP + FN)
specificity = 9615 / (9615 + 52)   # TN / (TN + FP)
TPR = sensitivity
FPR = 1 - specificity
plot(perf)
abline(0, 1, lwd = 1, lty = 2)
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
points(FPR, TPR, type = "p", pch = 16)
```

26
Q

model validation: validation set approach

A

```r
set.seed(100)
# Draw 5000 distinct row indices from 1..10000 for the training set
ind = sample(10000, 5000)
training = Default[ind, ]
testing = Default[-ind, ]
glm_train = glm(default ~ balance + student,
                data = training,
                family = "binomial")
summary(glm_train)
# Predicted probabilities of default for the held-out observations
glm_prob = predict(glm_train,
                   newdata = testing,
                   type = "response")
# Classify as "Yes" if the predicted probability is greater than 0.5, otherwise "No"
glm_pred = ifelse(glm_prob > 0.5, "Yes", "No")
# Confusion matrix
table(glm_pred, testing$default)
```

27
Q

another method to find accuracy?

A

```r
# Proportion of correct classifications at a 0.5 threshold
accuracy = function(response, predict) {
  mean(((predict <= 0.5) & (response == 0)) | ((predict > 0.5) & (response == 1)))
}
# predict <= 0.5 & response == 0 are the true negatives;
# predict > 0.5 & response == 1 are the true positives
# Check the function on the test set before using it as a cost function
response = ifelse(testing$default == "Yes", 1, 0)
predict = glm_prob
accuracy(response, predict)
```

28
Q

k-fold CV

A

library(boot)  # provides cv.glm()
set.seed(100)
glm_fit2 = glm(y ~ var1 + var2 + var3, data = Default, family = binomial)  # template: substitute the actual response and predictors
cv_error = rep(0, 2)
cv_error[1] = cv.glm(Default, glm_fit2, accuracy, K = 5)$delta[1]  # K = 5
cv_error[2] = cv.glm(Default, glm_fit2, accuracy, K = 10)$delta[1]  # K = 10
cv_error  # the cost function is accuracy, so these are CV accuracies; the CV error rate is 1 - cv_error
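For intuition about what k-fold CV does, here is a hand-rolled sketch in base R. It assumes simulated data (`sim`, with one predictor `x`) rather than Default, and it uses the misclassification rate as the cost; it is not how `cv.glm()` is implemented, only an illustration of the same idea.

```r
# Simulated data (an assumption for illustration)
set.seed(100)
n     <- 1000
sim   <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 2 * sim$x))

K    <- 5
fold <- sample(rep(1:K, length.out = n))   # random fold label for each row

fold_error <- numeric(K)
for (k in 1:K) {
  train <- sim[fold != k, ]                # fit on the other K - 1 folds
  test  <- sim[fold == k, ]                # evaluate on the held-out fold
  fit   <- glm(y ~ x, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  fold_error[k] <- mean((prob > 0.5) != (test$y == 1))   # misclassification rate
}

cv_error <- mean(fold_error)   # average error over the K folds
cv_error
```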