Machine Learning Flashcards
(97 cards)
What is a type 1 error?
A false positive. We predicted they have the disease, but they actually don’t.
What is a type 2 error?
A false negative. We predicted they don’t have the disease, but they actually do.
What rates are computed from a confusion matrix?
Accuracy, misclassification rate, true positive rate, false positive rate, specificity, precision, and prevalence.
What is specificity?
When it is actually no, how often does it predict no?
What is precision?
When it predicts yes, how often is it correct?
What is prevalence?
How often does the yes condition actually occur in our sample?
What is the formula for misclassification rate?
(FN+FP)/TOTAL or 1 - Accuracy
What is the formula for accuracy?
(TN+TP)/TOTAL
What is sensitivity or recall?
When it is actually yes, how often does it predict yes?
What’s another word for sensitivity?
Recall
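As a rough illustration of the rates on these cards, the sketch below computes them from the four cells of a binary confusion matrix. The function name `confusion_rates` and the counts in the usage lines are made up for this example, and it assumes none of the denominators is zero.

```python
def confusion_rates(tp, fp, tn, fn):
    """Compute common rates from binary confusion-matrix counts.

    Assumes every denominator (total, actual yes, actual no,
    predicted yes) is nonzero.
    """
    total = tp + fp + tn + fn
    actual_yes = tp + fn      # cases where the condition is really present
    actual_no = tn + fp       # cases where the condition is really absent
    predicted_yes = tp + fp   # cases the model labeled "yes"
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / actual_yes,          # also called recall or TP rate
        "false_positive_rate": fp / actual_no,
        "specificity": tn / actual_no,
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }


# Hypothetical counts, chosen only for illustration.
rates = confusion_rates(tp=100, fp=10, tn=50, fn=5)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Two handy sanity checks: accuracy plus misclassification rate is 1, and specificity plus false positive rate is 1.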
What is the ROC curve? What is on the Y-axis and X-axis?
A graph commonly used to summarize the performance of a classifier across all decision thresholds. Y-axis is the TP rate. X-axis is the FP rate.
What is AUC and what is the range?
AUC is the area under the ROC curve, which ranges from 0 to 1. A perfect classifier scores 1, random guessing scores 0.5, and a score below 0.5 means the classifier does worse than chance.
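One way to make the AUC concrete: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way. The name `auc_by_pairs` and the toy labels and scores are assumptions for illustration, and the pairwise loop is only practical for small samples.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Assumes at least one positive and one negative example.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data: one positive (0.35) is scored below one negative (0.4).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(labels, scores)
print(auc)  # 3 of the 4 pairs are ranked correctly, so 0.75
```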
What is a confusion matrix?
A table of predicted versus actual classes (TP, FP, TN, FN) used to describe the performance of a classification model on the test data.
What are the assumptions of the Naive Bayes algorithm?
Features are independent of each other given the class, and each feature makes an equal contribution to the outcome.
What is Bayes’ theorem?
P(A|B) = P(B|A)P(A)/P(B)
Name the parts of Bayes’ theorem.
Posterior Probability = Likelihood * Prior / Evidence
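A worked example may help here. Take A = "has the disease" and B = "tests positive", with made-up numbers: a prior of 1%, a likelihood P(B|A) of 90%, and a 5% chance of a positive test without the disease. The sketch below expands the evidence P(B) with the law of total probability. The function name `posterior` and all three numbers are assumptions for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B) for a binary hypothesis A.

    prior: P(A)
    likelihood: P(B|A)
    false_positive_rate: P(B|not A)
    """
    # Evidence P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positives.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(f"P(disease | positive test) = {p:.3f}")  # roughly 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior is small.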
What should you do when you have a lot of missing values in your data?
- Determine whether the values are missing at random or not at random.
- If they are missing at random, check whether the affected rows can be removed without compromising the dataset.
- If they can’t be removed, try imputation methods such as mean, regression, or KNN imputation.
- Alternatively, use learning methods that can handle missing data, such as ensembles and decision trees.
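The simplest of the imputation methods listed, mean imputation, can be sketched in a few lines of plain Python. The helper `impute_mean`, the use of `None` to mark a missing value, and the sample column are assumptions for illustration. Regression and KNN imputation follow the same pattern but predict each fill value from the other columns.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [25, None, 35, 40, None]
missing_share = sum(v is None for v in ages) / len(ages)
filled = impute_mean(ages)
print(f"{missing_share:.0%} missing, filled column: {filled}")
```

Checking the share of missing values first is a cheap way to judge whether dropping rows is even an option.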
What is considered clean/tidy data?
Data where each variable has its own column and each observation has its own row.
What are your favorite packages in R for data clean up and manipulation?
tidyr to clean up the data and dplyr to manipulate it.
What are two types of data clean up conversions?
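Long to wide: Values in a column should be variables, so they are spread out into their own columns.
Wide to long: Column headers are values rather than variables, so they are gathered into a single column.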
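The two conversions can be sketched in plain Python on a list of dictionaries. This is not tidyr; the helper names `wide_to_long` and `long_to_wide`, their parameters, and the sample table are all hypothetical and only meant to show the reshaping idea.

```python
def wide_to_long(rows, id_key, value_keys, var_name, value_name):
    """Gather several value columns into (variable, value) pairs."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: row[key]}
            )
    return long_rows


def long_to_wide(rows, id_key, var_name, value_name):
    """Spread (variable, value) pairs back into one column per variable."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[row[var_name]] = row[value_name]
    return list(wide.values())


# Hypothetical wide table: the year columns are values, not variables.
wide_table = [
    {"country": "A", "1999": 10, "2000": 20},
    {"country": "B", "1999": 30, "2000": 40},
]
long_table = wide_to_long(wide_table, "country", ["1999", "2000"], "year", "cases")
round_trip = long_to_wide(long_table, "country", "year", "cases")
print(long_table[0])  # {'country': 'A', 'year': '1999', 'cases': 10}
```

In the long form, each row is one observation (a country in a given year), which matches the tidy-data definition on the earlier card.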
What classifier should you choose?
For optimal accuracy, test multiple classifiers and choose the best one through cross-validation. Otherwise, as a starting point, the choice depends on how large the training set is and whether you are looking for predictive power or interpretability.
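Choosing among classifiers by cross-validation can be sketched as follows. Everything here is a toy assumption: the helpers `k_fold_indices` and `cv_accuracy`, the two stand-in "classifiers" (a majority-class baseline and a one-feature threshold rule), and the ten data points. Real data would normally be shuffled or stratified before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def cv_accuracy(fit, xs, ys, k=5):
    """Mean held-out accuracy; fit(train_xs, train_ys) returns a predict function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(train_xs, train_ys):
    """Baseline: always predict the most common training label."""
    majority = max(set(train_ys), key=train_ys.count)
    return lambda x: majority


def fit_threshold(train_xs, train_ys):
    """Predict 1 when x exceeds the midpoint between the two class means."""
    zeros = [x for x, y in zip(train_xs, train_ys) if y == 0]
    ones = [x for x, y in zip(train_xs, train_ys) if y == 1]
    cutoff = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cutoff else 0


# Toy data, interleaved so every fold contains both classes.
xs = [1, 8, 2, 9, 3, 7, 4, 10, 5, 6]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: cv_accuracy(fit, xs, ys, k=5) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

Because every candidate is scored on data it was not trained on, the comparison reflects how each one generalizes rather than how well it memorizes.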
Tactics for helping sales.
Tracking sales by marketing campaigns.
Provide actionable steps to close leads.
What is linear regression?
A statistical method for modeling the relationship between variables, most commonly by fitting the line that best fits the data by minimizing the squared deviations between observed and predicted values.
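For a single predictor, the best-fitting line in the least-squares sense has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below assumes that least-squares criterion. The name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```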
Compare univariate, bivariate, and multivariate.
Univariate: Analysis of a single variable.
Bivariate: Relationship between two variables.
Multivariate: Relationship between three or more variables.