Machine Learning Flashcards

(97 cards)

1
Q

What is type 1 error?

A

False positive. We predicted they have the disease, but they actually don't.

2
Q

What is type 2 error?

A

False negative. We predicted they don’t have the disease, but they actually do.

3
Q

What rates are computed from a confusion matrix?

A

Accuracy, misclassification rate, true positive rate, false positive rate, specificity, precision, and prevalence.
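As a rough illustration, the rates listed above can all be computed from the four cells of a 2x2 confusion matrix. This is a minimal sketch: the function name `confusion_rates` and the example counts are made up for illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """Return the common rates derived from a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # also called sensitivity or recall
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }


# Hypothetical counts, chosen only to exercise the formulas.
rates = confusion_rates(tp=100, fp=10, tn=50, fn=5)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Two quick sanity checks follow from the definitions: accuracy and misclassification rate sum to 1, and so do specificity and false positive rate.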

4
Q

What is specificity?

A

When it is actually no, how often does it predict no?

5
Q

What is precision?

A

When it predicts yes, how often is it correct?

6
Q

What is prevalence?

A

How often does the yes condition actually occur in our sample?

7
Q

What is the formula for misclassification rate?

A

(FN+FP)/TOTAL or 1 - Accuracy

8
Q

What is the formula for accuracy?

A

(TN+TP)/TOTAL

9
Q

What is sensitivity or recall?

A

When it is actually yes, how often does it predict yes?

10
Q

What's another word for sensitivity?

A

Recall

11
Q

What is the ROC curve? What is on the Y-axis and X-axis?

A

A graph commonly used to summarize the performance of a classifier. Y-axis is the TP rate. X-axis is the FP rate.

12
Q

What is AUC and what is the range?

A

AUC is the area under the ROC curve. It ranges from 0 to 1. A perfect classifier scores 1, random guessing scores 0.5, and a score below 0.5 means the classifier does worse than random guessing.
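AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The function name `auc` and the toy scores are assumptions for illustration, and the pairwise loop is only practical for small data.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and classifier scores, invented for the example.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # every positive outranks every negative
```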

13
Q

What is a confusion matrix?

A

A table used to describe the performance of a classification model on the test data.

14
Q

What are the assumptions of the Naive Bayes algorithm?

A

The features are independent of one another given the class, and each feature makes an equal contribution to the outcome.

15
Q

What is Bayes' theorem?

A

P(A|B) = P(B|A)P(A)/P(B)

16
Q

Name the parts of Bayes' theorem.

A

Posterior Probability = Likelihood * Prior / Evidence
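A small worked example may help connect the named parts to numbers. The sketch below uses a made-up disease-testing scenario (1% prior, 95% likelihood of a positive test given disease, 5% false positive rate); the function name and all figures are assumptions for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B) for a binary event A and evidence B."""
    # Evidence P(B) via the law of total probability:
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: A = "has the disease", B = "test is positive".
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

With these inputs the posterior is only about 0.16, because the low prior means most positive tests come from the much larger healthy group.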

17
Q

What should you do when you have a lot of missing values in your data?

A
  1. Are the values missing at random or not at random?
  2. If they are missing at random, can the affected rows or columns be removed without compromising the dataset?
  3. If they can’t be removed, try imputation methods such as mean, regression, or KNN imputation.
  4. An alternative is to use learning methods that can handle missing data, such as some ensemble and decision tree implementations.
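Mean imputation, the simplest of the methods named in step 3, can be sketched as follows. This assumes missing values are represented as `None`; the function name `impute_mean` is made up for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is one reason regression or KNN imputation is sometimes preferred.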
18
Q

What is considered clean/tidy data?

A

If each variable has its own column and each observation is in its own row.

19
Q

What are your favorite packages in R for data clean up and manipulation?

A

tidyr to clean up the data and dplyr to manipulate it.

20
Q

What are two types of data clean up conversions?

A

Long to wide: used when the values in a column should be variables (column names).

Wide to long: used when the column names are values rather than variables.

21
Q

What classifier should you choose?

A

For optimal accuracy, you should test multiple classifiers and choose the best one through cross-validation. Otherwise, as a starting point, it depends on how large the training set is and whether you are looking for predictive power or interpretability.

22
Q

Tactics for helping sales.

A

Tracking sales by marketing campaigns.

Provide actionable steps to close leads.

23
Q

What is linear regression?

A

A statistical method for modeling the relationship between variables. It is most commonly done by fitting the line that best fits the data, minimizing the squared deviations between the line and the observations.

24
Q

Compare univariate, bivariate, and multivariate.

A

Univariate: Analysis of a single variable.
Bivariate: Relationship between two variables.
Multivariate: Relationship between three or more variables.

25
What can you tell me about the normal distribution?
- Occurs naturally in many situations
- Can apply the empirical rule
- Symmetric curve
- Standard normal model has mean of 0 and sd of 1
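The empirical rule can be checked numerically against the standard normal model. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three values come out at roughly 68%, 95%, and 99.7%, which is the empirical rule.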
26
Interpolation vs Extrapolation
Interpolation is used to estimate data points within the range of your data. Extrapolation is used to estimate data points beyond the range of your data.
27
What is k means clustering?
A type of unsupervised learning method in which the objective is to group the data into K groups. This can provide labels for the K clusters.
28
How do you select K in K means clustering?
- Prior business knowledge may suggest a particular K.
- Plot the average distance from each point to its cluster centroid against K and choose the elbow point.
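The elbow approach can be sketched on one-dimensional data. Everything here is an illustrative assumption: the function names, the deterministic initialization (spreading the starting centroids across the sorted data), and the toy data with three obvious groups.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1-D data; returns the centroids and the last cluster assignment."""
    pts = sorted(points)
    # Deterministic start: pick k points spread evenly across the sorted data.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters


def avg_distance(points, k):
    """Average distance from each point to the centroid of its cluster."""
    centroids, clusters = kmeans_1d(points, k)
    total = sum(abs(p - centroids[j]) for j, c in enumerate(clusters) for p in c)
    return total / len(points)


data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
for k in range(1, 5):
    print(k, round(avg_distance(data, k), 3))
```

On this toy data the average distance drops sharply up to K = 3 and barely improves at K = 4, so the elbow sits at K = 3.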
29
What is the curse of dimensionality?
Increasing the number of predictors does not always improve the model as the dimensions get larger. For example in KNN when the dimensions are large, there may not be enough neighboring points to make predictions.
30
How do you deal with curse of dimensionality?
- Variable selection
- Dimension reduction techniques such as PCA
31
What is the bias variance tradeoff?
It's a tradeoff between simpler and more complex models. Simpler models have high bias and low variance, whereas complex models have low bias and high variance. We want to balance variance and bias to minimize total error. High variance comes from overfitting. High bias comes from underfitting, where the model is too simple to capture the underlying pattern.
32
What two parts must every objective function have?
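Training loss and a regularization term. (This is how XGBoost frames its objective; not every objective function includes regularization.)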
33
What is an objective function?
A function used to measure the performance of a model given a set of parameters.
34
Explain XGBoost for classification
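Trees are grown sequentially, each one trying to reduce the error left by the trees before it. XGBoost is a gradient boosting method, so each new tree is fit to the gradient of the loss at the current predictions (roughly, the remaining errors). Explicitly giving a higher weight to the points the previous tree misclassified is how AdaBoost works.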
35
Benefits of XGBoost
- Proven winner in Kaggle competitions
- Allows regularization
- Handles missing values
- Cross validation built in
36
What are the linear regression assumptions?
The relationship between the predictors and the response is linear, and the residuals need to:
- Be independent
- Be normally distributed
- Have equal variance
37
What is the binomial probability formula?
P(X = k) = (n choose k) * p^k * (1-p)^(n-k), the probability of exactly k successes in n trials
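The standard form of the formula, P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), translates directly into code. This is a minimal sketch; the function name `binomial_pmf` is made up for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 fair coin flips: 6 * 0.25 * 0.25
print(binomial_pmf(2, 4, 0.5))  # 0.375
```

A useful check is that the probabilities for k = 0 through n sum to 1.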
38
Compare L1 vs L2
L1 is Lasso regression. L2 is ridge regression. L1 encourages sparsity, which can aid in variable selection.
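The sparsity effect of the L1 penalty shows up even in a one-coefficient problem. For an unpenalized estimate z, minimizing 0.5 * (w - z)^2 + lam * |w| gives soft-thresholding, while minimizing 0.5 * (w - z)^2 + 0.5 * lam * w^2 gives proportional shrinkage. The sketch below assumes this simplified setting; the function names and numbers are made up for illustration.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)


coefficients = [3.0, 0.4, -0.2, -2.0]
print([l1_shrink(z, 0.5) for z in coefficients])            # [2.5, 0.0, 0.0, -1.5]
print([round(l2_shrink(z, 0.5), 3) for z in coefficients])  # all shrunk, none exactly zero
```

The L1 penalty zeroes out the two small coefficients, while the L2 penalty only scales every coefficient toward zero.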
39
Compare K means vs KNN
K means is an unsupervised method for clustering. KNN is a supervised method for classification.
40
What is in the name 84.51?
It is the longitude of the Cincinnati office.
41
Who is the CEO of 84.51?
Stuart Aitken
42
What are the three core values at 84.51?
Limitless Minds, Fearless Hearts, and Relentless Delivery.
43
Explain Limitless Minds.
We fall in love with the problem and answer it with open minds and unbridled creativity.
44
Explain Fearless Hearts
We are all empowered to say and do what is right.
45
Explain Relentless Delivery
We love what we do and we see it through.
46
Speak to the job description 1: Identify opportunities for automation of existing solutions.
Donlen Scorecard
- Created a template in Tableau
- Used parameters in Tableau to automatically query data from SQL based on selected parameters
47
Explain structured vs unstructured data.
Structured data has organization and can be searched using simple algorithms. Unstructured data is essentially the opposite. It has no identifiable internal structure.
48
What is statistics?
It is the science of collecting and analyzing data for the purpose of producing meaningful insights.
49
What is significance?
A result is statistically significant when it would be unlikely to occur if only chance were at work (that is, if the null hypothesis were true). The more significant the result, the less plausible chance is as the sole explanation.
50
Define P-Value
In simple terms, it is the probability of seeing results at least as extreme as those observed if the null hypothesis were true (that is, by chance alone). Hence the lower the p-value, the stronger the evidence against the null hypothesis.
51
What is a t-test?
A test used to compare two population means
52
What is correlation?
A statistical measure of the relationship between two variables.
53
How would you identify important customer trends in unstructured data?
- Check with the business owner and understand the objectives to properly categorize the data.
- Use an iterative approach of pulling in new data and validating the model for accuracy.
- Solicit feedback from stakeholders.
54
What is sampling?
The process of collecting data from a population to form a sample allowing you to make some inferences about the population as a whole.
55
Name 4 sampling methods
- Simple random sampling (SRS)
- Cluster sampling
- Systematic sampling
- Stratified sampling
56
What is logistic regression?
A statistical learning method used to predict a binary outcome using the logit function.
57
Give an example of how you've used logistic regression.
Used it in predicting the odds of a driver getting into an accident.
58
How to detect outliers?
- Visually with a box plot or scatter plot
- Standardizing the data and calculating z-scores
- Use proximity-based models for non-parametric problems
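The z-score approach can be sketched in a few lines. The function name, the toy data, and the cutoff are all assumptions for illustration. A cutoff of 3 is a common rule of thumb, but in a sample this small no point can reach it, so the example uses 2.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the sample
    return [v for v in values if abs(v - mu) / sigma > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
print(zscore_outliers(data, threshold=2.0))  # [50]
```

Because the outlier itself inflates the mean and standard deviation, z-scores can understate how unusual a point is, which is one reason to look at a box plot as well.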
59
What is the difference between parametric and non-parametric?
Parametric: We make an assumption about the functional form of f in f(x). Non-Parametric: No explicit assumption is made about the functional form of f.
60
Advantages of using non-parametric methods?
The true function of the data may not follow a specific functional form. This allows f to be determined without restrictions.
61
Advantages of parametric methods?
- Interpretability
- Faster learning
- Less data needed
62
Examples of parametric methods
- Logistic Regression
- Linear Regression
- Naive Bayes
63
Examples of non-parametric methods
- K nearest neighbors
- Decision trees
64
What does CART stand for?
Classification and Regression Trees
65
List the 6 steps of an analytics project
  1. Understand the business problem and gain domain knowledge.
  2. Explore the data.
  3. Clean and prepare the data.
  4. Create and tune the model.
  5. Validate the model with a train/test split.
  6. Deploy the model.
66
Give an example of using test-control group in marketing.
We want to measure the effectiveness of a particular marketing campaign in terms of revenue generated. Select a target group of customers and split them randomly into a test group and a control group. Expose the test group to the marketing campaign and leave the control group untouched. Compare revenue between the two groups to estimate the additional revenue generated by the campaign.
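The split-and-compare procedure can be sketched as follows. The function names, the fixed seed, the 50/50 split, and the revenue figures are all assumptions for illustration, and a real analysis would also test whether the difference is statistically significant.

```python
import random


def split_test_control(customer_ids, seed=0):
    """Randomly split customers into two equally sized groups (test, control)."""
    shuffled = list(customer_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def incremental_revenue_per_customer(test_revenue, control_revenue):
    """Average revenue in the test group minus average revenue in the control group."""
    test_avg = sum(test_revenue) / len(test_revenue)
    control_avg = sum(control_revenue) / len(control_revenue)
    return test_avg - control_avg


test_ids, control_ids = split_test_control(range(10))

# Hypothetical revenue per customer observed after the campaign.
lift = incremental_revenue_per_customer([12, 15, 9, 14], [10, 11, 9, 10])
print(lift)  # 2.5
```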
67
Why is it important to use a test-control group in marketing?
Let's say you send out a marketing campaign to 100 customers and 20 of them make a purchase. Without a test-control group, it is difficult to gauge how many of those 20 customers made a purchase because of the campaign and how many were going to make a purchase anyway.
68
Can't you just compare revenue and purchase rates for a similar customer group with previous data when there was no campaign running?
There are numerous factors that can affect a customer's shopping behavior, and they change over time. That's why it's important to control for the time factor by using a test-control group.
69
What is A/B testing?
A testing method of comparing two versions to see which one performs better.
70
Descriptive vs inferential statistics
Descriptive: Summarizing the data you have, for example when you have data on the entire population.
Inferential: When you only have a sample and want to draw conclusions about the population.
71
What are two tree pruning approaches?
- Pre Pruning: Stopping the tree early
- Post Pruning: Removing sub trees after the tree has fully grown
72
What is the chi-square test of independence?
A test used to determine whether there is a significant relationship between two categorical variables.
73
What is Pearson's correlation?
It measures the strength and direction of the linear association between two continuous variables. It ranges from -1 to 1.
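Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below is a direct translation of that definition; the function name `pearson_r` and the toy data are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly decreasing line
```

The coefficient is positive when the variables rise together and negative when one falls as the other rises.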
74
What is ANOVA?
Analysis of variance, which is used to test for differences among two or more group means.
75
What is the 80-20 rule in business/stats?
A rule of thumb that 80% of outcomes can be attributed to 20% of all causes for a given event. e.g. 80% of revenue for a company comes from 20% of its customers
76
Ways to deal with unbalanced data?
- Collect more data
- Use confusion matrix instead of accuracy as the metric
- Resample data
- Generate synthetic samples
77
What metric to measure the performance of XGBoost?
For classification we use model accuracy and a confusion matrix. For regression we use RMSE or MAE to evaluate the performance.
78
What is MAE?
Mean absolute error. The average of the absolute values of the errors.
79
RMSE vs MAE
RMSE gives higher weight to large errors since the residuals are squared. MAE is more interpretable as it is the average absolute error.
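A small comparison makes the difference concrete. In the sketch below, two sets of predictions have the same MAE, but the one with a single large miss has a higher RMSE. The function names and numbers are made up for illustration.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: the average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of the average squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [10, 10, 10, 10]
even_errors = [9, 11, 9, 11]     # every prediction is off by 1
one_big_miss = [10, 10, 10, 14]  # three exact predictions, one off by 4

print(mae(actual, even_errors), rmse(actual, even_errors))    # 1.0 1.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 1.0 2.0
```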
81
Where was 84.51 created from?
It was originally dunnhumbyUSA, which was purchased by Kroger and renamed 84.51.
82
What's exciting about working for 84.51 as a data scientist?
- Lots and lots of data.
- Over 35 terabytes of data a week
- Over 62 million households captured (around 50% of the US household population)
- Multiple data sources
83
What are some of the other data sources 84.51 uses?
- Research surveys
- Geospatial data
- Weather data
- Weblog traffic
84
Talk about some of 84.51 partnerships
- Partnered with IRI in 2016 to be able to evaluate business performance relative to the rest of the market
- Purchased Market6 in 2016 to gain better insights on product movement
85
When was 84.51 created?
April 2015
86
Why is marketing for CPG companies exciting?
- CPG companies are some of the leading media spenders in all of advertising.
- About 31 of the top 100 advertisers were CPG companies.
87
What makes one Kaggle competitor's XGBoost model better than others?
A model is only as good as the data fed into it. Data preparation is the differentiating factor.
88
Name some sales uplift KPIs.
- Dollar sales
- Dollars per household
- Dollars per one thousand exposures
89
What is ROAS?
Return on ad spend
90
Name some marketing mediums
- In-store display
- Digital coupons
- Emails
- Paper ads
- TV ads
- Mobile ads
91
What is geofencing technology?
Technology that can be used to send targeted ads to consumers when they are near a store. Shown to be highly effective, and in the case study it had the highest uplift of any digital channel.
92
What is ANCOVA?
Analysis of covariance. It is a form of ANOVA used when covariates are present.
93
Why use ANCOVA?
It is useful when there may be confounding variables present that may bias the results. ANCOVA allows you to account for these confounding variables.
94
Examples of logistic regression in marketing
- Did the consumer buy the product more often after the ad campaign?
- Was my product or a competitor's product bought?
- How did the exposure to marketing affect the likelihood of purchasing a product?
95
Examples of linear regression in marketing
- How much product was purchased?
- Did advertising impact how many items were placed in the cart?
96
Is more data always better?
Most of the time, but not necessarily. More data does not always mean more information; the quality of the data is very important as well.
97
Name some white papers you've read
Cross-channel measurement, which looked at which marketing mediums were the most effective individually and in conjunction. It used the ANCOVA approach to isolate and measure sales lift.