Machine Learning Flashcards

Question

What can you tell me about the normal distribution?

Answer 1

- Occurs naturally in many situations - Can apply the empirical rule - Symmetric curve - Standard normal model has mean of 0 and sd of 1
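
The empirical rule can be checked numerically. This is a minimal sketch using the standard-library `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Probability mass within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Empirical rule: roughly 68%, 95%, and 99.7% within 1, 2, and 3 sd.
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```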

Answer 2

Interpolation is used to estimate data points within the range of your data. Extrapolation is used to estimate data points beyond the range of your data.
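
A small sketch of the distinction, assuming a straight-line estimate through two known points. The function names `linear_estimate` and `is_interpolation` are hypothetical helpers, not part of any library.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def is_interpolation(observed_xs, x):
    """True if x lies inside the observed range, False if it lies beyond it."""
    return min(observed_xs) <= x <= max(observed_xs)


observed_xs = [0.0, 10.0]
inside = linear_estimate(0.0, 0.0, 10.0, 20.0, 5.0)    # interpolation
outside = linear_estimate(0.0, 0.0, 10.0, 20.0, 15.0)  # extrapolation
```

The same formula produces both estimates; the extrapolated one is less trustworthy because no observed data supports the line beyond x = 10.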

Answer 3

A type of unsupervised learning method in which the objective is to group the data into K groups. This can provide labels for the K clusters.
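
A minimal sketch of the idea, assuming the card refers to K-means and using Lloyd's algorithm in plain Python. The toy points and the choice of starting centroids are assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data with two visible groups (K = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The returned `labels` are the cluster labels the card mentions: one integer from 0 to K - 1 per data point.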

Answer 4

- Prior business knowledge may suggest a particular K. | - Plot the average distance from the centroid and choose the elbow point.

Answer 5

Increasing the number of predictors does not always improve the model as the dimensions get larger. For example in KNN when the dimensions are large, there may not be enough neighboring points to make predictions.
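
The shortage of neighbors can be illustrated with a small simulation. This is a sketch under assumed settings (uniform points in the unit hypercube, a fixed radius of 0.5, a fixed seed); the function name `share_near_center` is hypothetical.

```python
import math
import random


def share_near_center(dim, radius=0.5, n_points=2000, seed=0):
    """Share of uniform points in the unit hypercube within `radius` of its center."""
    rng = random.Random(seed)
    center = [0.5] * dim
    near = 0
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        if math.dist(point, center) <= radius:
            near += 1
    return near / n_points


# The same radius captures fewer and fewer points as dimensions are added.
for dim in (1, 2, 5, 10):
    print(dim, share_near_center(dim))
```

With the radius held fixed, nearly every point counts as a neighbor in one dimension, while in ten dimensions almost none do, which is why KNN struggles as predictors are added.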

Answer 6

- Variable selection | - Dimension reduction techniques such as PCA.

Answer 7

It's a tradeoff between simpler and more complex models. Simpler models have high bias and low variance, whereas complex models have low bias and high variance. The goal is to balance bias and variance to minimize total error. High variance results from overfitting. High bias results from underfitting, where the model is too simple to capture the underlying pattern.

Answer 8

Training loss and regularization.

Answer 9

A function used to measure the performance of a model given a set of parameters.

Answer 10

Trees are grown sequentially in an attempt to reduce the misclassification rate. Each new tree is grown by giving a higher weight to the points misclassified by the previous tree.
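
The reweighting step can be sketched as follows, assuming the AdaBoost form of boosting (the card does not name a specific algorithm). The function name `adaboost_reweight` and the five-point example are illustrative assumptions.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style update: up-weight the points the last tree got wrong.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = 0.5 * math.log((1.0 - error) / error)  # voting weight of the last tree
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Five equally weighted points; the previous tree misclassified the third.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
new_weights, alpha = adaboost_reweight(weights, misclassified)
```

After the update, the misclassified point carries half of the total weight, so the next tree is pushed to classify it correctly.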

Answer 11

- Proven winner in Kaggle competitions - Allows regularization - Handles missing values - Built-in cross-validation

Answer 12

Residuals need to - Be independent - Be normally distributed - Have equal variance

Answer 13

P(k successes in n trials) = (n choose k) * p^k * (1-p)^(n-k)
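
A short sketch of the binomial probability, where the failure term (1 - p) is raised to the number of failures, n - k. The function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Example: exactly 3 heads in 10 fair coin flips.
three_heads = binomial_pmf(10, 3, 0.5)

# The probabilities over all possible k sum to 1.
total = sum(binomial_pmf(10, k, 0.5) for k in range(11))
```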

Answer 14

L1 is Lasso regression. L2 is ridge regression. L1 encourages sparsity, which can aid in variable selection.
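
Why the L1 (Lasso) penalty produces exact zeros while the L2 (ridge) penalty only shrinks can be seen in the one-coefficient case. This is a sketch under a simplifying assumption (a single coefficient with unpenalized estimate `z`); the function names are hypothetical.

```python
def lasso_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)


for z in (0.3, 2.0):
    print(z, lasso_shrink(z, 0.5), ridge_shrink(z, 0.5))
```

A small coefficient such as 0.3 is dropped entirely by the L1 rule but only scaled down by the L2 rule, which is the sparsity effect used for variable selection.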

Answer 15

K means is an unsupervised method for clustering. KNN is a supervised method for classification.

Answer 16

It is the longitude coordinate of the Cincinnati office.

Answer 17

Stuart Aitken

Answer 18

Limitless Minds, Fearless Hearts, and Relentless Delivery.

Answer 19

We fall in love with the problem and answer it with open minds and unbridled creativity.

Answer 20

We are all empowered to say and do what is right.

Answer 21

We love what we do and we see it through.

Answer 22

Donlen Scorecard - Created a template in Tableau - Use parameters in Tableau to automatically query data from SQL based on selected parameters

Answer 23

Structured data has organization and can be searched using simple algorithms. Unstructured data is essentially the opposite. It has no identifiable internal structure.

Answer 24

It is the science of collecting and analyzing data for the purpose of producing meaningful insights.

Answer 25

It's a measure of whether the results of a test were due to chance. The higher the significance, the lower the odds of it happening by chance.

Answer 26

In simple terms, it is the probability of observing results at least as extreme as those seen, assuming the null hypothesis is true. Hence the lower the p-value, the stronger the evidence for rejecting the null.
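
As a concrete sketch, a two-sided p-value can be computed from a z statistic, assuming the test statistic follows a standard normal distribution under the null. The function name `two_sided_p_value` is illustrative.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Probability of a statistic at least as extreme as z under the null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A larger |z| gives a smaller p-value, meaning stronger evidence against the null.
for z in (0.5, 1.96, 3.0):
    print(z, round(two_sided_p_value(z), 4))
```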

Answer 27

A test used to compare two population means

Answer 28

A statistical measure of the relationship between two variables.

Answer 29

- Check with the business owner and understand the objectives to properly categorize the data. - Use iterative approach of pulling in new data and validating the model for accuracy. - Solicit feedback from stakeholders.

Answer 30

The process of collecting data from a population to form a sample allowing you to make some inferences about the population as a whole.

Answer 31

- SRS - Clustering - Systematic - Stratified

Answer 32

A statistical learning method used to predict a binary outcome using the logit function.
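
A minimal sketch of the logit link and its inverse, assuming the card refers to logistic regression. The weights and feature values below are made-up numbers, and `predict_probability` is a hypothetical helper.

```python
import math


def sigmoid(z: float) -> float:
    """Inverse of the logit: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_probability(features, weights, bias):
    """P(outcome = 1) for one observation under a fitted linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients, for illustration only.
probability = predict_probability([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

The model is linear in the log-odds: applying `logit` to the predicted probability recovers the linear score.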

Answer 33

Used it in predicting the odds of a driver getting into an accident.

Answer 34

- Visually with a box plot or scatter plot - Normalizing the data and calculating z-scores - Use proximity based models for non-parametric problems
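
The z-score approach can be sketched as below. The threshold of 3 standard deviations is a common convention assumed here, and the sample values are invented for illustration.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample sd from the mean."""
    mu = mean(values)
    sd = stdev(values)
    return [v for v in values if abs(v - mu) / sd > threshold]


# Made-up measurements with one obvious outlier.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 50]
outliers = z_score_outliers(data)
```

Because the outlier itself inflates the mean and standard deviation, this check works best with enough observations; on very small samples no point can reach a z-score of 3.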

Answer 35

Parametric: We make an assumption about the functional form of f in f(x). Non-Parametric: No explicit assumption is made about the functional form of f.

Answer 36

The true function of the data may not follow a specific functional form. This allows f to be determined without restrictions.

Answer 37

- Interpretability - Faster learning - Less data needed

Answer 38

Logistic Regression | Linear Regression | Naive Bayes

Answer 39

K nearest neighbors | Decision trees

Answer 40

Classification and Regression Trees

Answer 41

1. Understand the business problem and gain domain knowledge. 2. Explore the data. 3. Clean and prepare the data. 4. Create and tune the model. 5. Validate the model with a train/test split. 6. Deploy the model.

Answer 42

We want to measure the effectiveness of a particular marketing campaign in terms of revenue generated. Select a target group of customers and split them randomly into a test and control group. Expose the test group to the marketing campaign and leave the control group untouched. Compare the additional revenue generated by the campaign.
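
The comparison in the last step can be sketched as simple arithmetic. All figures below are hypothetical, and the function name `incremental_revenue` is an illustrative choice.

```python
def incremental_revenue(test_revenue, test_size, control_revenue, control_size):
    """Revenue attributable to the campaign, scaled to the test group size."""
    test_per_customer = test_revenue / test_size
    control_per_customer = control_revenue / control_size
    lift_per_customer = test_per_customer - control_per_customer
    return lift_per_customer * test_size


# Hypothetical campaign: 1,000 exposed customers, 500 unexposed customers.
extra = incremental_revenue(
    test_revenue=12_000.0, test_size=1_000,
    control_revenue=5_000.0, control_size=500,
)
```

Comparing revenue per customer rather than raw totals keeps the estimate fair when the two groups differ in size.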

Answer 43

Let's say you send out a marketing campaign to 100 customers and, out of those, 20 make a purchase. Without a test-control group it is difficult to gauge how many of those 20 customers made a purchase because of the campaign and how many were going to make a purchase anyway.

Answer 44

There are numerous factors that can affect a customer's shopping behavior. That's why it's important to control for the time factor by using a test-control group.

Answer 45

A testing method of comparing two versions to see which one performs better.
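
One common way to decide which version performs better is a two-proportion z-test on conversion rates. The sketch below assumes that approach (the card does not specify one); the conversion counts are invented and `ab_test` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (z statistic, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up results: version A converts 200 of 1,000 visitors, version B 250 of 1,000.
z, p_value = ab_test(200, 1000, 250, 1000)
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone.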

Answer 46

Descriptive: When you have data on the entire population. Inferential: When you only have a sample and want to describe the population.

Answer 47

Pre Pruning: Stopping the tree early | Post Pruning: Removing sub trees after tree has fully grown

Answer 48

A test used to determine if there is a significant relationship between two variables.

Answer 49

It tests the strength of the association between two continuous variables, measured between 0 and 1.
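
A sketch of one such measure, assuming the card refers to the Pearson correlation coefficient. The coefficient itself ranges from -1 to 1: its sign gives the direction of the relationship and its absolute value gives the strength. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfect positive association
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfect negative association
```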

Answer 50

Analysis of variance, which is used to test the difference between two or more means.

Answer 51

A rule of thumb that 80% of outcomes can be attributed to 20% of all causes for a given event. e.g. 80% of revenue for a company comes from 20% of its customers

Answer 52

- Collect more data - Use confusion matrix instead of accuracy as the metric - Resample data - Generate synthetic samples
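
Why accuracy alone misleads on imbalanced data can be shown with a confusion matrix. This is a sketch with invented labels (5 positives out of 100) and a naive model that always predicts the majority class; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted):
    """Return (true pos, false pos, false neg, true neg) for 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


# Imbalanced toy data: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100  # model that always predicts the majority class

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
```

The model scores 95% accuracy while finding none of the positive cases, which only the confusion matrix (and the recall derived from it) reveals.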

Answer 53

For classification we use model accuracy and a confusion matrix. For regression we use RMSE or MAE to evaluate the performance.

Answer 54

Mean absolute error. The average of the absolute values of the errors.

Answer 55

RMSE gives higher weight to large errors since the residuals are squared. MAE is more interpretable as it is the average error
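
The difference can be seen on two sets of predictions with the same total absolute error. A minimal sketch; the numbers are invented for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [0.0, 0.0, 0.0, 0.0]
evenly_off = [1.0, 1.0, 1.0, 1.0]   # four small errors
one_big_miss = [0.0, 0.0, 0.0, 4.0]  # one large error

print(mae(actual, evenly_off), rmse(actual, evenly_off))
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))
```

Both prediction sets have the same MAE, but the single large miss doubles the RMSE because the error is squared before averaging.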

Answer 56

RMSE gives higher weight to large errors since the residuals are squared. MAE is more interpretable as it is the average error

Answer 57

Originally dunnhumbyUSA, it was then purchased by Kroger and renamed 84.51.

Answer 58

- Lots and lots of data. - Over 35 terabytes of data a week - Over 62 million households captured (around 50% of the US household population) - Multiple data sources

Answer 59

- research surveys - geospatial data - weather data - weblog traffic

Answer 60

- Partnered with IRI in 2016 to be able to evaluate business performance relative to the rest of the market - Purchased Market6 in 2016 to gain better insights on product movement

Answer 61

April 2015

Answer 62

- CPG companies are some of the leading media spenders in all of advertising. - About 31 of the top 100 advertisers were CPG companies.

Answer 63

A model is only as good as the data fed into it. Data preparation is the differentiating factor.

Answer 64

- Dollar sales - Dollars per household - Dollars per one thousand exposures

Answer 65

Return on ad spend

Answer 66

- In-store display - Digital coupons - Emails - Paper ads - TV ads - Mobile ads

Answer 67

Tech that can be used to send targeted ads to consumers when they were near a store. Shown to be highly effective and in the case study had the highest uplift of any digital channel.

Answer 68

Analysis of covariance. It is a form of ANOVA used when covariates are present.

Answer 69

It is useful when there may be confounding variables present that may bias the results. ANCOVA allows you to account for these confounding variables.

Answer 70

- Did the consumer buy the product more often after the ad campaign? - Was my product or a competitor's product bought? - How did exposure to marketing affect the likelihood of purchasing a product?

Answer 71

- How much product was purchased | - Did advertising impact how many items were placed in the cart

Answer 72

- Most of the time, but not necessarily. More data does not always mean more information. The quality of the data is very important as well.

Answer 73

- Cross-channel measurement, which looked at which marketing mediums were the most effective individually and in conjunction. Used the ANCOVA approach to isolate and measure sales lift.

(97 cards)