Machine Learning Flashcards

1
Q

Precision/Recall Tradeoff

A

Firstly, both precision and recall are terms used within a classification context in which a model predicts a particular response (in this case, a binary one, such as has disease / does not have disease). Thus, the possible outcomes are true positive, false positive, true negative, and false negative, with the first part of the term referring to whether the model prediction was correct (true) or incorrect (false) and the second part referring to the prediction output by the model (positive or negative). In a false positive, for example, the model predicts a positive but is incorrect in doing so (it predicted the presence of a disease when in reality there was no disease).

Precision is the number of true positives divided by the sum of true positives and false positives, i.e. TP / (TP + FP). It is the proportion of the positive classifier predictions that were, in fact, positive. Maximizing this is important since we want our model to be correct when predicting the positive class.

Recall is the number of true positives divided by the sum of true positives and false negatives, i.e. TP / (TP + FN). Therefore, it is the percentage of total positive cases captured. Recall is important to maximize since we want our model to capture all of the positive cases in the relevant universe.

Both metrics are relevant in evaluating a classification algorithm, but they must be weighed carefully: improving one typically comes at the expense of the other, so maximizing both at once is difficult. How the tradeoff is resolved depends on which type of error is more costly in the classification problem at hand.

For example, in the prior example, by not identifying a life-threatening disease in a patient (a false negative), the patient’s life is being put at risk. Likewise, with a false positive (deeming someone as having the disease when they do not), the patient may undergo a treatment with unpleasant, harmful side effects. Since these outcomes carry different costs, a tradeoff between precision and recall needs to be considered.
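
To make the two formulas concrete, here is a minimal Python sketch; the confusion-matrix counts are made up purely for illustration.

```python
# Minimal sketch: precision and recall from raw confusion-matrix counts.
# The counts below are made up purely for illustration.
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)  # proportion of positive predictions that were correct
recall = tp / (tp + fn)     # proportion of actual positives that were captured

print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.89
```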

2
Q

At Uber, one important attribute for every ride booked is the MSA (Metropolitan Statistical Area) in which the ride originated. An MSA encompasses one or more economically tied cities and their surrounding areas, and can even extend beyond state lines.

For example, the Washington-Baltimore-Arlington MSA covers DC and parts of Maryland and Virginia.

As opposed to using city and state names, MSAs are a good feature for a variety of models at Uber because rider and driver market conditions tend to be similar within an MSA.

However, in the US there are hundreds of MSAs, and internationally there are thousands of MSAs (and their country-specific equivalents).

How would you encode this MSA variable when building a Machine Learning model?

A

If you were building a regression model, target encoding is one option. You’d replace each MSA with the mean of the target variable for that category.

For example, if you were building a model to predict the average number of minutes a rider has to wait to be picked up, you would encode the NYC-Jersey MSA as 6.2 minutes and the DC-Baltimore-Arlington MSA as 9.8 minutes.

If this were a binary classification problem – like trying to predict whether a ride request will be cancelled before the rider is picked up – you could encode the column using the conditional probability that the rider cancels, given the MSA.

For example, the NYC-Jersey MSA might be encoded as 0.3, but the DC-Baltimore-Arlington MSA would be encoded as 0.15.

One-hot encoding could be another option, because it converts a categorical value into numerical values.

You would take the MSA column, convert it into thousands of new binary columns (one per MSA), and assign a 1 or 0 to each. Note that with thousands of MSAs this produces a very wide, sparse feature matrix.
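
A minimal pandas sketch of both encodings, using a made-up toy dataset (column names and values are hypothetical, not Uber’s actual schema):

```python
import pandas as pd

# Toy ride-level data; column names and values are hypothetical.
rides = pd.DataFrame({
    "msa": ["NYC-Jersey", "NYC-Jersey", "DC-Baltimore-Arlington", "DC-Baltimore-Arlington"],
    "wait_minutes": [5.9, 6.5, 9.5, 10.1],
    "cancelled": [0, 1, 0, 0],
})

# Target encoding for a regression target: replace each MSA with the mean wait time.
wait_means = rides.groupby("msa")["wait_minutes"].mean()
rides["msa_target_enc"] = rides["msa"].map(wait_means)

# Target encoding for a binary target: conditional cancellation probability per MSA.
cancel_rates = rides.groupby("msa")["cancelled"].mean()
rides["msa_cancel_rate"] = rides["msa"].map(cancel_rates)

# One-hot encoding: one binary column per MSA (wide and sparse with thousands of MSAs).
one_hot = pd.get_dummies(rides["msa"], prefix="msa")
```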

3
Q

Assumptions of Linear Regression

A

The main assumptions underlying linear regression are the following:

a) Linearity: The relationship between the feature set X and the target variable Y is linear.

b) Homoscedasticity: The variance of the residuals is constant.

c) Independence: All observations are independent of one another.

d) Normality: The residuals are normally distributed (equivalently, the distribution of Y conditional on X is Normal).

With respect to independence and normality, the term “i.i.d.” (independent and identically distributed) is commonly used for the errors. If any of these assumptions is violated, forecasts or confidence intervals based on the model will most likely be misleading or biased, and the regression is likely to perform poorly out of sample as a result.

4
Q

Bias-Variance Tradeoff

A

The bias-variance tradeoff is expressed as the following: Total expected error = Bias² + Variance + Irreducible error.

Flexible models tend to have low bias and high variance, whereas more rigid models have high bias and low variance. The bias term comes from the error that occurs when a model underfits data. Having a high bias means that the machine learning model is too simple and may not adequately capture the relationship between the features and the target. An example would be using linear regression when the underlying relationship is nonlinear.

The variance term represents the error that occurs when a model overfits data. Having a high variance means that a model is susceptible to changes in training data, meaning that it is capturing and so reacting to too much noise. An example would be using a very complex neural network when the true underlying relationship between the features and the target is simply a linear one.

The irreducible term is the error that cannot be addressed directly by the model, such as noise in data measurement.

When creating a machine learning model, we want both bias and variance to be low, because we want to be able to have a model that predicts well but that also doesn’t change much when it is fed new data. The key point here is to prevent overfitting and, at the same time, to attempt to retain sufficient accuracy.
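
A quick scikit-learn sketch of the tradeoff using polynomial regression on synthetic data (the degrees and noise level are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # nonlinear truth plus noise

# Degree 1 underfits (high bias); a very high degree tends to overfit (high variance);
# a moderate degree usually balances the two with the lowest cross-validated error.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, round(mse, 3))
```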

5
Q

Say you need to employ a binary classifier for the purpose of fraud detection. What metrics would you look at, how is each defined, and what is the interpretation of each?

A

Some main metrics to compute for a binary classifier are precision, recall, and the AUC (Area Under the Curve) of the ROC curve. Let TP = true positive, FP = false positive, TN = true negative, and FN = false negative.

Then, precision is defined as TP/(TP + FP). It answers the question, “What percent of fraudulent predictions were correctly identified?”, and is important to maximize since you want your classifier to be as correct as possible in identifying a transaction as fraudulent.

Recall, on the other hand, is defined by TP/(TP + FN). It answers the question, “What percent of fraudulent cases were caught?”, and it is important to maximize since you want your classifier to have caught as many fraudulent cases as possible.

AUC of the ROC curve is the area under the ROC curve, which is constructed by plotting the true positive rate TP/(TP + FN) against the false positive rate FP/(FP + TN) for various classification thresholds (the threshold determines whether a data point is labeled fraudulent or not fraudulent). The AUC is a measure of the model’s ability to separate the classes: the closer it is to 1, the better the separation. It answers the question, “Is my classifier able to discriminate between fraud and not-fraud effectively?”, and is important to maximize since you want your classifier to separate the two classes properly.
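
Using scikit-learn, these metrics can be computed directly; a minimal sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical labels (1 = fraud) and model scores for a handful of transactions.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.05, 0.9, 0.2, 0.6, 0.7, 0.15])
y_pred = (y_score >= 0.5).astype(int)  # threshold chosen arbitrarily

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))  # threshold-free
```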

6
Q

Describe why overfitting is a major concern for neural networks. What are some ways overfitting can be addressed?

A

Neural networks are prone to overfitting because, with their many interacting parts (various activation functions, possibly many hidden layers, etc.), they can approximate essentially any function; they are therefore, in general, high-variance, low-bias models.

Overfitting can be a concern for any ML model because it will most likely lead to poor out-of-sample predictions. With respect to neural networks, this issue can be even more serious because they tend to adapt very well to the training data, since they are capable of approximating any arbitrarily complex function. To overcome this problem, we can do the following:

a) Reduce model complexity by changing the network structure and/or its parameters.

b) Train the network on more data, thereby making the model better able to generalize.

c) Use regularization methods, such as adding a penalty to the loss function in order to “force” a decrease in the feature weight values, using dropout (i.e., stochastically dropping units during training), or adding noise to the inputs during training.

In addition, it helps to normalize the input values and to apply batch normalization so that the weights do not vary drastically during training.
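
A minimal PyTorch sketch of two of these techniques, dropout and an L2 weight penalty via weight decay (layer sizes and hyperparameters are arbitrary illustrations):

```python
import torch
import torch.nn as nn

# Small network using dropout, plus L2 regularization via the optimizer's weight_decay.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly zero out activations during training
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
loss_fn = nn.BCEWithLogitsLoss()

# One hypothetical training step on random data, just to show the mechanics.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32, 1)).float()
model.train()                 # enables dropout
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```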

7
Q

Compare and contrast gradient boosting and random forests.

A

In both gradient boosting and random forests, an ensemble of decision trees is used. Additionally, both are flexible models and don’t need much data pre-processing.

However, there are two main differences. The first is that, in gradient boosting, trees are built one at a time, such that successive weak learners learn from the mistakes of preceding weak learners. In random forests, the trees are built independently of one another (and so can be trained in parallel).

The second difference is in the output: Gradient boosting combines the results of the weak learners with each successive iteration, whereas, in random forests, the trees are combined at the end (through either averaging or a majority vote).

Because of these structural differences, gradient boosting is often more prone to overfitting than random forests, due to its focus on mistakes over training iterations and the lack of independence in tree building. Additionally, gradient boosting hyperparameters are harder to tune than those of random forests.

Finally, gradient boosting may take longer to train than random forests because its trees are built sequentially. In real-life applications, gradient boosting generally excels on imbalanced datasets (fraud detection, for example), whereas random forests generally excel at multi-class object detection with noisy data (computer vision, for example).
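
A minimal scikit-learn sketch contrasting the two ensembles on synthetic data (hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random forest: trees built independently, combined at the end by (soft) majority vote.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Gradient boosting: trees built sequentially, each one fitting the current ensemble's errors.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```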

8
Q

Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?

A

Random forests are used because individual decision trees are usually prone to overfitting. A random forest builds multiple decision trees and then averages (or takes a majority vote over) their predictions, and it can be used for either classification or regression. There are a few main ways in which random forests allow for stronger out-of-sample prediction than individual decision trees.

a) As in other ensemble models, building a large set of trees, each on a resample of the data (bootstrap aggregation), leads to a model yielding more consistent results. More specifically, and in contrast to a single decision tree, it introduces diversity in the training data seen by each tree and so contributes to better results in terms of the bias-variance tradeoff (particularly with respect to variance).

b) Using only m < p features at each split helps to de-correlate the decision trees, thereby preventing very important features from always appearing at the first splits of the trees (which would happen in standalone trees due to the nature of information gain).

c) They’re fairly easy to implement and fast to run.

d) They can produce very interpretable feature-importance values, thereby improving model understandability and feature selection.

Points a) and b) are the main keys to improving upon single decision trees.

9
Q

Describe the idea behind boosting. Give an example of one boosting method, and describe one advantage and one disadvantage of this method.

A

Boosting is a type of ensemble model that trains a sequence of “weak” models (such as small decision trees), where each one sequentially compensates for the weaknesses of the preceding models. Such weaknesses can be measured by the current model’s error rate and can be used to weight/define on which observations the next models should focus. Each training point within a dataset is assigned a particular weight and then is continually re-weighted in an iterative fashion; in other words, points that are predicted incorrectly take on higher weights in each iteration. In this way, more emphasis is placed on and more training involves points that are harder to predict.

One popular example is AdaBoost (adaptive boosting), which is used for classification. This method sequentially combines decision trees with a single split and works as follows:

Weights are uniformly set for all data points.

For M classifiers, perform the following steps:

a) Construct a classifier for which the data are weighted by the previous steps’ weights

b) The weighted error for that classifier is computed.

c) Data points are re-weighted according to whether each was classified correctly or incorrectly by the classifier.

After M iterations, we can sum the weighted predictions of each classifier to obtain a final prediction.

One advantage of boosting is that it is powerful and flexible, since it can accommodate a variety of loss functions and can handle data in various forms (including imputation). On the flip side, its disadvantages are that the training process can over-emphasize outliers and noise and hence cause overfitting. In addition, the whole process can be computationally intensive.
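
A minimal scikit-learn sketch of AdaBoost on synthetic data (scikit-learn’s default weak learner is a depth-1 decision tree, i.e., a stump, matching the description above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# AdaBoost: sequentially re-weights observations; the default base learner is a decision stump.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
print(round(cross_val_score(ada, X, y, cv=5).mean(), 3))
```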

10
Q

Say you are running a simple logistic regression to solve a problem but find the results to be unsatisfactory.

What are some ways you might improve your model and which other models might you consider using instead?

A

There are several ways to improve the performance of a logistic regression:

a) Normalizing features: The features should be normalized such that particular weights do not dominate within the model.

b) Adding more features: Depending on the problem, it may simply be the case that there aren’t enough useful features. In general, logistic regression is high bias, so adding more features should be helpful.

c) Addressing outliers: Identify and decide whether to retain or remove them.

d) Selecting variables: See if any features have introduced too much noise into the process.

e) Cross-validation and hyperparameter tuning: Using k-fold cross-validation along with hyperparameter tuning (for example, introducing a penalty term for regularization purposes) should help improve the model.

f) The classes may not be linearly separable (logistic regression produces linear decision boundaries), and, therefore, it would be worth looking into SVMs, tree-based approaches, or neural networks instead.
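
A minimal sketch combining points a) and e) above, feature normalization plus k-fold cross-validated tuning of the regularization strength, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Normalize features, then tune the regularization strength C with 5-fold cross-validation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```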

11
Q

Say you just ran a linear regression model. What diagnostics would you use to assess the validity of your results?

A

For linear regression, simply looking at R^2 can be misleading. It is important to check the following diagnostics, which can easily be assessed visually.

a) Residual values vs. fitted values: We can look at a scatterplot of residual values versus fitted values. If there are nonlinear patterns here, then they indicate nonlinear relationships within the data. In addition, if the variance of the residuals is not constant, this indicates the presence of heteroscedasticity.

b) QQ-plot: Plotting quantiles of standardized residuals versus theoretical quantiles. If the residuals are normally distributed, the points should lie close to the straight reference line. If the QQ-plot indicates that the residuals are not normally distributed, then a transformation (for example, a log transformation of the response) may be necessary.

c) Cook’s distance: Cook’s distance and leverage can identify highly influential data points (outliers). Conceptually, it calculates the regression model with and without any given data point and observes how the regression model changes. If highly influential data points are present, then it is worth considering whether to remove or cap those particular values.

d) Scale-location plot: Plots the square root of the absolute standardized residuals versus the fitted values; a horizontal line with equally spread points indicates homoscedasticity, otherwise heteroscedasticity is present.

When performing these diagnostics, identifying what may have gone wrong is important. For example, look out for nonlinearity in the relationship being modeled, bias due to omitted variables, poor sampling of observations in the distribution’s tails, and so on.
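
A minimal sketch producing these diagnostics with statsmodels and matplotlib on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
resid, fitted = model.resid, model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(fitted, resid)                                  # a) residuals vs. fitted
sm.qqplot(resid, line="45", fit=True, ax=axes[1])               # b) QQ-plot
axes[2].scatter(fitted, np.sqrt(np.abs(resid / resid.std())))   # d) scale-location
print(model.get_influence().cooks_distance[0][:5])              # c) Cook's distance (first few points)
plt.show()
```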

12
Q

Motivation for cross-validation

A

Cross-validation is a technique used to assess the performance of an algorithm on several resamples/subsamples of the training data. It consists of running the algorithm on subsets of the training data (the original data minus some of the observations) and evaluating model performance on the portion of the data that was not present in the subset.

This process is repeated many times for the different subsamples, and the results are combined at the end. This step is very important in ML because it reveals the quality and consistency of the model’s true performance.

The process is as follows:

1. Randomly shuffle the data into k equally-sized blocks (folds).

2. For each i in 1…k, train the model on all the data except fold i, and evaluate the validation error using fold i.

3. Average the k validation errors from step 2 to get an estimate of the true error.

This process aids in accomplishing the following: 1) Avoiding training and testing on the same subsets of data points, which would lead to overfitting, and 2) Avoiding using a dedicated validation set, with which no training can be done. The second of these points is particularly important in cases where very little training data is available or the data collection process is expensive. One drawback of this process, however, is that roughly k times more computation is needed than when using a dedicated hold-out validation set. In practice, cross-validation works very well for smaller datasets.
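
A minimal scikit-learn sketch of k-fold cross-validation on synthetic regression data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# 5-fold CV: train on 4 folds, validate on the held-out fold, then average the errors.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_mean_squared_error")
print("estimated MSE:", -scores.mean())
```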

13
Q

Generative and Discriminative Models

A

Both generative and discriminative models are used in the context of classification in assigning a given data point to one of K possible classes. Generative models look at the joint probability distribution between X and Y. For any given input, we want to classify an arbitrary data point x with the following class label:

\hat{y} = \argmax_{k} p(x, Y = k)
The joint distribution between X and Y is given by:

p(X, Y) = p(Y|X) p(X)
and for each class k we have

p_k(X) = p(X|k) p(k)
such that

\hat{y} = \argmax_{k} p(Y = k|x)
This type of model is concerned with how the feature set is generated in order to produce classifications, which is why it models the joint distribution of the features and the target. We call it “generative” because it can also be used to generate new data points. A common example of a generative model is the naive Bayes algorithm.

Discriminative models directly learn a decision boundary by choosing a class that maximizes the following:

\hat{y} = \argmax_{k} p(Y = k|x)
Classification models are frequently discriminative models, since they’re concerned with finding the probability of each class given a known feature set. Examples of such models are logistic regression and neural networks.

14
Q

Information Gain and Entropy

A

Because information gain is based on entropy, we’ll discuss entropy first.

The formula for entropy is

Entropy = H(Y) = -\sum_k P(Y=k) \log_2 P(Y=k)

The equation above yields the amount of entropy present and shows exactly how homogeneous a sample is (based on the attribute being split). Consider a case where k = 2. Let a and b be the two outputs/labels that we are trying to classify. Given these values, the formula considers the proportion of values in the sample that are a and the proportion that are b, with the sample being split on a different attribute.

A completely homogeneous sample will have an entropy of 0. For instance, if a sample contains only a’s and no b’s, its entropy is (logs below are base 2):

Entropy = −1·log(1) − 0·log(0) = 0

whereas a completely split (50%–50%) sample would have an entropy of 1. A lower entropy means a more homogeneous sample.

Information gain is based on the decrease in entropy after splitting on an attribute.

IG(X_j, Y) = H(Y) - H(Y|X_j)

This concept is better explained with a simple numerical example. Consider the above case again with k = 2. Say there are 5 instances of value a and 5 instances of value b. Then, we decide to split on some attribute X. When X = 1, there are 5 a’s and 1 b, whereas when X = 0 there are 4 b’s and 0 a’s.

By splitting on X, we therefore have two subsets: X = 1, which has 5 a’s and 1 b, and X = 0, which has 4 b’s and 0 a’s.

Entropy(After) = (4/10)·Entropy(X=0) + (6/10)·Entropy(X=1)

The entropy value for X = 0 is 0, since that subset is homogeneous (all b’s, no a’s).

Entropy(X=0) = 0

Entropy(X=1) = −(1/6)·log_2(1/6) − (5/6)·log_2(5/6) ≈ 0.65
Plug these into the entropy(after) formula to obtain the following:

Entropy(After) = (4/10)·0 + (6/10)·0.65 = 0.39
Finally, we can go back to our original formula and obtain information gain:

IG = 1 − 0.39 = 0.61
This result makes intuitive sense since, ideally, we want to split on an attribute that separates the output perfectly. Therefore we ideally want to split on something that is homogeneous with regard to the output, and such a split would have an entropy equal to 0.
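
A small Python sketch reproducing the worked example above:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# The worked example: 5 a's and 5 b's overall, split by X into
# {X=1: 5 a's, 1 b} and {X=0: 4 b's, 0 a's}.
parent = ["a"] * 5 + ["b"] * 5
x1 = ["a"] * 5 + ["b"] * 1
x0 = ["b"] * 4

h_after = (len(x0) / 10) * entropy(x0) + (len(x1) / 10) * entropy(x1)
ig = entropy(parent) - h_after
print(round(entropy(parent), 2), round(h_after, 2), round(ig, 2))  # 1.0 0.39 0.61
```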

15
Q

Identifying synonyms

A

To find synonyms, we can first find word embeddings through a corpus of words. Word2vec is a popular algorithm for doing so. It produces vectors for words based on the words’ contexts. Vectors that are closer in Euclidean distance are meant to represent words that are also closer in context and meaning.

The generated word embeddings are the learned weight vectors themselves. The distance between these vectors can be used to measure similarity, for example via cosine similarity or some other similar measure.

Once we have these word embeddings, we can then run an algorithm such as K-means clustering to identify clusters within the word embeddings, or run a K-nearest-neighbors search around the particular word for which we want to find synonyms.

However, there are some edge cases since Word2Vec can produce similar vectors even in the case of antonyms; consider the words “hot” and “cold”, for example, which have opposite meanings but appear in many similar contexts (related to temperature or in a Katy Perry song).
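
A minimal gensim sketch of this workflow (the toy corpus is made up purely for illustration and far too small for meaningful embeddings; parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Tiny toy corpus purely for illustration; a real corpus would be far larger.
sentences = [
    ["the", "weather", "is", "hot", "today"],
    ["the", "weather", "is", "cold", "today"],
    ["a", "warm", "and", "sunny", "afternoon"],
    ["a", "chilly", "and", "cloudy", "afternoon"],
]

# Train word2vec embeddings on the corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Nearest neighbors by cosine similarity serve as synonym candidates.
print(model.wv.most_similar("hot", topn=3))
```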

16
Q

Suppose we want to perform PCA on a cardiovascular health dataset with multiple features, but each feature has vastly different ranges.

For example, resting heart rate ranges from 20 to 250 beats-per-minute, but “level-of-physical-activity” is a score that ranges from 1 to 4 (where 1 represents “sedentary” and 4 represents “very active”).

Will PCA work on this dataset?

A

PCA will work, but we have to standardize our features first.

Larger ranges tend to mean larger variances. For example, the variance of resting heart rate might be 64 (in squared beats-per-minute), but the variance of the physical activity score might be just 1.5.

If some features have a large variance relative to other features, simply because of the scale they were measured on, these larger-variance features will tend to dominate the first principal component.

That’s why it’s crucial to standardize our features first!
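
A minimal scikit-learn sketch: standardize, then run PCA (the feature values are synthetic stand-ins for the dataset described above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical cardiovascular features on very different scales.
rng = np.random.default_rng(0)
heart_rate = rng.normal(70, 8, size=(200, 1))                 # beats-per-minute scale
activity = rng.integers(1, 5, size=(200, 1)).astype(float)    # 1-4 activity score
X = np.hstack([heart_rate, activity])

# Standardize (zero mean, unit variance) before PCA so no feature dominates.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)
```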

17
Q

What are L1 and L2 regularization? What are the differences between the two?

A

In machine learning, L1 and L2 penalization are both regularization methods that prevent overfitting by coercing the coefficients of a regression model towards zero. The difference between the two methods is the form of the penalization applied to the loss function.

For a regular regression model, assume the loss function is given by L. Using L1 regularization, the least absolute shrinkage and selection operator, or LASSO, adds the sum of the absolute values of the coefficients as a penalty term. Meanwhile, Ridge regression uses L2 regularization, that is, adding the sum of the squared magnitudes of the coefficients as the penalty term.

The loss functions for the two are thus the following:

\text{Loss}(L_1) = L + \lambda \sum_i |w_i|
\text{Loss}(L_2) = L + \lambda \sum_i w_i^2

If we run gradient descent on the weights w, L1 regularization forces any weight closer to 0, irrespective of its magnitude. Meanwhile, for L2 regularization, the rate at which a weight approaches 0 becomes slower as its value becomes smaller. Because of this, L1 is more likely to “zero out” particular weights and hence completely remove certain features from the model, leading to models of increased sparseness.
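
A scikit-learn sketch contrasting the two penalties (synthetic data; the regularization strengths are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few features are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 tends to zero out coefficients entirely; L2 only shrinks them.
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```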

18
Q

Stochastic Gradient Descent

A

The gradient descent algorithm takes small steps in the direction of steepest descent to optimize a particular objective function. Each step is proportional to the negative gradient of the function at the current value of the parameter being sought.

The stochastic version of the algorithm, SGD, uses an approximation of the gradient rather than the function’s actual full gradient. This estimate is obtained by using only one randomly selected sample at each step to evaluate the gradient, making this version of the algorithm much faster and more attractive for situations involving lots of data. SGD is also useful when redundancy is present in the data (i.e., observations that are very similar).

Assume a function f at some point x and at time t. Then, the gradient descent algorithm will update x as follows until it reaches convergence:

x^{t+1} = x^t - \alpha_t \nabla f(x^t)

that is, at each iteration we calculate the negative of the gradient of f, scale it by some constant (the step size α_t), and move in that direction.

Since many loss functions are decomposable into a sum of individual per-observation functions, the gradient step can be broken down into a sum of discrete, separate gradients. However, for very large datasets, this process can be computationally intensive, and the algorithm might become stuck at local minima or saddle points.

Therefore, we use the stochastic gradient descent algorithm to obtain an unbiased estimate of the true gradient without going through all data points by uniformly selecting a point at random and performing a gradient update then and there.

The estimate is therefore unbiased since we have the following:

\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)

Since the index i is selected uniformly at random, the expectation of the stochastic gradient g(x) = \nabla f_i(x) is:

E[g(x)] = \nabla f(x)

where g(x) denotes the stochastic gradient used by SGD; on average, each SGD step moves in the direction of the true gradient.
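
A minimal NumPy sketch of SGD for least-squares linear regression on synthetic data (learning rate and step count are arbitrary):

```python
import numpy as np

# SGD for least-squares linear regression on synthetic data.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

w = np.zeros(d)
lr = 0.01
for step in range(20000):
    i = rng.integers(n)                   # pick one sample uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5 * (x_i . w - y_i)^2
    w -= lr * grad_i                      # move against the stochastic gradient

print(np.round(w, 2))  # should be close to true_w
```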

19
Q

Changing ROC Curve

A

Recall that the ROC curve plots the true positive rate versus the false positive rate. If all scores are transformed by the same monotonically increasing function, then none of the actual classifications change (once the thresholds are adjusted accordingly), leading to the same true positive and false positive rates, since only the relative ordering of the scores matters.

Therefore, taking a square root would not cause any change to the ROC curve, because the relative ordering is maintained: if one application had a score of X and another a score of Y, and if Y > X, then sqrt(Y) > sqrt(X) still. Only the model thresholds would change.

In contrast, any function that is not monotonically increasing would change the ROC curve, since the relative ordering would not be maintained. Some simple examples are the following: f(x) = -x, f(x) = -x^2, or a stepwise function.
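
A quick scikit-learn check of this invariance on made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = rng.uniform(size=500) * 0.5 + y_true * 0.3  # non-negative scores loosely tied to labels

print(roc_auc_score(y_true, scores))            # baseline AUC
print(roc_auc_score(y_true, np.sqrt(scores)))   # identical: sqrt is monotonically increasing
print(roc_auc_score(y_true, -scores))           # changes: the ordering is reversed
```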

20
Q

Compare and contrast Gaussian Naive Bayes (GNB) and Logistic Regression. When would you use one over the other?

A

Both Gaussian Naive Bayes (GNB) and logistic regression can be used for classification. Each model has advantages and disadvantages, which determine which one to choose under what circumstances. These are discussed below, along with their similarities and differences:

Advantages:

GNB requires only a small number of observations to be adequately trained; it is also easy to use and reasonably fast to implement; interpretation of the results produced by GNB can also be highly useful.
Logistic regression has a simple interpretation in terms of class probabilities, and it allows inferences to be made about features (i.e., variables) and identification of the most relevant of these with respect to prediction.
Disadvantages:

By assuming the features (i.e., variables) to be conditionally independent given the class, GNB can be wrongly employed in problems where that assumption does not hold true, a very common occurrence.
Not being highly flexible, logistic regression may fail to capture interactions between features and so may lose predictive power. It can also overfit if very little data are available for training.
Differences:

Since logistic regression directly learns P(Y|X), it is a discriminative classifier, whereas GNB directly estimates P(Y) and P(X|Y) and so is a generative classifier.

Logistic regression requires an optimization setup (where weights cannot be learned directly through counts) whereas GNB requires no such setup.

Similarities:

Both methods produce linear decision functions from the training data.

GNB’s implied P(Y|X) has the same form as that of logistic regression (but with particular parameters).

Given these advantages and disadvantages, logistic regression would be preferable when the size of the training data is not an issue, since GNB’s assumption of conditional independence breaks down if features are correlated. However, in cases where training data are limited or the data-generating process includes strong priors, using GNB may be preferable.
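
A minimal scikit-learn comparison of the two classifiers on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: estimates P(Y) and P(X|Y), then applies Bayes' rule.
gnb = GaussianNB().fit(X_tr, y_tr)
# Discriminative: learns P(Y|X) directly via an optimization procedure.
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("GNB accuracy:", gnb.score(X_te, y_te))
print("LR accuracy: ", lr.score(X_te, y_te))
```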

21
Q

Lead Scoring Algorithm

A

Step 1: Clarify Lead Scoring Requirements

Lead scoring is the process of assigning numerical scores to leads (potential customers) of a business. Lead scores can be based on a variety of attributes, and they help sales and marketing teams prioritize leads and try to convert them into customers.

As always, it’s smart to ask the interviewer clarifying questions. In this case, learning more about the requirements for the lead scoring algorithm is critical. Questions to ask include:

Are we building this for our own company’s sales leads? Or, are we building an extensible version as part of the Salesforce product?
Are there any business requirements behind the lead scoring (i.e., does it need to be easy to explain internally and/or externally)?
Are we running this algorithm only on companies in our sales database (CRM), or looking at a larger landscape of all companies?
For our solution, we’ll assume the interviewer means we want to develop a lead scoring model to be used internally: that is, using the company’s internal sales data to predict whether a prospective company will purchase a Salesforce product.

Step 2: Explain the Features You’d Use

Some elements which should influence whether a prospective company turns into a customer:

Firmographic Data: What type of company is this? Industry? Amount of revenue? Employee count?
Marketing Activity: Have they interacted with marketing materials, like clicking on links within email marketing campaigns? Have employees from that company downloaded whitepapers, read case studies, or clicked on ads? If so, how much activity has there been recently?
Sales Activity: Has the prospective company interacted with sales? How many sales meetings took place, and how recently did the last one take place?
Deal Details: What products are being bought? Some might be harder to close than others. How many seats (licenses) are being bought? What’s the size of the deal? What’s the contract length?
After selecting features, it is good to apply the standard set of feature-engineering best practices. Note that the model will only be as good as the data and the judgment applied in feature engineering; in practice, many companies that predict lead scores face issues with missing data or a lack of relevant data.

Step 3: Explain Models You’d Use

Lead scoring can be done by building a binary classifier that predicts the probability of a lead converting. In terms of model selection, logistic regression offers a straightforward solution with an easily interpretable result: the resulting log-odds translates directly into a probability score for, in this case, purchasing a particular item. However, it cannot capture complex interaction effects between different variables and could also be numerically unstable under certain conditions (i.e., correlated covariates and a relatively small user base).

An alternative to logistic regression would be to use a more complex model, such as a neural network or an SVM. Both are good at dealing with high-dimensional data and at capturing the complex interactions that logistic regression cannot. However, unlike logistic regression, neither is easy to explain, and both generally require a large amount of data to perform well.

A suitable compromise is a tree-based model, such as a random forest or XGBoost, which typically performs well. With tree-based models, the features that have the highest influence on predictions are readily identified, a characteristic that could be very useful in this particular case.
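
A minimal sketch of such a tree-based lead scorer using scikit-learn's gradient boosting (the feature names and values are hypothetical, not Salesforce's actual schema):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical lead-level features; column names and values are illustrative only.
leads = pd.DataFrame({
    "employee_count":      [50, 2000, 10, 700, 120, 35, 1500, 8],
    "annual_revenue_musd": [5, 900, 0.5, 150, 20, 2, 400, 0.3],
    "email_clicks_30d":    [3, 10, 0, 7, 1, 0, 6, 0],
    "sales_meetings":      [1, 4, 0, 2, 1, 0, 3, 0],
    "converted":           [0, 1, 0, 1, 0, 0, 1, 0],  # label: did the lead become a customer?
})

X, y = leads.drop(columns="converted"), leads["converted"]
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Predicted conversion probability serves as the lead score; importances aid explainability.
new_leads = X.iloc[:2]
print(model.predict_proba(new_leads)[:, 1])
print(dict(zip(X.columns, model.feature_importances_.round(2))))
```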

Step 4: Model Deployment Nuances

Lastly, it is important to monitor for feature shifts and/or model degradations. As the product line and customer base changes over time, models trained on old data may not be as relevant. For a mature company like Salesforce, for example, it’s very likely that companies signing up now aren’t exactly like the companies that signed up with Salesforce 5 or 10 years ago. That’s why it’s important to monitor feature drift, and continuously update the model.

22
Q

Convex Functions

A

Mathematically, a convex function f satisfies the following for any two points x and y in the domain of f:

f((1-\alpha)x + \alpha y) \leq (1-\alpha)f(x) + \alpha f(y), \quad 0 \leq \alpha \leq 1
That is, the line segment joining (x, f(x)) and (y, f(y)) lies on or above the graph of f for any points x and y. Convexity matters because it has implications for the nature of minima of f. Stated more specifically, any local minimum of f is also a global minimum.

Neural networks provide a significant example of non-convex problems in machine learning. At a high level, this is because neural networks are universal function approximators, meaning that they can (with a sufficient number of neurons) approximate any function arbitrarily well. Because not all functions are convex (convex functions cannot approximate non-convex ones), by definition, they must be non-convex.

In particular, the cost function of a neural network has a number of local minima; you could interchange the parameters of different nodes within a layer and still obtain exactly the same cost function output (all inputs/outputs the same, but with nodes swapped). Therefore, there is no single, unique global minimum, and so the training objective of a neural network cannot be convex.
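
A small numerical sketch of the definition, checking the convexity inequality on sampled points (the sample ranges and functions are arbitrary illustrations):

```python
import numpy as np

def is_convex_on_samples(f, xs, ys, alphas):
    """Numerically check f((1-a)x + a y) <= (1-a)f(x) + a f(y) on sampled points."""
    for x in xs:
        for y in ys:
            for a in alphas:
                lhs = f((1 - a) * x + a * y)
                rhs = (1 - a) * f(x) + a * f(y)
                if lhs > rhs + 1e-12:
                    return False   # found a violation of the convexity inequality
    return True

pts = np.linspace(-3, 3, 25)
alphas = np.linspace(0, 1, 11)

print(is_convex_on_samples(lambda x: x ** 2, pts, pts, alphas))     # True: x^2 is convex
print(is_convex_on_samples(lambda x: np.sin(x), pts, pts, alphas))  # False: sin is non-convex
```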