Part 1 Flashcards

1
Q

Q: What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

A

There are many steps that can be taken when data wrangling and data cleaning. Some of the most common steps are listed below:
Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe().
Data visualizations: Sometimes, it’s useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.
Syntax errors: This includes removing stray white space, making letter casing consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs.
Standardization or normalization: Depending on the dataset you're working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that the different scales of different variables don't negatively impact the performance of your model.
Handling null values: There are a number of ways to handle null values, including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (e.g., unknown), predicting the values, or using machine learning models that can deal with null values.
Other steps include removing irrelevant data, removing duplicates, and type conversion. A few of these steps are sketched in code below.
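A minimal pandas sketch of a few of these steps, using a small made-up DataFrame (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical example data; columns and values are illustrative only
df = pd.DataFrame({
    "age": [23, 35, None, 29, 35],
    "city": [" new york", "New York", "chicago", "Chicago ", "chicago"],
})

# Data profiling
print(df.shape)       # (rows, columns)
print(df.describe())  # summary statistics for numerical columns

# Syntax errors: strip white space and make casing consistent,
# then inspect the distinct values for typos
df["city"] = df["city"].str.strip().str.lower()
print(df["city"].unique())

# Handling null values: here, replace with the median
df["age"] = df["age"].fillna(df["age"].median())

# Removing duplicates
df = df.drop_duplicates()

# Standardization (zero mean, unit variance)
df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()
```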

2
Q

Q: How to deal with unbalanced binary classification?

A
There are a number of ways to handle unbalanced binary classification (assuming that you want to identify the minority class):
First, you want to reconsider the metrics you'd use to evaluate your model. The accuracy of your model might not be the best metric to look at, and an example shows why: say 99 bank withdrawals were not fraudulent and 1 withdrawal was. If your model simply classified every instance as "not fraudulent", it would have an accuracy of 99%! Therefore, you may want to consider using metrics like precision and recall.
Another method to improve unbalanced binary classification is by increasing the cost of misclassifying the minority class. By increasing the penalty of such, the model should classify the minority class more accurately.
Lastly, you can improve the balance of classes by oversampling the minority class or by undersampling the majority class. (A short sketch of these ideas follows.)
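As a rough sketch of the first two ideas, here is how class weighting plus precision/recall might look with scikit-learn on synthetic data (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data that is roughly 99% negative, 1% positive
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight="balanced" increases the cost of misclassifying
# the minority class
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
```

For the resampling approach, scikit-learn's sklearn.utils.resample (or the imbalanced-learn library) can oversample the minority class or undersample the majority class.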
3
Q

Q: What is the difference between a box plot and a histogram?

A

While boxplots and histograms are visualizations used to show the distribution of the data, they communicate information differently.
Histograms are bar charts that show the frequency of a numerical variable's values and are used to approximate the probability distribution of the given variable. They allow you to quickly understand the shape of the distribution, the variation, and potential outliers.
Boxplots communicate different aspects of the distribution of data. While you can’t see the shape of the distribution through a box plot, you can gather other information like the quartiles, the range, and outliers. Boxplots are especially useful when you want to compare multiple charts at the same time because they take up less space than histograms.
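A quick matplotlib sketch of the same data shown both ways (the data here is made up to include a couple of outliers):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: a roughly normal sample plus two outliers
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 10, 1000), [120, 130]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(data, bins=30)   # shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data)         # quartiles, range, and outliers
ax2.set_title("Box plot")
plt.show()
```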

4
Q

Q: Describe different regularization methods, such as L1 and L2 regularization?

A

Both L1 and L2 regularization are methods used to reduce the overfitting of training data. Least Squares minimizes the sum of the squared residuals, which can result in low bias but high variance.
L2 Regularization, also called ridge regression, minimizes the sum of the squared residuals plus lambda times the slope squared. This additional term is called the Ridge Regression Penalty. This increases the bias of the model, making the fit worse on the training data, but also decreases the variance.
If you take the ridge regression penalty and replace it with the absolute value of the slope, then you get Lasso regression or L1 regularization.
L2 is less robust to outliers, but it has a stable solution and always exactly one solution. L1 is more robust, but its solution can be unstable and is not necessarily unique. The objectives are written out below for reference.
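For reference, the three objectives, written for the single-slope case the answer describes (λ is the regularization strength; a larger λ means a bigger penalty):

```latex
\text{Least squares:}\;\; \min_{\beta} \sum_{i}(y_i - \hat{y}_i)^2
\qquad
\text{Ridge (L2):}\;\; \min_{\beta} \sum_{i}(y_i - \hat{y}_i)^2 + \lambda \beta^2
\qquad
\text{Lasso (L1):}\;\; \min_{\beta} \sum_{i}(y_i - \hat{y}_i)^2 + \lambda |\beta|
```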

5
Q

Q: Explain a Neural Network.

A

A neural network is a multi-layered model inspired by the human brain. Like the neurons in our brain, nodes are connected to one another in layers: an input layer, one or more hidden layers, and an output layer. Each node in the hidden layers represents a function that the inputs go through, ultimately leading to a value at the output layer. These functions are called activation functions; a common choice is the sigmoid.
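A minimal forward-pass sketch in NumPy, assuming a 2-input, 3-hidden-node, 1-output network with sigmoid activations (the weights are random placeholders; in practice they are learned, e.g., by backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # input -> hidden
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # hidden -> output

x = np.array([0.5, -1.2])            # one input example
hidden = sigmoid(W1 @ x + b1)        # each hidden node applies its function
output = sigmoid(W2 @ hidden + b2)   # value at the output layer
print(output)
```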

6
Q

Q: What is cross-validation?

A

Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model.
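Beyond the simple two-way split, k-fold cross-validation rotates which part of the data is held out. A sketch with scikit-learn (the model and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fifth of the data takes a turn as the test set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```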

7
Q

Q: How to define/select metrics?

A

There isn’t a one-size-fits-all metric. The metric(s) chosen to evaluate a machine learning model depends on various factors:
Is it a regression or classification task?
What is the business objective? E.g., precision vs. recall
What is the distribution of the target variable?
There are a number of metrics that can be used, including adjusted R-squared, MAE, MSE, accuracy, recall, precision, F1 score, and the list goes on.

8
Q

Q: Explain what precision and recall are

A

Recall attempts to answer “What proportion of actual positives was identified correctly?”

RECALL = TP / (TP + FN)

Precision attempts to answer “What proportion of positive identifications was actually correct?”

PRECISION = TP / (TP + FP)
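With counts from a hypothetical confusion matrix, the two formulas work out as follows:

```python
# Hypothetical counts: true positives, false positives, false negatives
TP, FP, FN = 40, 10, 20

recall = TP / (TP + FN)     # of all actual positives, how many were found
precision = TP / (TP + FP)  # of all positive calls, how many were right
print(recall)     # 0.666...
print(precision)  # 0.8
```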

9
Q

Q: Explain what a false positive and a false negative are. Why is it important to distinguish these from each other? Provide examples of when false positives are more important than false negatives, when false negatives are more important than false positives, and when these two types of errors are equally important.

A

A false positive is an incorrect identification of the presence of a condition when it’s absent.
A false negative is an incorrect identification of the absence of a condition when it’s actually present.
An example of when false negatives are more important than false positives is when screening for cancer. It’s much worse to say that someone doesn’t have cancer when they do, instead of saying that someone does and later realizing that they don’t.
This is a subjective argument, but false positives can be worse than false negatives from a psychological point of view. For example, a false positive for winning the lottery could be a worse outcome than a false negative because people normally don't expect to win the lottery anyway.
Finally, the two types of errors are roughly equally important when their costs are comparable, for example, a model deciding which customers receive a promotional flyer, where a wasted mailing (false positive) and a missed customer (false negative) are similarly cheap.

10
Q

Q: What is the difference between supervised learning and unsupervised learning? Give concrete examples

A

Supervised learning involves learning a function that maps an input to an output based on example input-output pairs [1].
For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age.

Unlike supervised learning, unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes. A common use of unsupervised learning is grouping customers by purchasing behavior to find target markets.
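A side-by-side sketch of both settings (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: labeled (input, output) pairs: predict height from age
ages = np.array([[5], [10], [15], [20]])   # input
heights = np.array([110, 140, 165, 175])   # labeled output
reg = LinearRegression().fit(ages, heights)
print(reg.predict([[12]]))

# Unsupervised: no labels: group customers by purchasing behavior
# (columns here are, say, average spend and visits per month)
purchases = np.array([[200, 2], [220, 3], [30, 12], [25, 10]])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(purchases))
```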

11
Q

Q: Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model

A

A) Adjusted R-squared.
R-squared is a measurement that tells you what proportion of the variance in the dependent variable is explained by the variance in the independent variables. In simpler terms, while the coefficients estimate trends, R-squared represents the scatter around the line of best fit.
However, every additional independent variable added to a model always increases the R-squared value, so a model with several independent variables may seem to be a better fit even if it isn't. This is where adjusted R-squared comes in: it compensates for each additional independent variable and only increases if a given variable improves the model beyond what would be expected by chance. This is important since we are creating a multiple regression model.
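For reference, the adjusted R-squared formula, where n is the number of observations and p is the number of independent variables:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```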

B) Cross-Validation
A method common to most people is cross-validation, splitting the data into two sets: training and testing data. See the earlier cross-validation answer for more on this.

12
Q

Q: What does NLP stand for?

A

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.

13
Q

Q: When would you use random forests Vs SVM and why?

A
There are a couple of reasons why a random forest is a better choice of model than a support vector machine:
Random forests allow you to determine the feature importance. SVMs can't do this.
Random forests are much quicker and simpler to build than an SVM.
For multi-class classification problems, SVMs require a one-vs-rest method, which is less scalable and more memory intensive.
14
Q

Q: Why is dimension reduction important?

A

Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly when you want to reduce the variance of your model (i.e., overfitting).
Wikipedia lists four advantages of dimensionality reduction:
It reduces the time and storage space required
Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
It avoids the curse of dimensionality

15
Q

Q: What is principal component analysis? Explain the sort of problems you would use PCA for.

A

In its simplest sense, PCA involves projecting higher-dimensional data (e.g., 3 dimensions) onto a lower-dimensional space (e.g., 2 dimensions). This results in a lower dimension of data (2 dimensions instead of 3) while retaining information from all of the original variables, since each principal component is a combination of them.
PCA is commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.
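A short scikit-learn sketch, projecting the 4-feature iris data down to 2 dimensions (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 rows, 4 features

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```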

16
Q

Q: Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

A

One major drawback of Naive Bayes is that it makes the strong assumption that the features are independent of one another, which is rarely the case in practice.
One way to improve such an algorithm that uses Naive Bayes is by decorrelating the features so that the assumption holds true.

17
Q

Q: What are the drawbacks of a linear model?

A

There are a couple of drawbacks of a linear model:
A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity.
A linear regression model can't be used directly for discrete or binary outcomes (you would need something like logistic regression instead).
You can’t vary the model flexibility of a linear model.

18
Q

Q: Do you think 50 small decision trees are better than a large one? Why?

A

Another way of asking this question is “Is a random forest a better model than a decision tree?” And the answer is yes because a random forest is an ensemble method that takes many weak decision trees to make a strong learner. Random forests are more accurate, more robust, and less prone to overfitting.

19
Q

Q: Why is mean square error a bad measure of model performance? What would you suggest instead?

A

Mean squared error (MSE) gives a relatively high weight to large errors, so it tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).

20
Q

Q: What are the assumptions required for linear regression? What if some of these assumptions are violated?

A

The assumptions are as follows:
The sample data used to fit the model is representative of the population
The relationship between X and the mean of Y is linear
The variance of the residual is the same for any value of X (homoscedasticity)
Observations are independent of each other
For any value of X, Y is normally distributed.
Extreme violations of these assumptions will make the results unreliable. Small violations of these assumptions will result in a greater bias or variance of the estimate.

21
Q

Q: What is collinearity and what to do with it? How to remove multicollinearity?

A

Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.
You can use the Variance Inflation Factor (VIF) to determine whether there is any multicollinearity between independent variables; a standard benchmark is that multicollinearity exists if the VIF is greater than 5. One way to remove it is to drop or combine the offending variables.
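A sketch of the VIF check using statsmodels, on made-up data where one predictor is nearly a multiple of another:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x2 is almost a multiple of x1 (collinear)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=100),
    "x3": rng.normal(size=100),
})

for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
# x1 and x2 should show VIFs far above 5; drop one or combine them
```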

22
Q

Q: How to check if the regression model fits the data well?

A

There are a couple of metrics that you can use:
R-squared/Adjusted R-squared: Relative measure of fit. This was explained in a previous answer
F-test: Evaluates the null hypothesis that all regression coefficients are equal to zero against the alternative hypothesis that at least one doesn't equal zero
RMSE: Absolute measure of fit.

23
Q

Q: What is a decision tree?

A

Decision trees are a popular model, used in operations research, strategic planning, and machine learning. Each split point in a tree is called a node, and the more nodes you have, the more closely the tree fits its training data (which also raises the risk of overfitting). The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build but fall short when it comes to accuracy.

24
Q

Q: What is a random forest? Why is it good?

A

Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. By relying on a “majority wins” model, it reduces the risk of error from an individual tree.
For example, if a single decision tree predicted 0 but the majority of trees in the forest predicted 1, the forest's prediction would be 1. This is the power of random forests.
Random forests offer several other benefits, including strong performance, the ability to model non-linear boundaries, a built-in validation estimate from the out-of-bag samples (so a separate cross-validation step is often unnecessary), and feature importance. A short sketch follows.
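A scikit-learn sketch (the dataset is an arbitrary example); note the out-of-bag score mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# 100 trees, each trained on a bootstrapped sample, with a random
# subset of features considered at each split
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)            # out-of-bag accuracy estimate
print(rf.feature_importances_)  # per-feature importance
```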

25
Q

Q: What is a kernel? Explain the kernel trick

A

A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high-dimensional) feature space, which is why kernel functions are sometimes called a “generalized dot product” [2].
The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
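In symbols, a kernel K computes the inner product of φ(x) and φ(y) under some feature map φ without ever forming φ(x) explicitly; two standard examples are the polynomial and RBF kernels:

```latex
K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle,
\qquad
K_{\text{poly}}(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{\top}\mathbf{y} + c)^{d},
\qquad
K_{\text{RBF}}(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{y} \rVert^{2}\right)
```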

26
Q

Q: Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?

A

When the number of features is greater than the number of observations, performing dimensionality reduction will generally improve the SVM.

27
Q

Q: What is overfitting?

A

Overfitting is an error where the model ‘fits’ the data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.

28
Q

Q: What is boosting?

A

Boosting is an ensemble method that improves a model by reducing its bias and variance, ultimately converting weak learners into strong learners. The general idea is to train a weak learner and then to sequentially train further learners, each one improving on the mistakes of the previous one.
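A minimal sketch with scikit-learn's gradient boosting (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each shallow tree (a weak learner) is fit to the errors of the
# ensemble built so far, sequentially improving on previous learners
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                learning_rate=0.1, random_state=0)
print(cross_val_score(gb, X, y, cv=5).mean())
```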