ML - Preprocessing & Evaluation Flashcards
What are the three types of error in an ML model? Briefly describe them.
- Bias - error caused by choosing an algorithm that cannot accurately model the signal in the data, i.e. the model is too general or was incorrectly selected. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
- Variance - error from an estimator being too specific and learning relationships that are specific to the training set but do not generalize to new samples well. Variance can come from fitting too closely to noise in the data, and models with high variance are extremely sensitive to changing inputs. Example: Creating a decision tree that splits the training set until every leaf node only contains 1 sample.
- Irreducible error (a.k.a. noise) - error caused by randomness or inherent noise in the data that cannot be removed through modeling. Example: inaccuracy in data collection causes irreducible error. Example: trying to predict if someone will sneeze tomorrow. Even the best model can’t account for a rogue pollen grain.
What is the bias-variance trade-off?
**Bias** refers to error from an estimator that is too general and does not learn relationships from a data set that would allow it to make better predictions.
**Variance** refers to error from an estimator being too specific and learning relationships that are specific to the training set but will not generalize to new records well.
In short, the bias-variance trade-off is the trade-off between underfitting and overfitting: as you decrease variance, you tend to increase bias, and as you decrease bias, you tend to increase variance.
Your goal is to create models that minimize the overall error through careful model selection and tuning to ensure there is a balance between bias and variance: general enough to make good predictions on new data but specific enough to pick up as much signal as possible.
What are some naive approaches to classification that can be used as a baseline for results?
- Predict only the most common class: if the majority of samples have a target of 1, predict 1 for the entire validation set. This is extremely useful as a baseline for imbalanced data sets.
- Predict a random class: if you have two classes, 1 and 0, randomly select either 1 or 0 for each sample in the validation set.
- Randomly draw from a distribution matching that of the target variable in the training set: if 70% of training samples are class A and 30% are class B, randomly sample from this distribution to create predictions for your validation set.
These baseline results are good to calculate at the start, and you should include at least one when making any assertions about the efficacy of your model, e.g. “our model was 50% more accurate than the naive approach of recommending the most popular car to every customer.”
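These three baselines map naturally onto scikit-learn’s DummyClassifier strategies; a minimal sketch, using a synthetic dataset purely for illustration:

```python
# Minimal sketch of the three naive baselines with scikit-learn's DummyClassifier.
# The dataset is synthetic and only for illustration.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Map each naive approach to the corresponding DummyClassifier strategy.
strategies = {
    "most common class": "most_frequent",
    "random class": "uniform",
    "match training distribution": "stratified",
}

for name, strategy in strategies.items():
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_train, y_train)
    acc = accuracy_score(y_val, baseline.predict(X_val))
    print(f"{name}: accuracy = {acc:.3f}")
```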
Explain the classification metric Area Under the Curve (AUC)
AUC (Area Under the Curve) is the area under the ROC (Receiver Operating Characteristic) curve of a classification model. It measures the model’s ability to distinguish between classes: AUC is the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one (in terms of the probability it assigns to the positive class). AUC ranges from 0 to 1, where 1 indicates perfect classification and 0.5 indicates no discrimination (equivalent to random guessing).
Explain the classification metric Gini.
Gini is a related metric that rescales AUC to the range -1 to 1, so that 0 represents a model making random predictions: Gini = 2*AUC - 1
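A short sketch computing AUC and Gini with scikit-learn’s roc_auc_score; the labels and predicted probabilities below are illustrative placeholders:

```python
# Compute AUC from predicted probabilities, then derive Gini.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_scores = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]   # predicted P(class = 1)

auc = roc_auc_score(y_true, y_scores)
gini = 2 * auc - 1                                     # Gini = 2*AUC - 1
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```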
What’s the difference between bagging and boosting?
Bagging and boosting are both ensemble methods, meaning they combine many weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds models sequentially, at each step emphasizing the observations that were missed in previous steps.
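A hedged sketch contrasting the two with scikit-learn’s BaggingClassifier (independent bootstrap-sampled trees) and GradientBoostingClassifier (sequentially built trees); the synthetic dataset is only for illustration:

```python
# Compare a bagging ensemble and a boosting ensemble on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each tree is trained independently on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: trees are added sequentially, each one emphasizing earlier mistakes.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```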
How can you tell if your model is underfitting your data?
If your training and validation error are both relatively equal and very high, then your model is most likely underfitting your training data.
How can you tell if your model is overfitting your data?
If your training error is low and your validation error is high, then your model is most likely overfitting your training data.
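A rough sketch of both diagnoses by comparing training and validation scores; the synthetic data and the choice of decision-tree depths are illustrative assumptions:

```python
# A shallow tree (max_depth=1) tends to underfit: train and validation scores
# are both low. An unconstrained tree tends to overfit: train score is high,
# validation score is noticeably lower.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in [1, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```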
Name and briefly explain several evaluation metrics that are useful for classification problems.
- Accuracy - measures the percentage of the time you correctly classify samples: (true positive + true negative) / all samples
- Precision - measures the percentage of predicted positives that were correctly classified: true positives / (true positives + false positives)
- Recall - measures the percentage of actual positives that were correctly classified: true positives / (true positives + false negatives)
- F1 - the harmonic mean of precision and recall (or you can think of it as balancing Type I and Type II error)
- AUC - describes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
- Gini - a scaled and centered version of AUC
- Log-loss - similar to accuracy but increases the penalty for incorrect classifications that are “further” away from their true class. For log-loss, lower values are better.
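A brief sketch computing these metrics with scikit-learn; y_true, y_pred, and y_proba are made-up placeholders:

```python
# Hard predictions (y_pred) feed accuracy/precision/recall/F1;
# predicted probabilities (y_proba) feed AUC, Gini, and log-loss.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                     # hard class predictions
y_proba = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7, 0.9]    # predicted P(class = 1)

auc = roc_auc_score(y_true, y_proba)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", auc)
print("Gini     :", 2 * auc - 1)
print("log-loss :", log_loss(y_true, y_proba))
```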
Name and briefly explain several evaluation metrics that are useful for regression problems.
- Mean squared error (MSE) - the average of the squared error of each prediction
- Root mean squared error (RMSE) - square root of MSE
- Mean absolute error (MAE) - the average of the absolute error of each prediction
- Coefficient of determination (R^2) - proportion of variance in the target that is predictable from the features
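A brief sketch computing these with scikit-learn (RMSE taken as the square root of MSE); y_true and y_pred are made-up placeholders:

```python
# Compute the four regression metrics on a tiny illustrative example.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```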
When should you reduce the number of features used by your model?
Some instances when feature selection is necessary:
• When there is strong collinearity between features
• There are an overwhelming number of features
• There is not enough computational power to process all features
• The algorithm forces the model to use all features, even when they are not useful (most often in parametric or linear models)
• When you wish to make the model simpler for any reason, e.g. easier to explain, less computational power needed, etc
When is feature selection unnecessary?
Some instances when feature selection is not necessary:
• There are relatively few features
• All features contain useful and important signal
• There is no collinearity between features
• The model will automatically select the most useful features
• The computing resources can handle processing all of the features
• Thoroughly explaining the model to a non-technical audience is not critical
What are the three types of feature selection methods?
• Filter Methods - feature selection is done independent of the learning algorithm, before any modelling is done. One example is finding the correlation between every feature and the target and throwing out those that don’t meet a threshold. Easy, fast, but naive and not as performant as other methods.
• Wrapper Methods - train models on subsets of the features and use the subset that results in the best performance. Examples are stepwise selection and recursive feature elimination. The advantage is that each feature is considered in the context of the other features, but this can be computationally expensive.
• Embedded Methods - learning algorithms have built-in feature selection e.g. L1 regularization
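A hedged sketch showing one example from each family with scikit-learn (SelectKBest as a filter, RFE as a wrapper, Lasso as an embedded method); the synthetic regression data is only for illustration:

```python
# One example per feature-selection family on the same synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature against the target, keep the top k (model-agnostic).
filter_selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Wrapper: recursively drop the weakest features according to a fitted model.
wrapper_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)

# Embedded: L1 regularization drives uninformative coefficients to zero.
embedded_model = Lasso(alpha=1.0).fit(X, y)

print("filter  :", filter_selector.get_support(indices=True))
print("wrapper :", wrapper_selector.get_support(indices=True))
print("embedded:", (embedded_model.coef_ != 0).nonzero()[0])
```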
What are two common ways to automate hyperparameter tuning?
- Grid search - test every possible combination of pre-defined hyperparameter values and select the best one
- Randomized search - randomly test possible combinations of pre-defined hyperparameter values and select the best tested one
What are the pros and cons of grid search?
Pros: Grid search is great when you need to fine-tune hyperparameters over a small search space automatically. For example, if you have 100 different datasets that you expect to be similar (e.g. solving the same problem repeatedly with different populations), you can use grid search to automatically fine-tune the hyperparameters for each model.
Cons: Grid search is computationally expensive and inefficient, often searching over regions of the parameter space that have very little chance of being useful, which makes it extremely slow. It is especially slow when searching a large space, since its complexity increases exponentially as more hyperparameters are optimized.
What are the pros and cons of randomized search?
Pros: Randomized search does a good job finding near-optimal hyperparameters over a very large search space relatively quickly and doesn’t suffer from the same exponential scaling problem as grid search.
Cons: Randomized search does not fine-tune the results as much as grid search does since it typically does not test every possible combination of parameters.
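A minimal sketch of both approaches using scikit-learn’s GridSearchCV and RandomizedSearchCV; the random forest, parameter grid, and distributions are illustrative choices, not recommendations:

```python
# Grid search exhaustively tries every combination; randomized search samples
# a fixed number of combinations from the given distributions.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: 3 x 3 = 9 combinations, each cross-validated.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=3,
).fit(X, y)

# Randomized search: only n_iter sampled combinations, regardless of space size.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": [3, 5, None]},
    n_iter=10,
    cv=3,
    random_state=0,
).fit(X, y)

print("grid best      :", grid.best_params_)
print("randomized best:", random_search.best_params_)
```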
What are some naive feature engineering techniques that improve model efficacy?
- Summary statistics (mean, median, mode, min, max, std) for each group of similar records, e.g. all male customers between the ages of 32 and 44 would get their own set of summary stats
- Interactions or ratios between features, e.g. var1/var2 or var1*var2
- Summaries of features, e.g. the number of purchases a customer made in the last 30 days (raw features may be last 10 purchase dates)
- Splitting feature information manually, e.g. whether a customer is taller than 6 feet may be a critical piece of information when recommending a car vs. an SUV
- kNN - use the nearest records in the training set to produce a “kNN” feature that is fed into another model
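An illustrative sketch of a few of these techniques with pandas; the toy customer table and column names are assumptions:

```python
# Group summary statistics, a feature ratio, and a manual split as new columns.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "gender": ["M", "F", "M", "F", "M", "F"],
    "age": [35, 41, 29, 52, 38, 44],
    "income": [60_000, 85_000, 48_000, 120_000, 72_000, 95_000],
    "purchases_30d": [2, 5, 1, 7, 3, 4],
})

# Summary statistic per group of similar records (mean income by gender).
df["mean_income_by_gender"] = df.groupby("gender")["income"].transform("mean")

# Ratio between two features.
df["income_per_purchase"] = df["income"] / df["purchases_30d"]

# Manually split feature information into a binary flag.
df["is_over_40"] = (df["age"] > 40).astype(int)

print(df)
```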
What are three methods for scaling your data?
- “Normalization” or “scaling” - general terms that refer to transforming your input data to a new scale (often a linear transformation) such as to 0 to 1, -1 to 1, 0 to 10, etc
- Min-Max - linear transformation of data that maps the minimum value to 0 and the maximum value to 1
- Standardization - transforms each feature to a normal distribution with a mean of 0 and standard deviation of 1. May also be referred to as Z-score transformation
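A small sketch of Min-Max scaling and standardization with scikit-learn; the toy column (with a deliberate outlier) is only for illustration:

```python
# Min-Max maps the data to [0, 1]; standardization gives mean 0 and std 1.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

print("min-max     :", MinMaxScaler().fit_transform(X).ravel())
print("standardized:", StandardScaler().fit_transform(X).ravel())
```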
Explain one major drawback to each of the three scaling methods.
General normalization scaling is sensitive to outliers, since a single extreme value compresses most other values into a narrow portion of the new range, making them appear extremely close together.
Min-Max scaling has the same sensitivity to outliers: because the minimum and maximum define the endpoints of the new scale, one extreme value squeezes the remaining values close together.
Standardization (or Z-score transformation) rescales to an unbounded interval which can be problematic for certain algorithms, e.g. some neural networks, that expect input values to be inside a specific range.
When should you scale your data? Why?
When your algorithm weights each input (e.g. gradient descent used by many neural nets) or uses distance metrics (e.g. kNN), model performance can often be improved by normalizing, standardizing, or otherwise scaling your data so that each feature is given relatively equal weight.
It is also important when features are measured in different units, e.g. feature A is measured in inches, feature B is measured in feet, and feature C is measured in dollars, that they are scaled in a way that they are weighted and/or represented equally.
In some cases, efficacy will not change but perceived feature importance may change, e.g. coefficients in a linear regression.
Scaling your data typically does not change performance or feature importance for tree-based models since the split points will simply shift to compensate for the scaled data.
Describe basic feature encoding for categorical variables.
Feature encoding involves replacing classes in a categorical variable with new values such as integers or real numbers. For example, [red, blue, green] could be encoded to [8, 5, 11].
When should you encode your features? Why?
You should encode categorical features so that they can be processed by numerical algorithms, e.g. so that machine learning algorithms can learn from them.
What are three methods for encoding categorical data?
• Label encoding (non-ordinal) - each category is assigned a numeric value not representing any ordering. Example: [red, blue, green] could be encoded to [8, 5, 11].
• Label encoding (ordinal) - each category is assigned a numeric value representing an ordering. Example: [small, medium, large] could be encoded to [1, 2, 3]
• One-hot encoding aka binary encoding - each category is transformed into a new binary feature, with all records being marked 1/True or 0/False. Example: color = [red, blue, green] could be encoded to color_red = [1, 0, 0], color_blue = [0, 1, 0], color_green = [0, 0, 1]
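A short sketch of all three encodings with pandas; the color/size columns and the ordinal mapping are illustrative:

```python
# Non-ordinal label encoding, ordinal label encoding, and one-hot encoding.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green"],
                   "size": ["small", "medium", "large"]})

# Label encoding (non-ordinal): arbitrary integer codes.
df["color_code"] = df["color"].astype("category").cat.codes

# Label encoding (ordinal): codes that respect a meaningful ordering.
df["size_code"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# One-hot encoding: one binary column per category.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

print(df)
```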
What are some common uses of decision tree algorithms?
- Classification
- Regression
- Measuring feature importance
- Feature selection
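A quick sketch of a decision tree used for classification and for measuring feature importance via feature_importances_; the synthetic data and tree depth are illustrative:

```python
# Fit a classification tree, then rank features by their learned importance,
# which can also drive feature selection.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

for i, importance in enumerate(tree.feature_importances_):
    print(f"feature {i}: importance = {importance:.3f}")
```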