DS - Concepts Flashcards

Question

[ML] Regularization

Answer 1

- For linear models, regularization is typically achieved by constraining the weights of the model. The three common methods are Ridge Regression, Lasso Regression, and Elastic Net.
- Each adds a penalty on large coefficients, scaled by a hyperparameter λ, to the cost function, which reduces overfitting. The features should be scaled (normalized or standardized) first, because the penalty is sensitive to feature scale.

Answer 2

- Ridge Regression is a regularized version of Linear Regression with a regularization term added to the cost function. This forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible. The regularization term is added to the cost function only during training; model performance is then evaluated with the unregularized measure. It is common for the cost function used in training to differ from the performance measure used in testing.
- The hyperparameter α controls how much you want to regularize the model. If α = 0, Ridge Regression is just Linear Regression. If α is very large, all weights end up very close to zero and the result is a flat line going through the data's mean. The bias term is not regularized.
- The penalty hyperparameter sets the type of regularization to use. Specifying 'l2' indicates that you want SGD to add a regularization term to the cost function equal to half the square of the l2 norm of the weight vector: this is simply Ridge Regression.
- Reduces the effects of multicollinearity.
- L2 regularization tends to spread error among all the terms and corresponds to a Gaussian prior.
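
The effect of α described above can be sketched for a single feature. This is a minimal illustration, assuming the cost is the sum of squared errors plus α times the squared weight (other scalings of the penalty exist); the function name `ridge_1d` and the toy data are made up for the example.

```python
def ridge_1d(xs, ys, alpha):
    """Closed-form ridge fit of y = w * x + b for one feature.

    Assumed cost: sum((y - w*x - b)**2) + alpha * w**2.
    The bias b is not penalized, matching the note above.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)   # alpha in the denominator shrinks the weight
    b = y_mean - w * x_mean   # unpenalized bias follows from the means
    return w, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_plain, b_plain = ridge_1d(xs, ys, alpha=0.0)   # ordinary least squares
w_flat, b_flat = ridge_1d(xs, ys, alpha=1e9)     # weight pushed toward zero
```

With α = 0 the fit recovers the plain least-squares line; with a very large α the weight is close to zero and the bias is close to the mean of `ys`, which is the flat line mentioned above.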

Answer 3

- L1 regularization is more binary/sparse: many variables are effectively either kept (nonzero weight) or dropped (weight of exactly zero). It corresponds to setting a Laplace prior on the weights.
- Least Absolute Shrinkage and Selection Operator Regression.
- Adds a regularization term to the cost function, but it uses the l1 norm of the weight vector instead of half the square of the l2 norm.
- LASSO tends to eliminate the weights of the least important features.
- It automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
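
Why the l1 penalty produces exact zeros can be sketched with a one-weight problem. This is a simplified illustration, assuming each weight is fitted independently with unregularized estimate z: minimizing ½(w − z)² + λ|w| gives the soft-thresholding rule, while minimizing ½(w − z)² + ½λw² only rescales z. The function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) (l1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero


def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2 (l2 penalty)."""
    return z / (1.0 + lam)


estimates = [3.0, -0.4, 0.2, -2.0]
lasso_weights = [soft_threshold(z, 0.5) for z in estimates]
ridge_weights = [ridge_shrink(z, 0.5) for z in estimates]
```

In this sketch the two small estimates become exactly zero under the l1 rule, while the l2 rule shrinks every weight but leaves all of them nonzero.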

Answer 4

The possibility of overfitting the training data and carrying the noise of that data through to the test set, thereby producing inaccurate generalizations, is a fundamental problem. There are three main methods to avoid overfitting:
- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
- Use cross-validation techniques such as k-fold cross-validation.
- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
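
The k-fold idea mentioned above can be sketched as an index split. This is a minimal version, assuming no shuffling or stratification (practical implementations usually offer both); `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


splits = k_fold_indices(n_samples=10, k=3)
```

Each sample lands in exactly one test fold, so every observation is used for evaluation once and for training k − 1 times.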

Answer 5

An imbalanced dataset is one where, for example, you have a classification task and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be misleading if you have no predictive power on the other category of data. A few tactics to solve this are:
- Collect more data to even out the imbalances in the dataset.
- Resample the dataset to correct for imbalances.
- Try a different algorithm altogether on your dataset.
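
The 90% accuracy problem can be made concrete with a toy case. The sketch below assumes a made-up dataset of 90 negatives and 10 positives and a "model" that always predicts the majority class.

```python
# Made-up labels: 90 examples of class 0, 10 examples of class 1.
labels = [0] * 90 + [1] * 10
# A degenerate model that always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
false_neg = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
minority_recall = true_pos / (true_pos + false_neg)
```

Accuracy comes out at 0.9 even though recall on the minority class is 0.0, which is why accuracy alone is a poor metric here.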

Answer 6

- Is the middle ground between Ridge Regression and LASSO Regression.
- The regularization term is a simple mix of the Ridge and LASSO terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge; when r = 1, it is equivalent to LASSO.
- Which regularization to use:
  - Generally avoid plain Linear Regression; a little regularization is almost always preferable.
  - If only a few features are suspected to be useful, prefer LASSO or Elastic Net, because they tend to reduce the useless features' weights down to zero.
  - Elastic Net is generally preferred over LASSO, because LASSO may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
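
The mix ratio r can be sketched as a penalty function. This assumes one common convention, penalty = r·α·Σ|w| + ((1 − r)/2)·α·Σw²; libraries may scale the two terms differently, so treat the exact constants as an assumption.

```python
def elastic_net_penalty(weights, alpha, r):
    """Mix of l1 and l2 penalties controlled by the ratio r in [0, 1]."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return r * alpha * l1 + (1.0 - r) / 2.0 * alpha * l2_squared


weights = [1.0, -2.0, 0.5]
ridge_like = elastic_net_penalty(weights, alpha=2.0, r=0.0)   # pure l2 term
lasso_like = elastic_net_penalty(weights, alpha=2.0, r=1.0)   # pure l1 term
mixed = elastic_net_penalty(weights, alpha=2.0, r=0.5)
```

At r = 0 only the half-squared-l2 term remains, and at r = 1 only the l1 term remains, matching the two limiting cases above.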

Answer 7

Transforming categorical features into several binary ones. Increases dimensionality of the feature vector.
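
A minimal sketch of this transformation, assuming categories are ordered alphabetically (the ordering is an arbitrary choice made for the example):

```python
def one_hot(values):
    """Encode each categorical value as a binary vector, one slot per category."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

One column becomes three here, which is the dimensionality increase noted above.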

Answer 8

Process of converting a continuous feature into multiple binary features, called bins or buckets, typically based on value range.
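
A short sketch of binning by value range, assuming hypothetical age boundaries of 18, 35, and 60 (a value equal to a boundary goes into the higher bin):

```python
from bisect import bisect_right

# Hypothetical bin edges: <18, 18-34, 35-59, 60 and over.
AGE_EDGES = [18, 35, 60]


def bin_one_hot(value, edges):
    """Return a binary vector with a 1 in the bin that contains value."""
    row = [0] * (len(edges) + 1)
    row[bisect_right(edges, value)] = 1
    return row


ages = [5, 18, 40, 72]
binned = [bin_one_hot(age, AGE_EDGES) for age in ages]
# binned -> [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```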

Answer 9

Process of converting the actual range of values that a numerical feature can take into a standard range of values, typically the interval [-1, 1] or [0, 1]. This can lead to increased speed of learning.
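
A minimal min-max sketch of this rescaling. The handling of a constant feature (mapping everything to the lower bound) is an assumption made to avoid division by zero.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant feature carries no range, so map it to low.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


raw = [10.0, 20.0, 30.0, 50.0]
unit_range = min_max_scale(raw)               # target interval [0, 1]
symmetric = min_max_scale(raw, -1.0, 1.0)     # target interval [-1, 1]
```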

Answer 10

The procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
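
A small sketch of this rescaling, z = (x − μ) / σ. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, std dev 2
z_scores = standardize(feature)
```

After the transformation the feature has mean 0 and standard deviation 1, whatever its original units were.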

Answer 11

1) Replace the missing values with the average value of the feature in the dataset.
2) Replace the missing value with a value outside the normal range of values.
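
Both strategies can be sketched in a few lines. The example assumes missing values are represented as `None` and uses -1.0 as a hypothetical out-of-range marker for an age feature.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    average = sum(observed) / len(observed)
    return [average if v is None else v for v in values]


def impute_constant(values, fill_value):
    """Replace None with a fixed value, e.g. one outside the normal range."""
    return [fill_value if v is None else v for v in values]


ages = [20.0, None, 40.0, None, 30.0]
mean_filled = impute_mean(ages)              # -> [20.0, 30.0, 40.0, 30.0, 30.0]
marker_filled = impute_constant(ages, -1.0)  # -> [20.0, -1.0, 40.0, -1.0, 30.0]
```

The out-of-range marker lets a model learn that "missing" is its own signal, at the cost of introducing an artificial value into the feature.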

Answer 12

1) Take the entire data set as input.
2) Calculate the entropy of the target variable, as well as of the predictor attributes.
3) Calculate the information gain of all attributes (we gain information on sorting different objects from each other).
4) Choose the attribute with the highest information gain as the root node.
5) Repeat the same procedure on every branch until the decision node of each branch is finalized.
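
Steps 2 to 4 can be sketched on a tiny made-up dataset. The rows, the attribute names, and the helper functions are all hypothetical; the sketch only picks the root attribute and does not recurse into branches.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)   # attribute with the highest gain
```

Here "outlook" separates the classes far better than "windy", so it would be chosen as the root node.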

Answer 13

1) Randomly select 'k' features from a total of 'm' features, where k << m.
2) Among the 'k' features, calculate the node D using the best split point.
3) Split the node into daughter nodes using the best split.
4) Repeat steps two and three until the leaf nodes are finalized.
5) Build the forest by repeating steps one to four 'n' times to create 'n' trees.
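
The steps above can be sketched in a heavily simplified form. Assumptions, all made for brevity: each "tree" is a single split (a stump), so step 4 is skipped; splits are scored by misclassification count rather than entropy or Gini; the toy data has m = 4 features and k = 2, so k is not really much smaller than m. All names are hypothetical.

```python
import random
from collections import Counter


def best_stump(rows, labels, feature_ids):
    """Steps 2-3: best single split among the given features (fewest errors)."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not right:
                continue  # threshold at the maximum does not split anything
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def build_forest(rows, labels, n_trees, k, seed=0):
    """Steps 1 and 5: draw k random features per tree, repeat n_trees times."""
    rng = random.Random(seed)
    m = len(rows[0])
    return [best_stump(rows, labels, rng.sample(range(m), k))
            for _ in range(n_trees)]


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data: features 0-2 track the label, feature 3 is noise.
rows = [[0.1, 0.2, 0.0, 1], [0.2, 0.1, 0.1, 0],
        [0.9, 0.8, 1.0, 1], [0.8, 0.9, 0.9, 0]]
labels = [0, 0, 1, 1]
forest = build_forest(rows, labels, n_trees=5, k=2)
```

A full implementation would grow each tree recursively and typically also train every tree on a bootstrap sample of the rows, which the steps above do not mention.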

Answer 14

With an overfit model, you get very accurate predictions on the training data but less accurate predictions on the test and real-world data. Overfitting occurs when the model is overly complex and captures the noise in the data. An underfit model is overly simple; it does not find the data’s underlying patterns, so inaccurate predictions are present in both the training and test results. Underfitting can be caused by data that does not cover enough combinations, or by improper randomization.

Answer 15

A measure of randomness or disorder in the data. The higher the value, the harder it is to draw conclusions. An entropy of 0 means perfect classification. A greedy algorithm seeks to homogenize data quickly by reducing entropy.
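
For class labels, the usual formula is H = −Σ pᵢ log₂ pᵢ over the class proportions pᵢ. A short sketch, using made-up label lists:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure set, 1 for a 50/50 binary split."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


pure = entropy(["yes", "yes", "yes", "yes"])     # one class only
balanced = entropy(["yes", "yes", "no", "no"])   # evenly mixed
skewed = entropy(["yes", "yes", "yes", "no"])    # mostly one class
```

The pure set has entropy 0, the evenly mixed set has entropy 1, and the skewed set falls in between.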

Answer 16

Type I → False Positive (value was classified as positive but is actually negative). Type II → False Negative (value was classified as negative but is actually positive)
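
Counting the two error types on a made-up set of binary labels (1 = positive, 0 = negative) looks like this:

```python
actual =    [1, 0, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 1]

# Type I error: predicted positive, actually negative (false positive).
type_1 = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)
# Type II error: predicted negative, actually positive (false negative).
type_2 = sum(1 for a, p in zip(actual, predicted) if p == 0 and a == 1)
```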

(40 cards)