Concepts Flashcards
(87 cards)
Machine Learning
The process of creating an algorithm that predicts an outcome from data and can improve its performance through experience.
Supervised learning algorithms are…
- Trained on labeled data
Unsupervised learning algorithms are…
- Trained on unlabeled data
Regularization is…
Any process that reduces generalization error (i.e. testing error) but not training error. It controls a model’s capacity (its ability to fit a wide variety of functions) and therefore helps prevent overfitting.
Examples include:
- L1/L2 in linear/logistic regression
Hyperparameters are…
Parameters that can be used to control an algorithm’s behavior but are not learned. These should be tuned on a validation set.
Examples include:
- alpha (the “learning rate”) in gradient descent for linear regression, which controls the step size of each update
Difference between regression and classification
Regression is a process to predict continuous output values.
Classification is a process to predict categorical output values.
What is a cost function?
A cost function measures the error of our predictions. It quantifies how far our predicted outcomes are from the actual outcomes, so training seeks the parameters that minimize it.
Linear regression
- When is it used?
- What is the hypothesis?
- What is the cost function?
- Are there any assumptions?
Linear regression is used to predict a continuous outcome (e.g. house prices) from one or more input variables. These input variables can be continuous or categorical, but they must be represented numerically.
The hypothesis is a linear model:
y = theta_0 + theta_1 * x; in vector form, y = theta^T * x (where the first column of X is all 1’s)
A common cost function is Mean Squared Error (MSE). This function is parabolic in the univariate case.
Assumptions:
- linearity (the output is a linear function of the inputs)
- normality of the residuals
- independence of the observations
- homoscedasticity (constant variance of the residuals)
- no multicollinearity among the inputs
What is the Mean Squared Error (MSE) cost function for linear regression?
J = (1/2n) * sum from i=1 to n ( y_i,predicted - y_i,actual )^2
= (1/2n) * sum from i=1 to n ( theta_0 + theta_1 * x_i - y_i )^2
= (1/2n) * (X * theta - y)^T * (X * theta - y)
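A minimal numpy sketch of this cost (assuming X already includes a leading column of 1’s; names are illustrative):
import numpy as np

def mse_cost(X, y, theta):
    # J = (1/2n) * (X @ theta - y)^T (X @ theta - y)
    n = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * n)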
There are two ways to determine the coefficients for a linear regression model. What are they?
The Mean Squared Error (MSE) cost function can be minimized iteratively using gradient descent.
In the special case of linear regression, the cost function can also be minimized analytically via the normal equation: theta = (X^T * X)^(-1) * X^T * y.
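A numpy sketch of the analytical route, under the same assumption that X carries a leading column of 1’s:
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y; np.linalg.solve avoids an explicit matrix inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)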
Explain gradient descent.
Gradient descent is an algorithm that updates the coefficients to minimize the cost function.
In the case of linear regression:
- initial values of theta are chosen
- these are updated iteratively based on the slope of the cost function; we take steps along the cost function in the direction of steepest descent
- the size of the steps is controlled by hyperparameter alpha (“learning rate”)
- this repeats until a stopping condition is met (e.g. the change in the cost function falls below a tolerance, or a maximum number of iterations is reached)
The update equations look something like:
theta_updated = theta_current - alpha * partial derivative of the cost function with respect to theta_current
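As a sketch, the full loop in numpy for the MSE cost (whose gradient is (1/n) * X^T * (X * theta - y)); alpha and the iteration count are illustrative, and a fixed iteration count stands in for a real stopping condition:
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    n = len(y)
    theta = np.zeros(X.shape[1])               # initial values of theta
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / n   # dJ/dtheta for the MSE cost
        theta = theta - alpha * gradient       # step opposite the gradient
    return theta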
Discuss the effect of learning rate (hyperparameter alpha) in linear regression.
Alpha controls the rate of gradient descent when minimizing a cost function for linear regression. A larger value of alpha produces a larger step size, while a smaller value of alpha produces a smaller step size. If alpha is too small, you can find the minimum very precisely, but the algorithm may take a long time to converge. If alpha is too large, you may overshoot the minimum, and the algorithm may fail to converge or even diverge.
Note that even with a fixed alpha, the steps naturally get smaller as the number of iterations increases, because the slope of the cost function approaches zero near the minimum; alpha itself does not have to be decreased.
Things to keep in mind when preparing data for Machine Learning.
Gradient descent will work best if all of the input features are on a similar, small scale, e.g. between -1 and 1 or even -0.5 and 0.5. To achieve this:
- Feature scaling: divide each input variable by its range (max minus min) to achieve a range of 1
- Mean normalization: subtract the average value of each input variable from the values of that input variable to achieve an average of 0
- Standardization: subtract the average value of each input variable from the values of that input variable and divide by the standard deviation of that input variable to achieve an average of 0 and a standard deviation of 1
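Column-wise in numpy, on a toy feature matrix, the three transformations look something like:
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                      # toy feature matrix

X_scaled   = X / (X.max(axis=0) - X.min(axis=0))  # feature scaling: range of 1
X_centered = X - X.mean(axis=0)                   # mean normalization: mean of 0
X_standard = (X - X.mean(axis=0)) / X.std(axis=0) # standardization: mean 0, std 1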
How might you assess how well gradient descent is working?
Plot the value of the cost function against the iteration number. The cost should decrease on every iteration; if it increases or oscillates, the learning rate is likely too large.
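A minimal matplotlib sketch; cost_history is a stand-in for the J values you would record inside a training loop like the one above:
import matplotlib.pyplot as plt

cost_history = [10.0 * 0.95 ** i for i in range(200)]  # stand-in for recorded J values

plt.plot(cost_history)
plt.xlabel("iteration number")
plt.ylabel("cost J")
plt.show()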
Define MAE
Mean Absolute Error - a metric for assessing the accuracy of a regression model.
It is the average of the absolute differences of the residuals.
MAE = (1/n) * sum from i=1 to n ( | y_i,actual - y_i,predicted | )
Smaller values of MAE indicate better model performance.
MAE places bounds on root mean squared error (RMSE):
MAE <= RMSE <= MAE*sqrt(n)
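Computed directly in numpy (sklearn.metrics.mean_absolute_error does the same thing):
import numpy as np

y_actual    = np.array([3.0, -0.5, 2.0, 7.0])   # toy values
y_predicted = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_actual - y_predicted))   # 0.5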
Explain the need for validation and test sets.
You need a validation set to tune hyperparameters without letting your model see your “test” set.
You need a test set because, once you’ve trained a model, you need to be able to assess its performance in the real world (i.e. on data it’s never seen before).
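A common sklearn pattern for carving out both sets (toy data; the split fractions are illustrative):
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)   # toy data

# Hold out a test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)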
Explain regularization in the context of linear regression.
Regularization improves generalizability by penalizing large or non-zero coefficients (i.e., it discourages the model from having more parameters than it needs).
L1 (or lasso) penalizes by the sum of the absolute values of the coefficients (the Manhattan, or L1, norm)
L2 (or ridge) penalizes by half the sum of the squared coefficients (half the squared Euclidean, or L2, norm)
Explain L1 regularization
L1 regularization, or lasso regularization, adds the following term to a cost function: J = J + lambda * (sum from 1 to n of the absolute values of the coefficients). This sum is the L1 (Manhattan) norm of the coefficient vector.
It has the effect of setting small coefficients to exactly 0, thereby performing feature selection. This improves interpretability.
Hyperparameter lambda controls how strong this penalty is.
The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Lasso) is 1.
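A minimal sklearn sketch on toy data:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                  # toy features
y = 3 * X[:, 0] + rng.randn(100)       # only feature 0 actually matters

model = Lasso(alpha=1.0).fit(X, y)     # alpha here plays the role of lambda
print(model.coef_)                     # irrelevant coefficients driven to exactly 0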
Explain L2 regularization
L2 regularization, or ridge regularization, adds the following term to a cost function: J = J + (1/2) * lambda * (sum from 1 to n of the squared coefficients). This sum is half the squared L2 (Euclidean) norm of the coefficient vector.
It has the effect of shrinking small coefficients toward, but not exactly to, 0, so it does not eliminate any features. As a result, models using L2 regularization retain all of their input features, which can make them harder to interpret than L1-regularized models.
Hyperparameter lambda controls how strong this penalty is.
The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Ridge) is 1.
Both L1 and L2 penalties are sensitive to the scale of the features, so the data should be standardized before fitting either kind of regularized model.
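The ridge equivalent in sklearn, with standardization applied first via a pipeline (toy data):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                    # toy features
y = 3 * X[:, 0] + rng.randn(100)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients shrunk toward, but not to, 0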
Logistic regression
- When is it used?
- What is the hypothesis?
- What is the cost function?
- Are there any assumptions?
Logistic regression is the simplest classification algorithm. It predicts the probability of a positive outcome based on a set of input features. These features can be continuous or categorical, but they must be represented numerically. In mathematical terms, it predicts p(y = 1 | x). The predicted class is 1 or 0, depending on whether the predicted probability is greater or less than some threshold (usually 0.5).
The hypothesis is that the log-odds of a positive outcome is a linear combination of input features:
p(x) = e^(b + theta*x) / (1 + e^(b + theta*x))
The cost function is binary cross entropy (also called log loss): J = -(1/n) * sum from i=1 to n ( y_i * log(p_i) + (1 - y_i) * log(1 - p_i) )
Assumptions:
- binary predictions
- independence
- log odds of the output can be modeled as a linear combination of the inputs
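A minimal sklearn sketch on toy data, showing the probability-then-threshold step described above:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                     # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]      # p(y = 1 | x)
preds = (probs >= 0.5).astype(int)        # threshold at 0.5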
Define odds and log odds
odds = p(x) / (1 - p(x))
log(odds) = log( p(x) / (1 - p(x)) )
if p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x)) then log(odds) = b + theta*x
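For example, for p(x) = 0.8:
import numpy as np

p = 0.8
odds = p / (1 - p)       # ~4: a positive outcome is 4x as likely as a negative one
log_odds = np.log(odds)  # ~1.386, also called the logit of p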
How are the coefficients of logistic regression typically estimated?
The coefficients of logistic regression are usually estimated by Maximum Likelihood Estimation (MLE), which amounts to minimizing the binary cross entropy cost.
MLE is usually implemented using quasi-Newton methods (e.g. L-BFGS, the default solver in sklearn’s LogisticRegression).
If you’re implementing by hand, it’s easier to use gradient descent.
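A hand-rolled sketch of that gradient descent route: conveniently, the gradient of binary cross entropy has the same form as in linear regression, with the sigmoid applied to the linear combination (names and hyperparameters are illustrative):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)                # predicted probabilities
        theta -= alpha * X.T @ (p - y) / n    # gradient of binary cross entropy
    return theta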
Explain Maximum Likelihood Estimation (MLE)
MLE is a method used to estimate the parameters of a model. It picks the parameter values such that they maximize the likelihood that the process described by the model produced the data that were actually observed.
I.e., it estimates which curve (e.g. a normal curve) was most likely responsible for generating the data points observed. If we believe a normal distribution generated the data, MLE finds the values of mu and sigma that describe the curve best fitting the observed data.
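For the normal case the MLE solutions have closed forms, so the estimation is a one-liner on a toy sample:
import numpy as np

data = np.random.RandomState(0).normal(loc=5.0, scale=2.0, size=1000)  # toy sample

mu_hat = data.mean()          # MLE of mu is the sample mean
sigma_hat = data.std(ddof=0)  # MLE of sigma uses the 1/n (not 1/(n-1)) convention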
Explain the decision tree learning algorithm.
Decision trees can be used on either categorical or continuous data and can predict either a categorical or a continuous output variable.
In a decision tree, a set of rules is chosen relating to the input features, and the rows of data that meet each rule are passed on to the next level of the tree. The rules, and the features they operate on, are usually chosen automatically by the learning algorithm; a parameter that is commonly set by the modeler, however, is the depth of the tree. It is usual to start with a deeper tree, and then to scale back if the model overfits.
It is common to use a depth of 10, which can produce up to 2^10 = 1024 leaf nodes at the bottom of the tree. But if each node contains only a few examples, the model will be prone to overfitting (not enough data to make generalizable conclusions). A sensible parameter to tune on a validation set in sklearn to handle this problem is max_leaf_nodes (e.g. try between 5 and 500).
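A sketch of tuning max_leaf_nodes on a validation set (toy data; the candidate values are illustrative):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 4)                    # toy features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # toy labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for n_leaves in [5, 50, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_train, y_train)
    print(n_leaves, tree.score(X_val, y_val))  # validation accuracy per setting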