AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.
Boosting: Gradient Boosted Decision Trees
A measurement of the cost to the performance task of making a prediction Y’ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications. Example: “Squared error function”.
1: measures the expected error rate if one of the results from a set is randomly applied to one of the items in the set.
An iterative optimization algorithm for finding a local minimum. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of “steepest descent”). Also known as steepest descent.
Optimization: Convex optimization
Optimization: genetic algorithms
starts with a set of random solutions called the “population”. Costs are calculated for all these solutions and the best solutions are saved (called “elitism”). The other solutions are discarded and are replaced using two methods. The first is “mutation” where simple random changes are made to the elite solutions. The second is “crossover” or “breeding” where two elite solutions are combined, such as by replacing random values in one with values in the other. This is repeated either for a set number of iterations or until several iterations have passed with no improvement.
Optimization: hill climbing
starts with a random solution and looks at the set of neighboring solutions for those that are better, this is repeated continuously until no neighboring solution has a lower cost. Major drawback is that it will get stuck in a local minimum
Optimization: Information gain
difference between the current entropy and the weighted-average entropy of the two new groups. Used when deciding which attribute to use when splitting instances in a decision tree
Optimization: random-restart hill climbing
running a hill climbing algorithm several times using new random initial input to attempt to reach the global minimum
Optimization: simulated annealing
similar to hill climbing, but adds a new variable “temperature” which starts high and gradually cools. With each iteration one variable is randomly changed in a certain direction. The new cost is compared to the old cost and is accepted if the new cost is lower. If the new cost is higher, it may be accepted based on a probability that is computed from the current “temperature”. As the temperature cools, it gets less and less likely that the worse solution would be accepted.
Optimization: when do optimization algorithms fail?
when there is no clear slope in the change in cost function as the values are varied. This is because these algorithms all attempt to follow that slope toward a global minimum.
A way of applying Occam’s Razor to real problems.
Penalizes the higher-order parameters to avoid overfitting
Decrease complexity of a model
Get a simpler hypothesis
Improve the generalization of a model
Introducing a regularization term to a general loss function: adding a term to the minimization problem
lambda - controlls the trade-off between fitting the data well & keeping the parameters small.
Stochastic Gradient Descent (SGD)
Stochastic gradient descent (SGD) is an iterative optimization algorithm that can be applied to functions that are a linear combination of differentiable functions. These types of functions often arise when the full objective function is a linear combination of objective functions at each data point, e.g. a least squares objective function. While batch gradient descent uses the full gradient of the function, SGD approximates the full gradient by using the gradient at each of the functions in the linear combination, e.g. the gradient of the objective function at each data point. SGD is often used to optimize non-convex functions, e.g. those that arise in neural networks.
- We use an L2 penalty when fitting the model using least squares
- We add to the minimization problem an expression (shrinkage penalty) of the form λ×∑coefficients
- λ: tuning parameter; controls the bias-variance tradeoff; accessed with cross-validation
- A bit faster than the lasso
- We use an L1 penalty when fitting the model using least squares
- Can force regression coefficients to be exactly: feature selection method by itself