Boosting

AdaBoost can be interpreted as a sequential procedure for minimizing the exponential loss on the training set with respect to the coefficients of a particular basis function expansion. This leads to generalizations of the algorithm to different loss functions.

Boosting: Gradient Boosted Decision Trees

…

Cost Function

A measurement of the cost to the performance task of making a prediction Y’ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications. Example: “Squared error function”.

Gini Impurity

1: measures the expected error rate if one of the results from a set is randomly applied to one of the items in the set.

Gradient Descent

An iterative optimization algorithm for finding a local minimum. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of “steepest descent”). Also known as steepest descent.

Optimization: Convex optimization

…

Optimization: genetic algorithms

starts with a set of random solutions called the “population”. Costs are calculated for all these solutions and the best solutions are saved (called “elitism”). The other solutions are discarded and are replaced using two methods. The first is “mutation” where simple random changes are made to the elite solutions. The second is “crossover” or “breeding” where two elite solutions are combined, such as by replacing random values in one with values in the other. This is repeated either for a set number of iterations or until several iterations have passed with no improvement.

Optimization: hill climbing

starts with a random solution and looks at the set of neighboring solutions for those that are better, this is repeated continuously until no neighboring solution has a lower cost. Major drawback is that it will get stuck in a local minimum

Optimization: Information gain

difference between the current entropy and the weighted-average entropy of the two new groups. Used when deciding which attribute to use when splitting instances in a decision tree

Optimization: random-restart hill climbing

running a hill climbing algorithm several times using new random initial input to attempt to reach the global minimum

Optimization: simulated annealing

similar to hill climbing, but adds a new variable “temperature” which starts high and gradually cools. With each iteration one variable is randomly changed in a certain direction. The new cost is compared to the old cost and is accepted if the new cost is lower. If the new cost is higher, it may be accepted based on a probability that is computed from the current “temperature”. As the temperature cools, it gets less and less likely that the worse solution would be accepted.

Optimization: when do optimization algorithms fail?

when there is no clear slope in the change in cost function as the values are varied. This is because these algorithms all attempt to follow that slope toward a global minimum.

Regularization

A way of applying Occam’s Razor to real problems.

Penalizes the higher-order parameters to avoid overfitting

Decrease complexity of a model

Get a simpler hypothesis

Improve the generalization of a model

Introducing a regularization term to a general loss function: adding a term to the minimization problem

Regularization: Parameter

lambda - controlls the trade-off between fitting the data well & keeping the parameters small.

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) is an iterative optimization algorithm that can be applied to functions that are a linear combination of differentiable functions. These types of functions often arise when the full objective function is a linear combination of objective functions at each data point, e.g. a least squares objective function. While batch gradient descent uses the full gradient of the function, SGD approximates the full gradient by using the gradient at each of the functions in the linear combination, e.g. the gradient of the objective function at each data point. SGD is often used to optimize non-convex functions, e.g. those that arise in neural networks.

Regularization: Types

Ridge regression:

- We use an L2 penalty when fitting the model using least squares
- We add to the minimization problem an expression (shrinkage penalty) of the form λ×∑coefficients
- λ: tuning parameter; controls the bias-variance tradeoff; accessed with cross-validation
- A bit faster than the lasso

β^ridge=argminβ{∑ni=1(yi−β0−∑pj=1xijβj)2+λ∑pj=1β2j}

The Lasso:

- We use an L1 penalty when fitting the model using least squares
- Can force regression coefficients to be exactly: feature selection method by itself
- β^lasso=argminβ{∑ni=1(yi−β0−∑pj=1xijβj)2+λ∑pj=1||βj||}