# Machine Learning Flashcards

logistic regression

Below is an example logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Where y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
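The equation can be checked numerically; below is a minimal sketch (the function name `predict_logistic` is illustrative, not from any particular library):

```python
import math

def predict_logistic(x, b0, b1):
    """Logistic regression prediction for a single input:
    y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# With b0 = 0 and b1 = 1, an input of 0 gives a probability of 0.5.
print(predict_logistic(0.0, 0.0, 1.0))
```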

random forest

an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Bags both features (a random subset at each split) and training samples (drawn with replacement to build each tree)
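The bagging mechanics can be sketched in a few lines; the helper names below are illustrative, and the training of the individual trees is omitted:

```python
import random
from statistics import mode

def bootstrap_sample(X, y, rng):
    """Sample rows with replacement (bagging) for one tree."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def feature_subset(n_features, rng):
    """Random subset of feature indices, roughly sqrt(n_features),
    considered at each split."""
    k = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), k)

def forest_predict(tree_predictions):
    """Classification: the forest outputs the mode of the trees' votes."""
    return mode(tree_predictions)

print(forest_predict(["cat", "dog", "cat"]))  # majority vote -> "cat"
```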

Gini impurity

a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
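For a small set of labels this is straightforward to compute; a minimal sketch (function name illustrative):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i^2) over the label proportions p_i.
    Equals the probability that a randomly chosen element is mislabeled
    when labels are assigned at random from the subset's distribution."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([0, 0, 0, 0]))  # pure node -> 0.0
print(gini_impurity([0, 1, 0, 1]))  # 50/50 split -> 0.5
```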

LDA

latent Dirichlet allocation

a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

bias variance tradeoff

the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

box-cox

power transformation

a useful data transformation technique used to stabilize variance and make the data more normal distribution-like
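The one-parameter Box-Cox transform has a simple closed form; a minimal sketch for positive inputs (function name illustrative):

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox power transform:
    (x^lam - 1) / lam for lam != 0, ln(x) for lam == 0 (requires x > 0)."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

print(box_cox(math.e, 0))  # log transform -> 1.0
print(box_cox(4.0, 0.5))   # (sqrt(4) - 1) / 0.5 -> 2.0
```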

stochastic gradient descent

a stochastic approximation of gradient descent: an iterative method for minimizing an objective function that is written as a sum of differentiable functions, where each step uses the gradient of a single randomly chosen summand (or a small mini-batch) instead of the full gradient
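A toy sketch on the objective f(w) = sum_i (w - a_i)^2, whose minimizer is the mean of the data; the learning rate and epoch count are arbitrary illustrative choices:

```python
import random

def sgd(data, lr=0.1, epochs=200, seed=0):
    """Minimize f(w) = sum_i (w - a_i)^2 by stepping along the gradient
    of one randomly chosen term per update (the gradient of a single
    term is 2*(w - a_i))."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        a = rng.choice(data)
        w -= lr * 2.0 * (w - a)
    return w

# The minimizer of sum_i (w - a_i)^2 is the mean of the data,
# so the iterates fluctuate around 2.0 here.
print(sgd([1.0, 2.0, 3.0]))
```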

AIC

Akaike information criterion (AIC)

k = number of estimated parameters

L = maximized value of the likelihood function

AIC = 2k - 2*ln(L)

Leave-one-out cross-validation is asymptotically equivalent to AIC for ordinary linear regression models.
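The formula is a one-liner; a minimal sketch (function name illustrative):

```python
def aic(k, log_likelihood):
    """Akaike information criterion: AIC = 2k - 2*ln(L),
    where k is the number of estimated parameters and
    log_likelihood is the maximized log-likelihood ln(L)."""
    return 2 * k - 2 * log_likelihood

# Lower AIC is better: one extra parameter must improve the
# log-likelihood by at least 1 to break even.
print(aic(2, -10.0))  # 2*2 - 2*(-10) -> 24.0
```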

BIC

Bayesian information criterion

BIC = ln(n)*k - 2*ln(L)

k = number of estimated parameters

L = maximized value of the likelihood function

n = number of observations
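As with AIC, this is a one-liner; a minimal sketch (function name illustrative):

```python
import math

def bic(k, log_likelihood, n):
    """Bayesian information criterion: BIC = ln(n)*k - 2*ln(L)."""
    return math.log(n) * k - 2 * log_likelihood

# BIC penalizes parameters more heavily than AIC once n >= 8
# (since ln(8) > 2).
print(bic(2, -10.0, 100))
```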

DIC

deviance information criterion

D(theta) = -2*ln(p(y | theta)) + C

p_D = D_bar - D(theta_bar)  or  p_D = (1/2) * var(D(theta))

DIC = p_D + D_bar
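Given posterior samples of the deviance D(theta), DIC can be estimated with the half-variance form of p_D; a minimal sketch (function name illustrative):

```python
from statistics import mean, pvariance

def dic_from_deviances(deviances):
    """DIC from posterior samples of the deviance D(theta):
    p_D = (1/2) * var(D) (the variance-based estimate of the
    effective number of parameters), DIC = p_D + D_bar."""
    d_bar = mean(deviances)
    p_d = 0.5 * pvariance(deviances)
    return p_d + d_bar

# D_bar = 11.5, var(D) = 1.25, so p_D = 0.625 and DIC = 12.125.
print(dic_from_deviances([10.0, 12.0, 11.0, 13.0]))
```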

SVM

constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
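The margin is just the perpendicular distance from a point to the hyperplane w·x + b = 0; a minimal sketch (function name illustrative):

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0.
    The SVM chooses w, b so that the smallest such distance over the
    training points (the margin) is maximized."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Hyperplane x1 + x2 - 1 = 0; the point (1, 1) lies at distance 1/sqrt(2).
print(distance_to_hyperplane([1.0, 1.0], -1.0, [1.0, 1.0]))
```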

kernel trick

The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x and x_prime in the input space chi, certain functions k(x,x_prime) can be expressed as an inner product in another space V. The function k: chi x chi -> R is often referred to as a kernel or a kernel function.
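A concrete check, using the degree-2 polynomial kernel k(x, z) = (x·z)^2 in two dimensions, for which an explicit feature map into a 3-dimensional space V is known:

```python
import math

def poly2_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without any explicit mapping."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for the 2-D case; the kernel equals the
    inner product <phi(x), phi(z)> in this 3-D space V."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 4.0]
implicit = poly2_kernel(x, z)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(implicit, explicit)  # equal up to floating-point rounding
```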

Odds ratio logistic regression

ln(p(X) / (1 - p(X))) = b0 + b1 * X

The left side is the log-odds (logit); p(X) / (1 - p(X)) itself is the odds.

Logistic regression assumptions

Binary output variable

No error in output variable y (remove outliers first)

Linear model in the log-odds: inputs with a non-linear relationship to the output must be transformed first (Box-Cox, log, root)

Remove highly correlated inputs (check pairwise correlations)

May fail to converge if inputs are highly collinear or the data are sparse

Observations independent

Large sample

Logit

Inverse of the logistic (sigmoid) function

Equals the log-odds when the logistic output represents a probability
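The inverse relationship can be verified numerically; a minimal sketch (function names illustrative):

```python
import math

def log_odds(p):
    """logit(p) = ln(p / (1 - p)): maps a probability to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function, the inverse of the logit:
    maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print(log_odds(p))           # ln(4), about 1.386
print(sigmoid(log_odds(p)))  # recovers 0.8
```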