# ML Flashcards

ML: What are loss functions?

Functions that numerically compare your predictions Y* to the true values Y to provide feedback on the accuracy of your trained model.

ML: What is the typical loss function for classification? For regression?

Classification: 0/1 loss

Regression: Mean-Squared Error (MSE)

ML: What is a generative model?

It models P(Y|X) using P(X|Y)P(Y)

(derived from Bayes rule)

ML: What is a discriminative model?

Models P(Y|X) directly, as P(Y|X)

ML: What are some advantages of discriminative model over generative? What about generative over discriminative?

Discriminative models (Y|X) don’t need to make assumptions about the distribution of X|Y, and can be simpler as a result.

Generative models can be used to generate samples. They are also sometimes more intuitive.

ML: What is a Bayes classifier?

ML: What are 2 important characteristics of a Bayes classifier?

It is the best possible classifier

Its error is thus the irreducible error of a classification problem.

ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?

We would weight each possible error with its own amount of loss L_{i,j}, pertaining to what you said the answer is and what the answer should’ve been.

ML: What is a simplistic view of logistic regression when Y is binary?

We are just doing a linear regression on our X’s, then squishing the outcome into [0,1]

ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?

For x of any dimension, it is as follows:

ML: What is the formula for the logit^{-1} function, or the inverse logit function? And what does it accomplish?

It is shown below. It squishes all real numbers x into a range of [0,1]

ML: In binary logistic regression, given the formula for P(Y=1|X=x), how do we choose our predicted outcome?

We of course simply choose the outcome with the higher probability.

ML: What is the final decision rule for logistic regression in its simplest form?

ML: What decision boundary does binary logistic regression yield?

It yields the *linear separator:*

ML: For binary logistic regression, how do we learn the coefficients of the model B_{0}, B_{1}, B_{2}… in order to make predictions?

We estimate them using simple maximum likelihood. So we find the coefficients B_{0}, B_{1}, B_{2}… that have the highest likelihood of producing the observed training data with the observed labels.

However, this optimization function using 0/1 loss is computationally hard, so we solve it iteratively using something like stochastic gradient descent.

ML: How to we extend binary logistic regression to multinomial logistic regression? So our X’s are still n-dimensional vectors, but now our Y’s are scalars in [1,K]?

ML: What issue arises when logistic regression is performed on data that is perfectly linearly separable? And how can we combat this?

The learned weights will go to infinity, and we will overfit.

We can combat this by regularizing!

ML: What does having a high number of parameters in your model do to the bias/variance tradeoff?

Having a lot of parameters, or a complex model, *decreases bias* (as we have more flexibility to fit whatever truly is the underlying model) but *variance increases*

ML: What does having a high number of features in your model do to the bias/variance tradeoff?

Having a lot of featuers *decreases bias* (as we are able to fit more of the possible underlying models), but *variance increases* due to the curse of dimensionality.

ML: In *general*, what kinds of changes to your model will decrease bias, but increase variance?

When you allow your algorithm the potential ability to fit a *higher amount of potential underlying models*, or *more types of underlying models*.

This of course decreases bias as you are more likely to hit the real model, but it increases variance because (I think) when you have more possible options and the same amount of data, it becomes more likely that one of them appears as the best-fitting model due simply to noise.

ML: In *general*, what sorts of changes to your algorithm will decrease variance, but increase bias?

Changes that make your algorithm less influenced by outliers or noise.

ML: What is a definition of model bias that is useful to think about when considering the bias/variance tradeoff?

Bias in a model is a lack of ability to fit the underlying model.

In other words, a lack of *robustness* with which to fit potential underlying models

ML: What is a definition of model variance that is useful to think about when considering the bias/variance tradeoff?

Model variance is its amount of susceptibility to changes due to noise.

ML: What is the naive bayes assumption, both in words and mathematically?

In words, the naive bayes assumption is that there are *no interactions* between the features of n-dimensional feature vector X, or that the *context* of one of the features X_{i}, and how it interacts with other features X_{j}, is not relevant to prediction.

(An example is assuming seeing “Donald Trump” is has the same predictive value as seeing those two words 10 words apart and in reverse order.)

Mathematically, we have for D-dimensional X:

ML: Is Naive Bayes a discriminative or generative model?

Generative

ML: What is the form of P(Y=y|X=x) under the Naive Bayes classification algorithm?

P

ML: How do we learn a Naive Bayes classifier? What parameters need to be estimated, and what options do we have to estimate them?

For each Y=y, we need to estimate P(Y=y), as well as P(X_{i}=x_{i}|Y=y) for every possible value of X_{i}.

P(Y=y) is generally estimated through MLE. So, we just estimate it as the proportion of the training examples for which Y=y.

For P(X_{i}=x_{i}|Y=y) it can vary. You could also do basic population MLE (The proportion of X_{i}=x_{i }among training examples where Y=y), or you could learn a gaussian N(µ,σ^{2}) as with *gaussian naive bayes*, or something else!

ML: What are the advantages and disadvantages of Naive Bayes?

The disadvantage is that we cannot capture context, which is often very important. Hence this algorithm being “naive”.

The advantage is that it’s simple, and is much easier to train: because we ignore interactions, we only need to learn a bunch of univariate distributions rather than one giant *joint* distribution, which is much harder (I believe exponentially harder).

ML: What do the decision boundaries look like in multinomial logistic regression?

We’re finding n linear (or hyperplane) separators, one for each of the K possible Y=k which models P(Y=k) vs P(Y not k). As such, it looks like a bunch of k linear (or hyperplane) separators working together on the cartesian grid.

ML: What is the most common way to train a decision tree, without yet worrying about pruning?

At each step, simply greedily add the split that most decreases classification error on the training set (or MSE in the case of regression).

ML: What is the common way to validate a decision tree and combat overfitting?

Pruning. Once you have your full tree trained on training data, at each step, greedily remove the branch that causes the largest improvement on validation error. Continue until you can’t find any more improvements to validation error.

ML: How do we predict on a new point X using a trained decision tree, for both classification and regression?

Simply work your way a branch until you get to the bottom. For classification, return the most common label among the training examples at that branch. For regression, return the average training label.

ML: What is the main advantage of decision trees? (Not in ensembles or anything, just by themselves.)

They’re very interpretable, especially to non-technical people

ML: What sre some disadvantages of decision trees? (Not in ensembles or anything, just by themselves.)

They don’t predict very well.

Variance is very high, as small changes in training data can cause big differences in how the splits occur.

They also have a fair amount of bias, as they can only capture certain types of decision bounds (no “diagonal” decision bounds).

They also handle categorical predictor variables poorly if there are too many predictors.

ML: What issue do decision trees have with categorical *predictor* variables?

If the variable has N possible valuest each split, it must consider each of the 2^{N-1} subsets of the values as a splitting rule, which is of course very computationally hard.

ML: As the depth of a decision tree grows, what happens to bias and variance?

Bias decreases (as we increase complexity), and variance increases.

ML: How do ensemble classifiers generally work?

Several models are learned, and they vote on what the prediction is.

ML: What is boosting? How does it generally work?

In boosting, you fit several (typically very basic) classifiers *iteratively*, with each new classifier trying to do well on the training points that the *previous classifiers did poorly on.*

ML: How does boosting improve the performance of the classifiers its using? In otherwords, how does bias and/or variance impacted before and after the use of boosting?

Because it uses very simple predictors (for example, stumps of decision trees), each one individually has *very high bias*. But, by fitting a lot of them and correcting for previous mistakes, *bias is decreased*.

ML: How does prediction work for a KNN classifier or regressor?

For each new point, you find the K points in the training set closest to it (by some measure such as euclidean distance); then for classification you’d return the most common label among the nearest neighbors, and for regression you’d return the average label of the nearest neighbors.

ML: What is the (somewhat high-level) algorithm for Adaboost?

We start with each training example X_{i} having equal weights w_{i}. Then, repeat:

- Fit a new classifier that minimizes weighted training error, based on the weights w
_{i}. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more) - Based on the weighted error of that classifier, find a weight alpha
_{j}for the*classifier*for when it votes later on. The lower its weighted error, the higher its alpha_{j}, and thus the higher its voting power. - Update each of the point weights w
_{i}based on this round’s results.

ML: How (at a high level) does gradient boosting work?

At each step, we find the current gradient of the loss function. We then find our next classifier to “approximate this gradient vector”.

So we in some way optimally take into account the gradient of the loss function at each step as we choose our classifiers.

ML: What issue do we run into when trying to minimize 0/1 loss for our classification problems? How can we solve this issue?

Because the 0/1 loss function is not smooth, as well as nonconvex, optimizing the solution is very hard, and you *also* can’t approximate it with iterative techniques like gradient descent (which is generally the solution when finding the true global minimum is impossible).

To combat this, you can instead iteratively minimize some *similar* loss function that puts an upper bound on 0/1 loss, such as hinge loss (pink), or exponential loss (yellow).

ML: How does bagging work, at a fairly high level? How do we train a baggng ensemble given a training set of n X_{i}’s?

Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.

Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.

ML: How do bias and variance work with bagging? So what is the level bias and variance from one of the individual classifiers in the ensemble, and then what happens to bias and variance when we have several classifiers vote?

The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.

However, when we use several classifiers and have them vote, we can drive down this variance because we’re averaging votes from classifiers trained on different bootstrapped samples. (As n goes infinity, these samples become independent.)

ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?

Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.

Bagging fits several *complex* models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.

ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?

Rather than just picking the class with the most votes from each of the classifiers, average the *probabilities* of each class from each of the classifiers, then predict the class with the highest probability.

(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).

This improvement makes performance more consistent, decreasing variance a bit.

ML: How do Random Forests work?

Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.

The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))

ML: Why is the Random Forest alteration to bagging often advantageous?

By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).

ML: What are two cons of bagging/random forests?

The models lack interpretibility.

Training and prediction both take a long time.

ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?

You use *out-of-bag* error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was *not* used in training some of the classifiers (about 36.8%). So, get a prediction on this point from these classifiers, and find the error on all the points from that method.

This is preferable to cross-validation because *you don’t need to retrain the model for each of the folds*; training takes a long time for random forests, so we want to avoid it. But with enough trees, out-of-bag error tends to be very similar to CV error.

ML: What do variable-importance measures look at? On which algorithm are they most often used and why?

Variable-importance measures look at how important a variable is to the predictions of the model.

They’re commonly used for random forests, because random forests lack interpretability without them.

And more recently in my experience, they’re big for xgboost. Ensemble tree methods just lend themselves to variable importance metrics, because you can look at how often a tree chooses to split on a given variable, how the model performs with and without trees that use a certain variable, etc.

ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?

- Find all of the splits which use that variable in all of the decision trees, and then find how much on average those splits improve the predictions, using some measure of prediction performance like Gini index.
- Randomly permute each variable one at a time, and see how much model performance decreases for each. (I’m guessing this means to calculate out-of-bag error, and for each point, before getting the prediction, choose the value of the variable in question randomly.)

ML: What is clustering, and how does it differ from classification?

Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way “similar”. So basically, we try to label each point such that similar points have the same label.

Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren’t given a labeling scheme, and want to find a sensible one.

ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?

ML: How do you run the k-means clustering algorithm?

After choosing your number of classes k, randomly initialize the locations of your k cluster centers. Then, repeat:

- Assign all training points to the cluster center that is closest by euclidean distance
- Set cluster center as the average of all points in the cluster

ML: What are some drawbacks to k-means clustering?

- It’s nondeterministic due to the random initializations, and is not guarenteed to reach the global optimum.
- Cluster center isn’t interpretable, as it’s an average of a bunch of points

ML: What is an advantage to k-means clustering?

It will always converge, as each step will decrease the in-cluster variance (until it’s at a local minimum)

ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?

Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points.

So we initialize k points randomly to be the initial cluster centers, then repeat:

- Assign all training points to the cluster center (a real point) that is closest to it
- For each cluster, choose the point in the cluster which minimizes mean squared distance from other points as the new cluster center. (In other words, minimize in-cluster variation.)

ML: What is a pro of k-medoids over k-means? What is a con?

K-medoids provides more interpretable cluster centers, as they are actual points in the dataset rather than an average.

However, this is at the cost of having slightly higher in-cluster variation for k-medoids, as you simply have “fewer choices” of where to put your cluster center.

ML: At a high level (so for an arbitrary linkage), how does the hierarchical clustering algorithm work?

Each training point starts as its own cluster. Then at each step, we merge the 2 clusters with the highest similarity into one. We use linkages to determine similarity.

ML: In hierarchical clustering, what are 4 common linkage options for determining similarity of 2 clusters?

- Single linkage: Dissimilarity between 2 clusters is the distance between the two points
*closest*to one another across groups - Complete linkage: Dissimilarity is the distance between the two points
*farthest*to one another across groups - Average linkage: Dissimilarity is the average distance across all pairs of points between groups

4: Centroid linkage: Dissimilarity is the distance between the averages of the points in each group

ML: What are a few advantages to hierarchical clustering?

- It is deterministic, and thus has lower variance
- You get clusterings for all values of k
- You have hierarchical clusters! If your clusters naturally have a sub-cluster architecture, this is useful.

ML: At a very high level, how do mixture models work?

When looking at our training data, we assume they come from a probability distribution that is a *weighted sum* of several simple underlying distributions, such as several gaussians.

ML: What advantage do mixture models have for clustering?

Rather than having each point discretely in one cluster, we compute the probability that each point is in each cluster. This is good for *overlapping clusters*, or for clusterings where some points have an unclear affiliation.

ML: At a high level, how are mixture models used to produce a clustering, and to determine the clustering of a new point?

To produce a clustering from training data, we learn the parameters of the k underlying simple distributions (using a technique called expectation maximization) that we think may be forming the overall distribution.

To “assign each point to a cluster”, we compute the likelihood that each of the simple distributions would have produced this point. Here each simple distribution acts as a cluster, with a center at the simple distribution’s mean, and we see probabilistically from which cluster the point was likely to come from. We thus don’t have a “hard clustering”

ML: For the purpose of clustering using graphs, what are 3 ways to compute a graph from training data?

- Connect 2 points iff their distance is below a threshold epsilon (epsilon-nearest-neighbors)
- Connect 2 points iff one is a KNN of the other, for some k
- Form a weighted graph where points with close euclidean distances have high weights

ML: What is a graph cut? For the purposes of graph clustering, what 2 propert do we want a cut to have?

A graph cut is simply putting the vertices into 2 (or more) subsets, but it can be visualized as “cutting” across the edges which go between your two subsets.

We want our cuts to :

- Have a low
*total weight of cut edges*. So if we’re using an unweighted graph, we want to cut as few edges as possible; if the graph is weighted, we want to cut as little weight as possible. This increases dissimilarity across clusters and similarity within clusters, as edges are drawn between similar points - We want cuts to be
*balanced*(or approximately balanced), meaning that a similar number of points is in each subset. This avoids putting one point in a subset, for example.

ML: What is a key advantage of graph clustering?

It is very good at capturing *non-blob-like clusters*, or clusters with atypical shapes. This can be very useful

ML: At an extremely high level, how does graph clustering work?

Form a graph from your training points, connecting points that are similar.

Then, make graph cuts which separate dissimilar points to form your clusters.

ML: What method is used to do graph clustering with more than 2 clusters?

Spectral embedding (it uses some fancy linear algebra that is related to graph-based dimensionality reduction, but not going to get into weeds here.)

402: What is the “optimal” regression function µ(x), and why is it optimal?

The optimal regression function µ(x) = E[Y|X=x]; it is the expected value of the true function y=f(x) at every point x.

This is the “optimal” regression function because it *minimizes the expected mean squared error*:

E[(Y-µ(x))^{2}]

402: What is the difference between parametric and non-parametric bootstrapping?

In parametric bootstrapping, we sample new simulated data from a model (which is parameterized, hence the name), and that model is learned from the actual data we have.

In non-parametric boostrapping, we don’t learn a model, and instead just resample the points we already have.

402: What are the 3 main types of bootstrapping learned in 402?

Model-based bootstrapping, residuals resampling (which still involves a model, but uses it differently), and case resampling

402: What purpose does bootstrapping serve, and how does it do so?

Bootstrapping allows us to get an *approximate sampling distribution for a statistic* of our data.

So typically we have a dataset of n values X_{i}, and from it we will calculate a variety of statistics: mean, standard deviation, regression coefficient for a model you’re fitting to the data, difference in MSE between 2 models you’re fitting, etc. But for each statistic, you get one value of the statistic per *dataset*, and to get an (approximate) *sampling distribution* for a statistic, you need to simulate many datasets, then use them to find the sampling distribution of the test statistic, which you calculate for every simulated dataset.

402: What is the Bootstrap Principle? In other words, what is the 3-step high-level/theoretical process for bootstrapping?

402: Why is it so often useful to use bootstrapping to get an approximate sampling distribution of a statistic?

We use statistics to make estimations about the underlying data: maybe want the mean, or a coefficient in a model, or the difference in MSE between 2 models. By finding an approximate sampling distribution of such a statistic, we can approximately *quantify the uncertainty around our estimated value.* So if we have a sample mean estimating the expected value of the data µ, we can use bootstrapping to *find a* *confidence interval, or a standard error, or a variance, etc.* And quantifying our uncertainty regarding these estimates is always a responsible thing to do!

402: How do we do bootstrapping with residuals resampling?

Residuals resampling: Here you have data in form of (x,y) pairs. You *learn a model* y~f*(x) (so this is actually parametric), and use it to get the n residuals from your original dataset. Then, for a resampled dataset, the x’s stay the same, but the y’s are f*(x) plus a resampled residual.

So to reiterate, each “simulated dataset” has the *same* x values, which are not simulated. What’s “simulated” is the y values: you find y from f(x), then add a resampled residual.

It really strikes me personally as a sort of *hybrid between a parametric and non-parametric technique.* You’re using a model, and generating a simulated value y from the model for each x. But then you get your residual, which is resampled in a quasi-non-parametric sort of way.

402: How do we do bootstrapping with case resampling?

This one’s pretty simple: each simulated dataset is n full, unaltered data points resampled from the original data, with replacement.

402: How do we do bootstrapping with model-based simulation?

We learn the parameters of some model from the data; then, we use the model to sample entire datapoints.

Example 1: Say we have 1-dimensional datapoints X_{i}. We might learn a normal distribution, then sample points from that distribution.

Example 2: Say we have datapoints of (x,y) pairs. Here we would need to generate a full (x,y) pair from a learned model. Perhaps, for example, we learn a normal distribution for x, and then a regression model for y given x. (We didn’t go into this much in class, and tended to use residuals or case resampling instead.)

402: How do we choose between our 3 bootstrapping options for, say, data in the form of (x,y) pairs? What factors are considered?

Our 3 options have varying levels of assumptions: model-based bootstrapping requires more assumptions about the underlying data than residuals resampling, and residuals resampling requires more assumptions than case resampling. (Case resampling basically assumes nothing; residuals resampling learns a model and assumes that the distribution of residuals is the same for any (x,y) pair; and model-based bootstrap assumes that it can make full (x,y) pairs from a model that the data actually fits).

More assumptions means more efficient use of the data *if you are correct*: you are fitting a more specific model if you use more assumptions. This means that your *confidence intervals will be tighter*, which is good.

So basically, you want to choose the model with the most assumptions *that you can still feel confident about*. We can rarely feel confident about model-based simulation, so we rarely pick it. If the residuals assumption appears true, we pick residuals resampling; otherwise, we just use case resampling and settle for wide confidence intervals.

(See 6.8 in the text, a brief section, for a complete discussion if desired)

402: At a high-level, how do we use kernels for density estimation of a pdf f(x)?

We “smooth out” each datapoint by putting a “kernel” at each datapoint to act as a “little pdf” for that point. The kernels will generally look like normal distributions; if f(x) is higher than 1-dimensional, then a multivariate normal. We then just sum all of the little pdfs together into an estimated real pdf (the sum is normalized so it’s a real probability dist).

The following picture is for a janky example where the width of the normal distributions was too small and where the data weren’t to enough decimal places, but it illustrates the idea:

402: What mathematical objects do we use for causal analysis?

DAGs (directed acyclic graphs)

402: In causal inference DAGs, what does a vertex represent, and what does a directed edge mean?

Each vertex is one of the features of your data being measured. For example, if X = [x_{1}, x_{2},…], then a vertex would be x_{1} or x_{2}.

An edge between 2 attributes means that the two attributes have a nonzero covariance, or a nonzero linear correlation. The direction defines the causal influence, so A -> B means A influences B.

402: In causal inference, why must our graphs be acyclic?

Because we assume that there are no causal cycles.

(for some reason; it seems logical at times. The reasoning was something about how it’s time based, and with A -> B, a change in A results in a change in B *later*, and we can’t “go back in time” or something.)

402: Given a correct DAG for causal inference, how can we make a regression that examines the causal impact of A on B?

We must control for a set of variables that satisfies the backdoor criterion.

In programming terms, this means we include all of the variables in the set, along with A, as predictors of B, and then look at the coefficient of A in predicting B.

402: When predicting B from A using a causal graph, what properties must a set of factors S have to satisfy the backdoor criterion?

S must:

- Block all backdoor paths from A to B (so any paths with an edge pointing out of A and into B, except for the “direct” path)
- Contains no descendants of A

402: What is a fork, a chain, and a collider?

Fork: A <- B -> C

Chain: A -> B -> C

Collider: A -> B <-C

402: When is a Fork (A <- B -> C), Chain (A -> B -> C), or Collider (A -> B <-C) open?

Forks and chains are open when B is not controlled for. Colliders are open when B *is* controlled for.