How do decision trees work?

A splitting of data down a tree. At each node, you’re trying to maximize the information gain (reduction in randomness) which is the split that splits the majority of the data as evenly as possible down the possible paths.

What’s the big O notation of XOR vs OR?

XOR is exponential: O(2^n), while OR is linear.

What is restriction bias?

When the set of possible hypotheses considered becomes restricted to a smaller subset.

What is entropy?

A measure of randomness. When you split a decision tree you’re trying to reduce the amount of randomness as much as possible. If after the split you end up with 1/2 of each class down each branch then there was no change in entropy (50x, 50y @ node1 -> node2: 25x, 25y, node3: 25x, 25y). If you completely classify something you’ve got max entropy reduction.

What is preference bias?

What sort of hypotheses from your hypothesis set do you prefer? This is the heart of inductive bias.

What’s the inductive bias of DTs?

- Good splits at the top
- Prefers correct to incorrect tree’s
- Prefers shorter to longer trees - comes naturally from doing good splits at the top. If you do bad splits, the tree’s would be longer

How do DT’s treat continuous attributes?

Pose it as a question: 20 <= age < 30 then split on the answer. You can split multiple times on the same continuous variable, but you can’t ask the same question twice. Information gain should pick that up anyway though because you wouldn’t see a reduction in entropy by choosing that selection.

What is pruning?

Collapse leaves back up on the training set and label everything via maybe the avg, or vote, etc. if it doesn’t increase the error too much then do it, if it does, don’t do it. This helps prevent overfitting.

Could you use DTs for regression?

Sure - outputs could be something like average or local linear fit. When you split you could look at the variance.

What is cross validation?

Split your training data into subsets (k folds of them). Then use all but one of those sets as your training data and perform validation against the one remaining subset. Repeat this so that each subset becomes the validation set and then you will average the test scores.

What is IID?``

Independent and Identically Distributed. All data is coming from the same distribution/source. Each pull from the distribution is independent of each other and they are all distributed the same.

How do you get a ‘not’ behavior from a perceptron?

Set the weight to negative (or the threshold)!

What boolean functions can be represented by a single perceptron?

AND, OR, NOT, NAND, NOR - you cannot represent XOR with a single perceptron.

What is the perceptron training rule?

w_i += delta_w_i

delta_w_i = eta*(y - y_hat)*x_i

y = target y_hat = output eta = learning rate

This is a thresholded training rule, eg, the outputs will be on or off.

What does linear separability mean to the perceptron training rule?

If the data IS linearly separable, then the perceptron training rule WILL find that separation in a finite number of iterations.

What does linearly separable mean?

All data can be subdivided by half planes with no misclassification. Technically you can make something linearly separable with N - 1 planes where N is the number of points.

What is gradient descent?

Learning rule for NN for when the data is not-linearly separable. It will trend toward the target concept over time but will only converge toward a local optima.

delta_w_i = eta*(y-a)*x_i

eta = learning rate y = target output a = sum(w_i*x_i)

The outputs of these are not thresholded.

What’s the difference b/t the perceptron rule and the gradient descent rule?

perceptron: eta*(y-y_hat)*x_i

gradient descent: eta*(y-sum(w_i*x_i)*x_i

perceptron learning is done on the thresholded output while gradient descent is done on the non-thresholded output.

What is the sigmoid activation function?

S shaped threshold:

sigma(a) = 1 / (1+e^-a)

as a -> -inf, sigma(a) -> 0

as a -> inf, sigma(a) -> 1

This is differentiable.

What is the restriction bias of NNs (What set of hypotheses will be consider)?

perceptron: half spaces

sigmoids: much more complex

Not much restriction. There is a real danger of overfitting though with too complex of a network. Use CV to help avoid this.

What kind of functions can you represent with a NN of various layers?

Boolean: networks of threshold like units

Continuous (no discontinuities): 1 hidden layer with enough hidden units. Each hidden unit can worry about one little patch of the function and they get patched together at the output.

Arbitrary - with discontinuities: 2 hidden layers

What is the preference bias of NNs (which hypotheses you would prefer given the options)?

Prefer small weights (because larger weights == more complexity - occams razor)

What is instance based learning?

Lazy learner - store all of the training data. Don’t fit or do anything like that. Computation is done during prediction instead of fitting.

For KNN, what is distance?

It’s really a ‘similarity’ metric. But it could be defined as something like as the crow flies or manhattan distance.

Which components in KNN represent your domain knowledge?

The distance metric and the number of neighbors.

How does classification work for KNN?

Find the k nearest neighbors as determined by your distance metric. Then you could use voting or a weighted vote.

How does regression work for KNN?

Find the k nearest neighbors as determined by your distance metric. Then take the mean of the y values for the nearest neighbors (or weighted mean, etc).

What’s the time complexity of KNN for learning and queries?

For learning it’s always constant.

query time is log_2+k time because you can basically find the query point in log_2(n) time and then bounce back and forth to find your k points.

if k is on the order of n/2 then it would dominate the log_2 and become O(n) instead of O(log(n))

What is the preference bias for KNN?

- Locality: assumes near points are similar
- Smoothness: averages
- All features matter equally (because of the distance metric) Example: your real function is x1^2 + x2, then in your distance function you could square the difference in x1 instead just using the linear difference.

What’s the curse of dimensionality?

As the number of features (dimensions) grows, you need exponentially more data to generalize accurately.

This applies to all ML problems. The idea being that you want your data points to represent the same ‘amount’ of the dimensional space.

1D:

| x x x | must go to:

2D:

| x x x |

| x x x |

| x x x |

When you add another feature.

What is locally weighted regression (KNN)?

Performing a local regression on the nearest neighbors where the closest ones have the most influence. So you’re building concepts around the local neighbors.

What’s the theorem of no free lunch?

The theory of NFL says that averaged over all problems, no method will do better than random

What is ensemble learning?

Use weak learners to create a bunch of small rules over small parts of the dataset (random small subset) and then combine them all to create a complex rule.

What is bagging/bootstraping?

Create weak learners and train them all on small random subsets of the data, then average the hypotheses learned by each.

What is a weak learner?

No matter what the distribution is, they will do better than chance. Error rate should always be strictly < 0.5

What is boosting?

Start with a distribution of 1/n.

Train a weak learner on the dataset then when we redistribute the likelihood of drawing samples from D such that the next weak learner has higher probability of drawing the samples the previous learner got wrong. Repeat this process for many weak learners. Make a weighted combination of the weak learners in the end to form a final hypothesis.

Boosting says that your classification labels are {-1, +1}

How is boosting like other algorithms?

Like KNN because it took simple hypotheses and made a more complex one (local weighted regression in KNN).

Weighted combination of things is like NN’s.

Why is Boosting non-linear?

Because you pass it through the sgn function at the end which is non-linear.

Why does boosting tend to do well? What’s the intuition?

You’re using weak learners that must get at least 50% of the classifications correct. At each step you re-distribute such that you put more weight on the likelihood of drawing samples that the previous learner got incorrect. So the next weak learner is focusing more on the problems the previous learner got wrong, and they must get > 50% of them correct. This means you should always be improving.

How do SVM’s work?

The goal is to maximize the margin between the data subject to being able to properly classify the data on either side of that decision boundary.

try to minimize 1/2 * || w ||^2 where w is a vector containing your hyperplane parameters.

Ultimately this turns into a quadratic programming problem, and once you solve it you can recover w as:

w = sum(alpha_i * y_i * x_i), but many alphas are 0 so most of your data doesn’t matter, only a few of the x_i’s do. The data that IS left are your support vectors.

How is SVM like KNN?

Only the local points matter (near the decision boundary). Also like KNN except you’ve already done the work of figuring out which points matter and you can throw away the rest.

Like instance based learning except you’ve already put some energy into figuring out which points matter.

What is a notion of similarity in SVM?

In the quadratic equation, there is a term: xT_i*x_j which is the dot product of the vectors. This tells you whether they are orthogonal (0), in the same direction (large +) or opposite directions (large -). They are a form of similarity.

What is the kernel trick?

Transforming your data into a higher dimensional space without actually have to go there.

What is an SVM kernel?

K(x_i, x_j), Your similarity metric. It’s where you inject domain knowledge. This is a dot product that projects your data into a higher dimensional space where it can (could?) be linearly separable.

You don’t actually have to change your data into that dimension, though, because of the kernel trick.

The kernel should really capture your domain knowledge.

The datapoints (x_i, x_j) can be continuous or discrete!

What is the Mercer Condition?

“It acts like a distance or a similarity” which means it’s a well behaved distance function.

How is boosting related to SVM?

As you add more learners, your error doesn’t necessarily change but you increase your confidence. This is like the margin in SVM.

Boosting after normalizing dividing by sum(alpha_t)

|-__-_-_____|__+____+___++|

|—________|_________++++|

| margin |

Why do you want large margins in SVM?

Larger margins tend to minimize overfitting.

When would boosting overfit?

If the weak learners overfit! A bunch of things that overfit when combined can only overfit as well.

The other case is when there is pink noise (uniform noise)

What are mistake bounds?

How many misclassifications can a learner make over an infinite run.

1) Assume possible each variable is positive AND negated (A && NOT A && B && NOT B && …)

2) Given input, compute output, you guess FALSE

3) first time you’re wrong, keep bit pattern in your positive/negated table.

4) every time you’re wrong thereafter, set all positive variables that were 0 to absent, negative variables that were 1 to absent, goto 2

The idea is that whenever you see a true response, any bit patterns that are flipped when compared to the previous true response must be absent because they don’t contribute to the output.

You will make k + 1 mistakes at most (k being number of hypotheses).

What is computational complexity?

How much computational effort is required for a learner to converge. (Not really discussing this topic)

What is sample complexity?

How many training examples are needed for a learner to create a successful hypothesis.

What are version spaces?

Set of all hypotheses that are consistent with the data.

What is a consistent learner?

Of all possible hypotheses that the learner could return, it returns the one that matches the data it has seen so far. It’s consistent with the data.

What is the training error vs True error of a hypothesis?

Training Error: fraction of training examples misclassified by h

True Error: fraction of examples that __would be__ misclassified on sample drawn from D.

error(h) = Pr[c(x) != h(x)] - You’re being penalized proportional to that thing being wrong.

“You are not being penalized for misclassifying samples you don’t ever see”

Describe the components of PAC learning

C: Concept Class

L: Learner

H: Hypothesis Space

n: |H|, size of hypothesis space

D: distribution over inputs

0 <= epsilon <= 1/2: error goal - error in hypothesis should be less than epsilon

0 <= delta <= 1/2: failure probability (with prob (1 - delta), algorithm must be able to produce a true error <= epsilon)

Probably Approximately Correct

(1 - delta), epsilon, h(x) = c(x)

When is something PAC learnable?

C (Concept) is PAC learnable by L (Learner) iff L will with probability (1-delta) output a hypothesis such that error(h) <= epsilon in time and samples polynomial in 1/epsilon, 1/delta, and n

When is the version space epsilon exhausted?

The version space is epsilon exhausted when all the hypotheses in the version space will produce an error(h) <= epsilon.

What is Haussler’s theorem?

A way to bound the true error as a function of the number of training examples drawn.

The minimum number of samples to guarantee you’ve epsilon exhausted your version space is:

m >= 1/eps * (ln |H| + ln 1/delta)

This bound is agnostic to the distribution!

What are VC dimensions?

Answers the question: What is the largest set of inputs that the hypothesis class can [label in all possible ways] (eg, shatter)?

You only must prove it can shatter one collection of points (all combinations) but to prove you can’t shatter, you just need to show one combination that doesn’t work.

How do you find the VC dimension?

You put some number of points down and then no matter how you label those points, you must be able to separate them.

For a d-dimensional hyperplane, VC = d + 1

How do VC dimensions relate to PAC?

H is PAC-learnable if and only if VC dimension is finite. You can’t PAC learn it if the VC dimension is infinite.

What is the VC of finite H space?

d = VC(H) <= log_2 |H|

How does the VC dimension relate to sample complexity?

Like the Haussler’s theorem, except instead of ln |H| you use VC(H)… and there are many other changes.

The point is you can calculate a minimum number of samples to be epsilon exhausted if VC(H) is finite.

What’s Bayes Rule?

P(A|B) = (P(B|A) * P(A)) / P(B)

Allows you to swap what’s on the sides of the bar’s.

Describe bayesian learning

Putting probabilities on various hypotheses.

P(h|D) = P(D|h) * P(h) / P(D)

• in general we drop the `/ P(D)`

though because it doesn’t affect the argmax

h_map = argmax_h P(h|D) = argmax_h P(D|h) * P(h)

What’s the MAP hypothesis?

Maximum a posteriori

h_map = argmax_h P(h|D) = argmax_h P(D|h) * P(h)

What is the ML hypothesis?

Maximum Likelyhood hypothesis

h_ml = argmax_h P(D|h)

This used to contain a P(h) as well, but that basically gets ignored because we assume a uniform distribution over h, so it becomes some constant and it doesn’t really matter what it is. So it basically also doesn’t affect the argmax because it would be the same for everything.

It is also the MAP you get when you have a uniform prior.

How does bayesian learning and information theory fit together?

Transform L_map = argmax_h P(D|h)*P(h) using lg:

argmin_h [-lg P(D|h) -lg P(h)]

where each of these represents a length a la information theory.

so the lg P(h) is like the length of your hypothesis (‘size’ of h) and the lg P(D|h) is a notion of misclassification or ‘error’.

eg it minimizes error using the simplest hypothesis. There is usually a tradeoff, though.

This is called minimum description length.

How does bayesian classification work?

You use a weighted vote for all the h in H.

Eg, weighted vote of Pr(h|D) for all h.

Example:

h1: + -> 0.4

h2: - -> 0.3

h3: - -> 0.3

What’s the MAP? h1, so +

But what’s the most likely label (bayes classification)? minus (-). This is known as bayes optimal classifier.

What is conditional independence?

X is conditional independent of Y given Z if:

for All x, y, and z

P(X=x | Y=y, Z=z) = P(X=x | Z=z)

Also written as: P(X | Y, Z) = P(X | Z)

This means that the P(X) does not depend on Y if Z is given, hence, P(X | Y, Z) -> P(X | Z) … so you can ignore Y when talking about the probability of X.

What is normal independence?

“The joint distribution between two variables is equal to the product of their marginals” - CL

Independence: P(x, y) = P(x) * P(y)

Chain Rule: P(x, y) = P(x | y) * P(y)

Setting them equal:

P(x) * P(y) = P(x | y) * P(y)

P(x) = P(x | y)

What are belief networks?

Dependency graphs or probabilities.

Storm -> Lightening -> Thunder

means:

- Lightening is dependent on Storm
- Thunder is dependent on Lightening
- But Thunder is conditionally independent of storm, given lightening (b/c there is no arrow from storm to lightening)
- If there was an arrow from Storm -> Thunder then Thunder would also be dependent on Storm.

How do you recover the joint distribution from a belief network?

You multiple the probabilities of the entire graph back together:

A -> C -> D

B -> C

B -> D

A = P(A) : 1 prob B = P(B) : 1 prob C = P(C | A, B) : 4 probs D = P(D | B, C) : 4 probs

Joint = P(A) * P(B) * P(C | A, B) * P(D | B, C)

The number of probabilities at each node is exponential in the number of immediate parents.

Why sample in Bayesian Learning?

- to simulate a complex process
- approximate inference - rather than back compute some value, just sample from that scenario - plus it’s faster
- visualization - get a feel for the data

What is marginalization (stats)?

Represent the probability of a value by summing over another variable and looking at the joint probabilities.

Example:

P(x | sum_y(P(x, y))

sum of:

(x///( ) y) places where x and not y

(x (//) y) and places where x and y

Another example:

P(x) = P(x, y) + P(x, not y)

What is Naive Bayes?

A special case of a belief network where you’ve got a root node (label) producing many different attributes values. Each attribute has no children and only has the root node as its single parent.

P(V | a1, a2, a3…) = prod_i(P(a_i | V) * P(V) / Z

MAP Class = argmax_V P(V) * prod_i(P(a_i | V)) - The most likely class given the data that we’ve seen.

Naive bayes can handle missing attributes.

Why is Naive Bayes powerful?

- Num. of parameters you need to know is linear. 1 prob for the root node (label/class) and 2 for each attribute.
- You can estimate the params with labeled data as:

P(a_i | V) = # a_i, V / # V => how often do you have that attribute in that class divided by the number of times that class showed up.

• Can do inference in both directions

Why is Naive Bayes… naive?

It assumes all the attributes are conditionally independent of each other! (root note -> attributes). It doesn’t model interrelationships between attributes.

It’s OK because we don’t care about the absolute probabilities that come out of it, only that we get the correct label (so ordering of probabilities is what matters).

Describe random restart hill climbing

- Guess some value, x
- Check both sides of your current position
- Move to the highest of your neighboring points
- Repeat until both of your neighbors are lower than you
- Once you’ve reached your local maximum, restart in some other random position and repeat
- Do this some number of iterations and output the best result
- In general this is an exploitation algorithm because it’s always going uphill. Maybe we could call the random restarts the exploration.

When might random hill climbing have trouble finding the solution?

If you’ve got a wide basin of attraction on a local optima (that is not the global) then the majority of the time you’re not reaching the global optima, especially if the global optima has a narrow basin of attraction. It may take a very large number of iterations to find it.

Describe simulated annealing

A random search algorithm that trades off exploration and exploitation by starting off with more exploration and moving toward exploitation over time.

Algorithm:

• Sample a new point, x_t in neighborhood N(x)

• Compute probability of moving to that point as:

P(x, x_t, T):

• 1 if f(x_t) >= f(x) (neighbor is >= to you, make the move)

• e^([f(x_t) - f(x))]/T)

• Decrease Temperature, T

f(x_t) - f(x) will always be negative, so you’ve really got 1/e^(difference/T)

- When T is large the exponent goes toward 0 and 1/1 = 1, so will move downhill.
- As the temperature decreases, exponent becomes very small, which magnifies the exponent, making the probability very small.

So you start out with more exploration, accepting many downhill neighbors when T is large and accepting less downhill neighbors choosing mostly to move uphill as T gets smaller.

What are the properties of simulated annealing?

- When T => 0: like hill climbing
- When T => inf: random walk
- In general, you should decrease T slowly

P(ending at some point, x) = e^(f(x)/T) / Z_t (boltzmann distribution)

So more probability mass is put on that point as the Temperature goes down.

Describe Genetic Algorithms

A parallel random search algorithm that maintains population of searchers that ‘evolve’ over time in hopes of landing on a good solution.

Algorithm:

Create a random population

• compute fitness for each member

• select most fit individuals (cull the heard)

•pair up individuals and breed (crossover - a notion of locality) to replenish the population

• allow ‘mutation’ to occur (local search)

What is one point crossover (GA’s)?

A method of generating ‘children’ from ‘parents’ where the children have n bits from one parent and length-n bits of the other parent. Each child has the opposite set of bits from each parent.

This does imply some kind of bias about the locality of the bits - that some bit groups should stay together.

What is Uniform Crossover (GA’s)?

Basically just randomizing whether you keep or flip the bits at each bit position.

Describe the MIMIC algorithm

- Generate samples from P(x) theta_t (starts out uniform)
- set theta_t+1 to n_th percentile (fittest)
- retain only those samples such that f(x) >= theta_t+1
- Estimate the new distribution, P(x) theta_t+1
- Repeat

You’re taking theta_min toward theta_max over time.

It’s like the water rising around a bunch of mountainous islands. The peaks remain visible and the water keeps rising until the peak(s) become single point(s) (if equal height peaks).

How is MIMIC like genetic algorithms?

- generating samples from P(x) theta_t : this is like your population
- keeping the nth percentile of samples with f(x) >= theta_t: is like culling the population in GA

What are dependency trees and how are they used in MIMIC?

A belief network where each node in the tree has exactly one parent (except for the root). Each variable then depends on exactly one other variable.

They are used used to estimate the distribution P(x) theta_t. By using dependency trees we are assuming each variable will have at most one dependent variable.

Typically we would have:

•P(x) = P(x1 | x2,… xn) * P(x2 | x3 … xn) … * P(xn)

With dependency trees, it becomes:

•P(x)_pi = prod(P(x_i | pi(x_i))

Where pi is the parent of x_i.

Does MIMIC require the use of dependency trees?

No - using dependency trees is just the easiest thing to do while still allowing you to represent dependent relationships. You could use bayesian networks or unconditional probability distributions, etc.

What is KL divergence and how is it used (in the context of MIMIC and distributions)?

It is a method of describing the similarity between two distributions. A measure of the divergence between the underlying distribution P and the candidate distribution, Phat_pi.

The goal in MIMIC is to minimize the difference between the dependency tree generated (Phat_pi) and the underlying distribution (P).

- D_kl (P || Phat_pi) = sum( P( lg(P) - lg (Phat_pi) )
- max J_pi = Sum( I( x_i ; pi(x_i)) ): mutual information

The goal is to maximize the mutual information between each feature and its parents. eg, the best dependency tree is the one that captures dependencies the best.

How do you find a dependency tree for MIMIC?

- Start with a fully connected graph between all attributes.
- Find the maximum spanning tree (use Prim)!

Ultimately you start with the fully connected graph with mutual information (which is bidirectional) on each edge. Then after applying Prim it is reduced to the maximum spanning tree which is the spanning tree that has the maximum weight.

When does MIMIC perform well?

- Does well with structure. When the optimal values you care about depend on the structure as opposed to specific values. You get a lot of information at each iteration.
- MIMIC is also useful when the fitness function is very complex to calculate. Other search methods that required hundreds of thousands of iterations would need to evaluate this complex function over and over while MIMIC would do it in orders of magnitude less.

What are the downfalls of MIMIC?

- can still get stuck in local optima, but in general it tends not to
- It’s execution time for single iterations is slow, but it doesn’t have to do many iterations as compared to other randomized search algorithms.

What is mutual information?

A measure of the reduction of randomness of one variable given knowledge of some other variable.

I(x, y) = H(y) - H( y | x )

What is joint entropy?

Randomness contained in two variables together.

•H(x, y) = - sum( P(x, y) log(P(x, y) )

What is conditional entropy?

Measure of randomness of one variable given another variable.

H(y | x) = - sum( P(x, y) log(P( y | x) )

What is the conditional and joint entropy if the variables are independent?

Conditional: H(y | x) = H(y)

Joint: H(y, x) = H(y) + H(x)

What is entropy?

A measure of randomness or information.

•- sum(P(s) * lg P(s))