Midterm Prep Flashcards Preview

CS 7641 - OMSCS > Midterm Prep > Flashcards

Flashcards in Midterm Prep Deck (98)
Loading flashcards...

How do decision trees work?

A splitting of data down a tree. At each node, you're trying to maximize the information gain (reduction in randomness) which is the split that splits the majority of the data as evenly as possible down the possible paths.


What's the big O notation of XOR vs OR?

XOR is exponential: O(2^n), while OR is linear.


What is restriction bias?

When the set of possible hypotheses considered becomes restricted to a smaller subset.


What is entropy?

A measure of randomness. When you split a decision tree you're trying to reduce the amount of randomness as much as possible. If after the split you end up with 1/2 of each class down each branch then there was no change in entropy (50x, 50y @ node1 -> node2: 25x, 25y, node3: 25x, 25y). If you completely classify something you've got max entropy reduction.


What is preference bias?

What sort of hypotheses from your hypothesis set do you prefer? This is the heart of inductive bias.


What's the inductive bias of DTs?

• Good splits at the top
• Prefers correct to incorrect tree's
• Prefers shorter to longer trees - comes naturally from doing good splits at the top. If you do bad splits, the tree's would be longer


How do DT's treat continuous attributes?

Pose it as a question: 20 <= age < 30 then split on the answer. You can split multiple times on the same continuous variable, but you can't ask the same question twice. Information gain should pick that up anyway though because you wouldn't see a reduction in entropy by choosing that selection.


What is pruning?

Collapse leaves back up on the training set and label everything via maybe the avg, or vote, etc. if it doesn't increase the error too much then do it, if it does, don't do it. This helps prevent overfitting.


Could you use DTs for regression?

Sure - outputs could be something like average or local linear fit. When you split you could look at the variance.


What is cross validation?

Split your training data into subsets (k folds of them). Then use all but one of those sets as your training data and perform validation against the one remaining subset. Repeat this so that each subset becomes the validation set and then you will average the test scores.


What is IID?``

Independent and Identically Distributed. All data is coming from the same distribution/source. Each pull from the distribution is independent of each other and they are all distributed the same.


How do you get a 'not' behavior from a perceptron?

Set the weight to negative (or the threshold)!


What boolean functions can be represented by a single perceptron?

AND, OR, NOT, NAND, NOR - you cannot represent XOR with a single perceptron.


What is the perceptron training rule?

w_i += delta_w_i
delta_w_i = eta*(y - y_hat)*x_i

y = target
y_hat = output
eta = learning rate

This is a thresholded training rule, eg, the outputs will be on or off.


What does linear separability mean to the perceptron training rule?

If the data IS linearly separable, then the perceptron training rule WILL find that separation in a finite number of iterations.


What does linearly separable mean?

All data can be subdivided by half planes with no misclassification. Technically you can make something linearly separable with N - 1 planes where N is the number of points.


What is gradient descent?

Learning rule for NN for when the data is not-linearly separable. It will trend toward the target concept over time but will only converge toward a local optima.

delta_w_i = eta*(y-a)*x_i

eta = learning rate
y = target output
a = sum(w_i*x_i)

The outputs of these are not thresholded.


What's the difference b/t the perceptron rule and the gradient descent rule?

perceptron: eta*(y-y_hat)*x_i
gradient descent: eta*(y-sum(w_i*x_i)*x_i

perceptron learning is done on the thresholded output while gradient descent is done on the non-thresholded output.


What is the sigmoid activation function?

S shaped threshold:

sigma(a) = 1 / (1+e^-a)

as a -> -inf, sigma(a) -> 0
as a -> inf, sigma(a) -> 1

This is differentiable.


What is the restriction bias of NNs (What set of hypotheses will be consider)?

perceptron: half spaces
sigmoids: much more complex

Not much restriction. There is a real danger of overfitting though with too complex of a network. Use CV to help avoid this.


What kind of functions can you represent with a NN of various layers?

Boolean: networks of threshold like units
Continuous (no discontinuities): 1 hidden layer with enough hidden units. Each hidden unit can worry about one little patch of the function and they get patched together at the output.
Arbitrary - with discontinuities: 2 hidden layers


What is the preference bias of NNs (which hypotheses you would prefer given the options)?

Prefer small weights (because larger weights == more complexity - occams razor)


What is instance based learning?

Lazy learner - store all of the training data. Don't fit or do anything like that. Computation is done during prediction instead of fitting.


For KNN, what is distance?

It's really a 'similarity' metric. But it could be defined as something like as the crow flies or manhattan distance.


Which components in KNN represent your domain knowledge?

The distance metric and the number of neighbors.


How does classification work for KNN?

Find the k nearest neighbors as determined by your distance metric. Then you could use voting or a weighted vote.


How does regression work for KNN?

Find the k nearest neighbors as determined by your distance metric. Then take the mean of the y values for the nearest neighbors (or weighted mean, etc).


What's the time complexity of KNN for learning and queries?

For learning it's always constant.

query time is log_2+k time because you can basically find the query point in log_2(n) time and then bounce back and forth to find your k points.

if k is on the order of n/2 then it would dominate the log_2 and become O(n) instead of O(log(n))


What is the preference bias for KNN?

• Locality: assumes near points are similar
• Smoothness: averages
• All features matter equally (because of the distance metric) Example: your real function is x1^2 + x2, then in your distance function you could square the difference in x1 instead just using the linear difference.


What's the curse of dimensionality?

As the number of features (dimensions) grows, you need exponentially more data to generalize accurately.

This applies to all ML problems. The idea being that you want your data points to represent the same 'amount' of the dimensional space.

| x x x | must go to:

| x x x |
| x x x |
| x x x |

When you add another feature.