Midterm Prep Flashcards
(98 cards)
How do decision trees work?
A recursive splitting of the data down a tree. At each node you pick the attribute whose split maximizes the information gain (reduction in randomness), i.e., the split that best separates the classes down the possible paths.
What’s the big O notation of XOR vs OR?
In terms of decision tree size over n attributes: XOR (n-way parity) needs an exponential number of nodes, O(2^n), while OR (n-way "any") is linear, O(n).
What is restriction bias?
Restricting the set of hypotheses the learner will consider to a smaller subset of all possible functions.
What is entropy?
A measure of randomness. When you split a decision tree you’re trying to reduce the amount of randomness as much as possible. If after the split you end up with half of each class down each branch, then there was no change in entropy (50x, 50y @ node1 -> node2: 25x, 25y; node3: 25x, 25y). If a split completely classifies the data, you’ve got the maximum entropy reduction.
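A minimal sketch in Python of entropy and information gain; the function names are mine, not from the cards.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting parent labels into children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# The card's example: a 50/50 node split into two 25/25 children.
parent = ["x"] * 50 + ["y"] * 50
even = [["x"] * 25 + ["y"] * 25, ["x"] * 25 + ["y"] * 25]
print(information_gain(parent, even))                      # 0.0 -- no reduction
print(information_gain(parent, [["x"] * 50, ["y"] * 50]))  # 1.0 -- pure split
```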
What is preference bias?
What sort of hypotheses from your hypothesis set do you prefer? This is the heart of inductive bias.
What’s the inductive bias of DTs?
- Good splits at the top
- Prefers correct to incorrect trees
- Prefers shorter to longer trees - this comes naturally from doing good splits at the top. If you do bad splits, the trees would be longer
How do DT’s treat continuous attributes?
Pose it as a question, e.g. 20 <= age < 30, then split on the answer. You can split multiple times on the same continuous variable, but you can’t ask the same question twice. Information gain should pick that up anyway, because repeating a question would show no reduction in entropy. (One common way to pick the threshold is sketched below.)
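A hedged sketch of threshold selection: try the midpoints between consecutive sorted values and keep the one with the best gain. It reuses `information_gain` from the entropy card; the names are mine.

```python
def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    best = (None, -1.0)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = (xs[i] + xs[i - 1]) / 2  # candidate question: value <= t?
        gain = information_gain(ys, [ys[:i], ys[i:]])
        if gain > best[1]:
            best = (t, gain)
    return best

print(best_threshold([22, 25, 31, 38], ["young", "young", "old", "old"]))
# (28.0, 1.0): the question 'age <= 28' perfectly separates the labels
```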
What is pruning?
Collapse subtrees back up into leaves, labeling the new leaf via maybe the average (regression) or a vote (classification). If collapsing doesn’t increase the error too much, keep it; if it does, don’t. This helps prevent overfitting.
Could you use DTs for regression?
Sure - the output at a leaf could be something like the average of its training examples, or a local linear fit. When you split, you could look at the reduction in variance instead of entropy.
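A minimal sketch of that variance-based splitting criterion; the names are mine, not from the cards.

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, children):
    """Regression analogue of information gain."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

# A split that separates low targets from high ones scores well;
# the resulting leaves would then predict, e.g., their average.
print(variance_reduction([1.0, 1.1, 5.0, 5.2], [[1.0, 1.1], [5.0, 5.2]]))
```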
What is cross validation?
Split your training data into subsets (k folds of them). Then use all but one of those folds as your training data and validate against the one remaining fold. Repeat this so that each fold becomes the validation set once, and then average the validation scores.
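A minimal hand-rolled sketch of k-fold CV; `train_and_score` is a placeholder I've assumed (any function that fits on the train split and returns a validation score).

```python
def k_fold_scores(data, k, train_and_score):
    folds = [data[i::k] for i in range(k)]  # k roughly equal subsets
    scores = []
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, val))
    return sum(scores) / len(scores)  # average the validation scores

# Toy usage: a "model" that predicts the training mean, scored by -MSE.
def mean_model(train, val):
    m = sum(train) / len(train)
    return -sum((v - m) ** 2 for v in val) / len(val)

print(k_fold_scores(list(range(10)), k=5, train_and_score=mean_model))
```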
What is IID?
Independent and Identically Distributed. All data is coming from the same distribution/source. Each pull from the distribution is independent of each other and they are all distributed the same.
How do you get a ‘not’ behavior from a perceptron?
Set the weight (or the threshold) to be negative!
What boolean functions can be represented by a single perceptron?
AND, OR, NOT, NAND, NOR - you cannot represent XOR with a single perceptron.
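A hedged sketch of those cards with hand-picked weights and thresholds (the specific values are mine; other choices work too).

```python
def perceptron(weights, threshold):
    """A single unit: fire iff the weighted sum reaches the threshold."""
    return lambda *x: int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

AND = perceptron([1, 1], threshold=2)  # fires only when both inputs are 1
OR  = perceptron([1, 1], threshold=1)  # fires when at least one input is 1
NOT = perceptron([-1], threshold=0)    # the negative weight flips the input

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, AND(a, b), OR(a, b), NOT(a))
# No single weight/threshold choice reproduces XOR's truth table -- XOR is
# not linearly separable, so it needs a network, e.g. AND(OR(a,b), NAND(a,b)).
```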
What is the perceptron training rule?
w_i += delta_w_i
delta_w_i = eta * (y - y_hat) * x_i
where y = target, y_hat = thresholded output, eta = learning rate.
This is a thresholded training rule, i.e., the outputs are on or off.
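A minimal sketch of that rule; the bias term stands in for the (negated) threshold, and the helper names are mine.

```python
def train_perceptron(samples, eta=0.1, epochs=50):
    """samples: list of (x_vector, y) pairs with y in {0, 1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = int(sum(wi * xi for wi, xi in zip(w, x)) + b >= 0)
            for i in range(n):
                w[i] += eta * (y - y_hat) * x[i]  # delta_w_i = eta(y - y_hat)x_i
            b += eta * (y - y_hat)
    return w, b

# AND is linearly separable, so this converges in a handful of epochs.
print(train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]))
```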
What does linear separability mean to the perceptron training rule?
If the data IS linearly separable, then the perceptron training rule WILL find that separation in a finite number of iterations.
What does linearly separable mean?
All data can be separated by a hyperplane (half-space) with no misclassification. Technically you can make any point set separable with N - 1 planes, where N is the number of points.
What is gradient descent?
Learning rule for NNs for when the data is not linearly separable. It will trend toward the target concept over time, but is only guaranteed to converge to a local optimum.
delta_w_i = eta * (y - a) * x_i
where eta = learning rate, y = target output, a = sum(w_i * x_i).
The outputs of these are not thresholded.
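A minimal sketch of that (unthresholded) update; the data and names are mine.

```python
def train_delta_rule(samples, eta=0.05, epochs=200):
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in samples:
            a = sum(wi * xi for wi, xi in zip(w, x))  # a = sum(w_i * x_i)
            for i in range(n):
                w[i] += eta * (y - a) * x[i]          # delta_w_i = eta(y - a)x_i
    return w

# Targets generated by y = 2*x1 - x2, so w should head toward (2, -1).
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
print(train_delta_rule(data))
```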
What’s the difference b/t the perceptron rule and the gradient descent rule?
perceptron: eta * (y - y_hat) * x_i
gradient descent: eta * (y - sum(w_i * x_i)) * x_i
perceptron learning is done on the thresholded output while gradient descent is done on the non-thresholded output.
What is the sigmoid activation function?
S shaped threshold:
sigma(a) = 1 / (1+e^-a)
as a -> -inf, sigma(a) -> 0
as a -> inf, sigma(a) -> 1
This is differentiable (unlike a hard threshold), which is what gradient descent needs; conveniently, sigma'(a) = sigma(a)(1 - sigma(a)).
What is the restriction bias of NNs (what set of hypotheses will be considered)?
perceptron: half spaces
sigmoids: much more complex
Not much restriction. There is a real danger of overfitting, though, with too complex a network. Use CV to help avoid this.
What kind of functions can you represent with a NN of various layers?
Boolean: networks of threshold like units
Continuous (no discontinuities): 1 hidden layer with enough hidden units. Each hidden unit can worry about one little patch of the function and they get patched together at the output.
Arbitrary - with discontinuities: 2 hidden layers
What is the preference bias of NNs (which hypotheses you would prefer given the options)?
Prefer small weights (because larger weights == more complexity - Occam's razor)
What is instance based learning?
Lazy learner - store all of the training data and don’t fit anything up front. Computation is done during prediction instead of during training.
For KNN, what is distance?
It’s really a ‘similarity’ metric, but it could be defined as something like Euclidean (‘as the crow flies’) or Manhattan distance.
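A tiny sketch of those two metrics (`manhattan` is my own helper; `math.dist` is the standard-library Euclidean distance, Python 3.8+).

```python
from math import dist  # Euclidean distance between two points

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(dist(p, q))       # 5.0 -- 'as the crow flies'
print(manhattan(p, q))  # 7   -- city-block distance
```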