Algorithms Flashcards
(20 cards)
Define KNN.
Non-parametric supervised learning method. Identify the k points closest to x, forming the neighborhood N. Classification - estimate the conditional probability for class j as the fraction of points in N whose response value equals j. Regression - predict the mean response value of the points in N. Highly dependent on the choice of K.
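A minimal sketch of KNN classification in Python, assuming scikit-learn and using the iris data purely for illustration:

# KNN classification sketch; the dataset and k value are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)      # k = 5 neighbors
knn.fit(X_train, y_train)

# Predicted class is the majority vote of the 5 nearest training points;
# predict_proba returns the fraction of those neighbors in each class.
print(knn.score(X_test, y_test))
print(knn.predict_proba(X_test[:3]))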
How does K affect bias/variance in KNN.
Small K - unstable or flexible decision boundary, low bias, high variance -> training error rate close to 0
Large K - less flexible, decision boundary is close to linear, high bias, low variance
How do training and test errors change as K increases in KNN?
As K increases -> flexibility reduces -> training error increases. At K=1, training error = 0 for classification
As K increases -> test error initially decreases, reaches a minimum, and then increases again as the method becomes too inflexible. Plotted against flexibility (1/K), the test error shows the characteristic U shape: very small K overfits, very large K underfits.
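A rough sketch of tracing both error curves over a grid of K values, assuming scikit-learn; the dataset and the K grid are illustrative choices:

# Training vs. test error as K varies; expect training error = 0 at K=1 and a
# roughly U-shaped test error across the grid.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in [1, 5, 15, 51, 151]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)
    test_err = 1 - knn.score(X_test, y_test)
    print(k, round(train_err, 3), round(test_err, 3))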
Define decision trees.
Tree-based methods involve stratifying or segmenting the predictor space into simpler regions. To make a prediction for a given observation, we typically use the mean or mode response value of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these methods are called decision trees.
How are decision trees created?
Decision trees are created using a top-down, greedy approach called recursive binary splitting. It is called top-down because it begins at the top of the tree, where all observations belong to one region, and then successive splits are made on the predictor space. Each split is indicated via two new branches further down the tree. It is greedy because at each step of the tree-building process, the best split is made for that particular step rather than looking ahead and picking a split that will lead to a better tree further down.
Explain feature split for a regression decision tree.
We consider all predictors X1, X2, …, Xp and all possible values of the cutpoint s for each predictor. We choose the predictor j and cutpoint s such that splitting into the regions {X | Xj < s} and {X | Xj >= s} gives the greatest possible reduction in the residual sum of squares (RSS). See the sketch below.
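A rough sketch of one such split search in plain Python/NumPy; the function and variable names are mine, not a standard API:

import numpy as np

# One step of recursive binary splitting for regression: scan every predictor j
# and candidate cutpoint s, and keep the pair with the lowest resulting RSS.
def best_split(X, y):
    best = (None, None, np.inf)            # (predictor index j, cutpoint s, RSS)
    p = X.shape[1]
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best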
Disadvantages of vanilla decision trees.
- high variance - a slight change in the training data can lead to major changes in the tree, causing instability
- overfitting - the tree fits the noise in the training data; the resulting tree might be too complex, giving poor test set performance
3 elements in creation of decision trees.
- Selection of the split, i.e. how do we decide which node to split and how to split it?
- If we know how to make a split or grow a tree, how do we decide when to declare a node terminal and stop splitting?
- We have to assign each terminal node a class. How do we assign it?
Advantages of classification decision trees.
- robust to outliers - splits depend only on the ordering of values, no averaging is involved.
- works well on both categorical and continuous data.
- easy to interpret.
Decision tree pruning
A strategy that reduces the size of a decision tree by removing sections of the tree that are non-critical or redundant for classifying instances or making predictions. It reduces the complexity of the model and reduces over-fitting.
Why is pruning important?
One challenge with decision trees is finding the optimal size of the tree. Large trees suffer from overfitting and excess complexity, while a small tree might not capture important structural information from the sample space. A common strategy is to grow the tree to its maximum size and then prune it back, i.e. remove nodes which are non-critical or redundant.
Cost complexity pruning.
– We first grow a large tree and then look for the best subtree obtained by pruning it back.
– Rather than considering all possible subtrees, we consider a small subset of subtrees indexed by a non-negative tuning parameter (alpha).
measure = RSS + alpha * number of leaf nodes
alpha controls trade off between size of the tree and its fit to the training data
alpha = 0, the subtree is the same as the full tree
as alpha increases, there is a price to pay for having a tree with many terminal nodes, so the measure is minimized by a smaller subtree; alpha itself can be chosen by cross-validation. See the sketch below.
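A minimal sketch of cost complexity pruning, assuming scikit-learn's ccp_alpha interface; the dataset is an arbitrary choice:

# sklearn computes the candidate alphas (the pruning path) for us, then we refit
# one pruned tree per alpha: larger alpha -> fewer leaves.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::25]:        # subsample the alpha grid for brevity
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(round(alpha, 3), tree.get_n_leaves(), round(tree.score(X_test, y_test), 3))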
Gini index.
Gives measure of total variance across K classes.
= 1 - (sum of squared class proportions in the node)
takes a small value if all class proportions are close to 0 or 1
– hence also called a measure of node purity. A small value indicates that the node predominantly contains observations from one class.
– used to evaluate quality of a split in decision tree.
– any of these measures might be used when pruning the tree, but the misclassification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
Entropy
= -1 * sum(over all classes) of ( prob * log(prob) )
– takes small value if a node is pure
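A small sketch computing both impurity measures from class proportions; the example proportions are arbitrary:

import numpy as np

def gini(p):
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)            # 1 - sum of squared class proportions

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                           # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # nearly pure node -> small values
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # maximally impure node -> large values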
why is gini index used instead of misclassification error rate.
The Gini index is more sensitive to changes in the node class probabilities than the error rate. It is possible for two splits to have the same error rate at a node while one is purer than the other.
Another case - the weighted misclassification rate of the child nodes can equal that of the parent node; using the error rate would then declare the node terminal, while using the Gini index would continue splitting until the child nodes are purer. See the numeric sketch below.
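A numeric sketch of this situation; the split counts below are made up for illustration (800 observations, 400 per class):

import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def misclass(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - p.max()

def weighted(metric, left, right):
    n = sum(left) + sum(right)
    return sum(left) / n * metric(left) + sum(right) / n * metric(right)

split_a = ([300, 100], [100, 300])   # no pure child node
split_b = ([200, 400], [200, 0])     # right child node is pure

# Both splits have misclassification rate 0.25, but split B has the lower
# weighted Gini (~0.333 vs 0.375), so the Gini index prefers the purer split.
for name, (l, r) in [("A", split_a), ("B", split_b)]:
    print(name, round(weighted(misclass, l, r), 3), round(weighted(gini, l, r), 3))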
Explain bagging in decision trees.
Construct B decision trees using B bootstrapped training datasets (generated by taking repeated samples, with replacement, from the training data). The resulting predictions are averaged (regression) or combined by majority vote (classification).
Trees are grown deep and not pruned. Each tree has high variance but low bias. Averaging the trees reduces variance.
estimate error using validation or cross-validation set.
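A minimal sketch of bagging, assuming scikit-learn's BaggingRegressor, whose default base estimator is an unpruned decision tree; the dataset and B = 200 are illustrative:

# B deep, unpruned trees fit on bootstrapped samples; predictions are averaged.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

bag = BaggingRegressor(n_estimators=200, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())     # cross-validated R^2 estimate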
What is the probability of an observation not being selected in a bootstrapped sample?
(1-1/n)^n
~ 1/e ≈ 0.368 for large n, i.e. roughly 1/3
(each bagged tree makes use of 2/3 of observations)
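A quick numeric check of this limit and of the ~2/3 coverage, using one simulated bootstrap sample (n = 1000 is an arbitrary choice):

import numpy as np

n = 1000
print((1 - 1 / n) ** n, np.exp(-1))        # both ~0.368: chance of being left out

rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)        # one bootstrap sample (with replacement)
print(len(np.unique(sample)) / n)          # fraction of distinct observations included, ~0.63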
explain out-of-bag error estimation for bagged trees.
Each bagged tree makes use of about 2/3 of the observations; the remaining ~1/3 are the out-of-bag (OOB) observations for that tree.
To make a prediction for the ith observation, we consider all trees for which the ith observation was out of bag. This yields around B/3 predictions; average them (or take a majority vote).
Do this for all n observations, from which the overall OOB error can be estimated.
– a valid estimate of the test error, since each response is predicted using only trees where that observation was not used for training.
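A minimal sketch of OOB error estimation, assuming scikit-learn's BaggingClassifier with oob_score=True; dataset and B are illustrative:

# With oob_score=True, each observation is scored only by the trees whose
# bootstrap sample did not include it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print(1 - bag.oob_score_)                  # OOB estimate of the classification error rate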
Difference between bagged trees and random forest.
In a random forest, decision trees are also built on bootstrapped samples. But when building these trees, each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors, and the split is allowed to use only one of these m predictors. Typically m ≈ square root of p.
This leads to decorrelated trees.
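A minimal sketch of the difference, assuming scikit-learn, where max_features=None means every split considers all p predictors (bagging) and max_features='sqrt' samples m = sqrt(p) candidates (random forest); dataset and settings are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Same bootstrapped trees; only the number of split candidates per node changes.
for m in [None, "sqrt"]:
    rf = RandomForestClassifier(n_estimators=300, max_features=m, random_state=0)
    print(m, round(cross_val_score(rf, X, y, cv=5).mean(), 3))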
How does random forest decorrelate the trees?
Example - assume there is one very strong predictor along with a number of moderately strong predictors. In bagged trees, most or all of the trees will use this strong predictor in the top split, so the trees will look quite similar to each other. Averaging over these highly correlated trees does not lead to as large a reduction in variance as averaging over decorrelated trees. Since a random forest limits each split to a random subset of m predictors, on average (p-m)/p of the splits will not even consider the strong predictor.