Chapter 6 - Feature Selection Flashcards
what is the aim of feature selection?
automatically identify a meaningful, smaller subset of the feature variables
why do different types of models have different best feature sets?
different models draw different types of boundaries and allow different degrees of flexibility when you change their parameters
for d features, how many possible feature sets are there?
2^d
what is combinatorial optimisation?
finding the best solution in a discrete search space; here the space is binary, since each of the d features is either in or out of the subset
give the steps of the wrapper method for feature selection
1. start with an initial guess for a good set of features
2. train and test a model (possibly with cross-validation)
3. if the test error is deemed good enough, stop
4. otherwise, choose a new set of features and go to step 2
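A minimal sketch of this loop, assuming scikit-learn; the dataset, classifier, search strategy (random subsets) and stopping threshold are all illustrative choices, not part of the definition:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
d = X.shape[1]

def evaluate(features):
    # step 2: train and test a model on just this feature subset (with cross-val)
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, features], y, cv=5).mean()

rng = np.random.default_rng(0)
best_score, best_set = -np.inf, None
subset = np.flatnonzero(rng.random(d) < 0.5)      # step 1: an initial guess
for _ in range(20):
    score = evaluate(subset)
    if score > best_score:
        best_score, best_set = score, subset
    if best_score >= 0.97:                        # step 3: good enough, stop
        break
    subset = np.flatnonzero(rng.random(d) < 0.5)  # step 4: new set, back to step 2
    if subset.size == 0:
        subset = np.array([0])

print(best_set, best_score)
```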
name some wrapper methods
greedy search
genetic algorithm
simulated annealing
branch and bound
what is forward selection?
add features greedily and sequentially. Find which of the remaining ones improves our model the most and add it permanently to our set
what is backward elimination?
start with the full feature set; sequentially evaluate removing each feature and permanently discard the one that damages performance the least.
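A sketch of greedy forward selection under the same assumptions (scikit-learn, illustrative dataset and classifier); backward elimination is the mirror image, starting from the full set and dropping the least damaging feature each round:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

def score(features):
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, features], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):  # illustrative budget: grow the set to 3 features
    # evaluate adding each remaining feature to the current set
    best = max(remaining, key=lambda f: score(selected + [f]))
    selected.append(best)   # permanently add the one that improves the model most
    remaining.remove(best)
    print(selected, round(score(selected), 3))
```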
what is stepwise, or floating selection?
wrapper method that combines forward and backward selection
two steps forward and one step back
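For reference, scikit-learn ships a ready-made wrapper for the forward/backward variants (it does not implement the floating variant; to my knowledge the mlxtend library offers that):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=3,   # illustrative budget
    direction="forward",      # "backward" gives backward elimination
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features
```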
what are filter methods?
find out how useful a feature is without training any models
Describe Pearson's correlation coefficient equation in words
covariance of the two variables divided by the product of their standard deviations
Pearson's correlation coefficient, r = ?
r = sum((x - xmean)(y - ymean)) / sqrt( sum((x - xmean)^2) × sum((y - ymean)^2) )
how do we rank features?
In order of the absolute value of the correlation coefficient
what type of correlation does Pearson's r measure?
linear
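A short sketch of this filter with NumPy on synthetic data (only features 0 and 2 actually drive y here, so they should top the ranking):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

def pearson_r(x, y):
    # r = sum((x - xmean)(y - ymean)) / sqrt(sum((x - xmean)^2) * sum((y - ymean)^2))
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r = np.array([pearson_r(X[:, j], y) for j in range(X.shape[1])])
print(np.argsort(-np.abs(r)))  # rank features by |r|, strongest first
```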
what is entropy?
a measure of the uncertainty of a random variable; for a label Y, H(Y) = -sum p(y) log2 p(y)
what is information gain?
it quantifies the reduction in uncertainty (entropy) about the target that we get from knowing a new feature
information gain I(X;Y) = ?
H(Y) - H(Y|X)
another name for information gain is?
mutual information
when we are measuring the information gain of each feature, I(X;Y), we are really calculating? (hint: relation to Y)
the mutual information between the feature and the target label
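A sketch of I(X;Y) = H(Y) - H(Y|X) for a discrete feature and label (toy values; continuous features would need discretising first, or an estimator such as scikit-learn's mutual_info_classif):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum over y of p(y) log2 p(y)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    # I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = sum over v of p(x=v) * H(Y | x=v)
    h_cond = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - h_cond

x = np.array([0, 0, 1, 1, 1, 0])  # a binary feature (toy data)
y = np.array([0, 0, 1, 1, 0, 0])  # the target label
print(information_gain(x, y))     # ~0.46 bits of uncertainty removed
```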
what is the major advantage of information gain?
it can detect nonlinear relationships between a feature and the target, which Pearson's r cannot
what is the major disadvantage of information gain?
it scores each feature on its own, so it may choose redundant features; including several very similar features can be detrimental to the learning algorithm
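A quick illustration of that disadvantage, assuming scikit-learn's mutual_info_classif: because the filter scores each feature independently, an exact duplicate earns the same top score and nothing flags the redundancy:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=500)
y = (x ^ (rng.random(500) < 0.1)).astype(int)  # y is x with ~10% label noise
X = np.column_stack([x, x, rng.integers(0, 2, size=500)])  # column 1 duplicates column 0

scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(scores)  # both identical columns score high; the noise column scores ~0
```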
if I use a wrapper method with N samples and d features and use a greedy search, how many models will I have to create?
d(d+1)/2: the first pass evaluates d candidate models, the next pass d - 1, and so on down to 1, so d + (d-1) + ... + 1 = d(d+1)/2 (e.g. 55 models for d = 10)
why do we perform feature selection (3)
logistical - there may be too much data to process
interpretability - we may collect more data than is useful, and a smaller feature set is easier to interpret
overfitting - inclusion of too many features could mean we overfit
pros (3) and cons (2) of forward and backward selection (wrapper methods)
Pros:
- the impact of a feature on the ML classifier is explicit
- for each subset of features, we know exactly how well the model performs
- far cheaper than an exhaustive search over all 2^d subsets
Cons:
- no guarantee of finding the best solution
- need to train and evaluate lots of models