what is the goal in individual level models?
> generate model that predicts unseen data of that individual well
what is the goal in population level models?
1. predict unseen user
2. predict unseen data of known user
what is the limit of a perceptron?
it can only represent linearly separable cases for classification
> classes need to be separable using a hyperplane in the p-dimensional input space
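sketch of this limit, assuming the classic perceptron update rule with labels in {0, 1}. The function name, the epoch limit, and the AND/XOR toy data are illustrative choices, not part of the notes. AND is linearly separable, so training converges; XOR is not, so it never reaches zero errors.

```python
def train_perceptron(samples, epochs=50, lr=1.0):
    """Train a single perceptron on 2-D inputs; return True if it converges."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in samples:
            # The decision boundary w1*x1 + w2*x2 + b = 0 is a hyperplane (a line in 2-D).
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != y:
                errors += 1
                w[0] += lr * (y - pred) * x1
                w[1] += lr * (y - pred) * x2
                b += lr * (y - pred)
        if errors == 0:
            return True  # every training case is classified correctly
    return False  # no separating hyperplane found within the epoch limit


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND_DATA)
xor_converged = train_perceptron(XOR_DATA)
print(and_converged, xor_converged)
```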
multilayer perceptron: can we always find a combination of hyperplanes that completely separate the training data without errors?
> why/why not?
yes, with one requirement:
> no two identical input vectors may have different labels
how do convolutional neural networks work?
like multilayer neural networks, but with additional layers before the conventional (fully connected) layers; these additional layers identify features
two types of layers:
1. convolutional layers
> contain filters that extract features from receptive field
2. pooling layers
> summarize values in a certain region of space and represent output as single neuron
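sketch of the two layer types in plain Python, assuming a single hand-written 2x2 filter and 2x2 max pooling. In a real CNN the filter values are learned; the vertical-edge filter and the toy image here are assumptions for illustration only.

```python
def convolve2d(image, kernel):
    """Slide the filter over the image; each output value covers one receptive field."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Summarize each non-overlapping size x size region by its maximum."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Toy image: dark left half, bright right half.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness increases left to right

features = convolve2d(image, vertical_edge)
pooled = max_pool(features)
print(features[0])  # [0, 2, 0, 0]
print(pooled)       # [[2, 0], [2, 0]]
```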
difference in class separation between neural nets and SVM?
neural nets: aim to find any hyperplane that separates the classes
SVM: aim to find the hyperplane that maximizes the margin, i.e. the distance to the closest points of both classes
what are support vectors?
SVM describes 3 hyperplanes:
1. the separating hyperplane that maximizes the distance between the classes
2. and 3. two parallel hyperplanes, obtained by moving the first one towards each class until the first point(s) of that class lie on it
> the points positioned on those two shifted hyperplanes are called support vectors
how does SVM handle the linear separability problem?
using kernel functions:
> map inputs to a higher dimensional feature space
>>> in the higher dimensional feature space the problem can become linearly separable
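sketch of the mapping idea on XOR, which is not linearly separable in 2-D. Note the assumptions: the feature map `phi` is written out explicitly (a kernel function would compute inner products in that space implicitly), and the weights are hand-picked rather than learned by an SVM.

```python
def phi(x1, x2):
    """Explicit feature map into 3-D: adds the product term x1 * x2."""
    return (x1, x2, x1 * x2)


# Hand-picked hyperplane in the 3-D feature space (illustrative values, not learned).
WEIGHTS = (1.0, 1.0, -2.0)
BIAS = -0.5


def classify(x1, x2):
    z = phi(x1, x2)
    score = sum(w * v for w, v in zip(WEIGHTS, z)) + BIAS
    return 1 if score > 0 else 0


XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predictions = [classify(x1, x2) for (x1, x2), _ in XOR_DATA]
print(predictions)  # [0, 1, 1, 0]
```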
why is k-nearest neighbour also called the lazy learner?
KNN only starts computing when it encounters a new case
what are the general steps in KNN?
1. for the new datapoint, consider the k closest points
2. assign a target value based on the target value of the k neighbours
> classification: majority class
> regression: average
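the two steps above can be sketched as follows, assuming Euclidean distance (the notes do not fix a distance metric) and training data given as (point, target) pairs. All names and the toy data are illustrative.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Step 1: the k training pairs (point, target) closest to the query."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Step 2 (classification): majority class among the neighbours."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Step 2 (regression): average target among the neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


labelled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(labelled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (2.0,), k=3)
print(predicted_class, predicted_value)  # a 20.0
```

all work happens at query time; nothing is precomputed, which is why KNN is called lazy.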
what is the basic idea of distance weighted nearest neighbour?
standard KNN gives all k neighbours equal weight, regardless of how far away they are
> intuitively we want to weigh closer points as more important as they are more similar
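sketch of distance weighting, assuming inverse-distance weights 1/d. The notes do not specify a weighting scheme, so this is one common choice, not the only one. In the toy data, an unweighted majority vote over k=3 would pick "b", while the weighted vote picks the much closer "a".

```python
import math
from collections import defaultdict


def weighted_knn_classify(train, query, k=3, eps=1e-9):
    """Vote among the k nearest neighbours, weighting each vote by 1 / distance."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    scores = defaultdict(float)
    for point, target in neighbours:
        # eps avoids division by zero when the query coincides with a training point
        scores[target] += 1.0 / (math.dist(point, query) + eps)
    return max(scores, key=scores.get)


train = [((0.1,), "a"), ((2.0,), "b"), ((2.5,), "b")]
label = weighted_knn_classify(train, (0.0,), k=3)
print(label)  # a
```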
what are the general steps of decision trees?
1. start with empty tree
2. select most important attribute, and create node
3. create branches for each possible value of that attribute
(for each branch a new decision tree is formed based on the subset of the training set that contains the associated attribute-value combination)
4. recursively repeat until stop condition is met
decision trees: explain entropy
> the amount of information we need to communicate to describe a series, i.e.
> if all instances are the same, we only need to send minimal information >>> entropy 0
> if instances are evenly distributed between two classes, we need to send maximum information (half of the instances) in order to describe the whole set >>> entropy 1
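the two extremes can be checked with a small sketch, assuming the standard Shannon entropy with base-2 logarithm, H = -sum(p_i * log2(p_i)) over the class proportions p_i. The function name is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


pure = entropy(["a", "a", "a", "a"])    # all instances the same
mixed = entropy(["a", "a", "b", "b"])   # evenly split between two classes
print(pure == 0, mixed)
```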
how to make split decisions in decision trees?
the goal: leaves that cover a set of instances of the same class (low entropy)
1. we start with whole dataset, which has high entropy
2. we want to select attributes for our nodes that split the data into subsets with the lowest possible entropy
3. decide based on information gain:
> compare original entropy before split with weighted entropy of the subsets after split
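sketch of the information gain comparison, assuming rows are dictionaries and entropy is the base-2 Shannon entropy. The attribute names and the four-row dataset are hypothetical. `outlook` splits the data into pure subsets (gain 1), `windy` leaves both subsets evenly mixed (gain 0).

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy of the subsets after it."""
    before = entropy([row[target] for row in rows])
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return before - after


# Hypothetical toy dataset.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```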
NB: what is the prior probability?
> probability of observing class g within dataset
> easy: number of observations of class g divided by N
what is the naive assumption of naive bayes?
NB: in order to calculate the probability of observing x given class g we multiply the conditional probabilities of individual attributes
> this assumes conditional independence between attributes
> does not always hold, as attributes might be correlated
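sketch of the prior and the naive product for categorical attributes, assuming plain frequency counts without smoothing. The dataset, attribute names, and function names are hypothetical.

```python
from collections import Counter


def train_naive_bayes(rows, target):
    """Estimate priors P(g) and class-conditional probabilities P(value | g) by counting."""
    class_counts = Counter(row[target] for row in rows)
    priors = {g: n / len(rows) for g, n in class_counts.items()}
    cond_counts = Counter()
    for row in rows:
        for attr, value in row.items():
            if attr != target:
                cond_counts[(row[target], attr, value)] += 1

    def conditional(g, attr, value):
        return cond_counts[(g, attr, value)] / class_counts[g]

    return priors, conditional


def predict(priors, conditional, instance):
    """Score each class as prior times the product of per-attribute conditionals."""
    scores = {}
    for g, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            score *= conditional(g, attr, value)  # the naive independence assumption
        scores[g] = score
    return max(scores, key=scores.get), scores


rows = [
    {"weather": "sunny", "wind": "weak", "play": "yes"},
    {"weather": "sunny", "wind": "strong", "play": "yes"},
    {"weather": "rainy", "wind": "weak", "play": "yes"},
    {"weather": "rainy", "wind": "strong", "play": "no"},
    {"weather": "sunny", "wind": "strong", "play": "no"},
]

priors, conditional = train_naive_bayes(rows, "play")
label, scores = predict(priors, conditional, {"weather": "sunny", "wind": "weak"})
print(priors, label)
```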
what's an advantage of NB?
> missing values can simply be ignored: the missing attribute's factor is left out of the product, so the remaining class conditional probabilities still give a non-zero value
what are two main causes for weak prediction performance?
1. expressive power of the models might be insufficient
2. training data is too limited
>>> solution: ensembles!
what are the two main ensemble methods?
bagging: aim at reducing variance
boosting: aim at reducing bias
explain how bagging works
bagging: bootstrap aggregation
> draw multiple samples from the dataset (with replacement)
> generate a model for each sample
> aggregate output over all models, e.g. majority vote (mean in case of regression)
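the three steps above can be sketched as follows. The base model (a nearest-class-mean classifier on 1-D data), the number of models, and the fixed seed are assumptions for illustration; any learner could take its place.

```python
import random
from collections import Counter, defaultdict


def fit_nearest_mean(sample):
    """Toy base model: predict the class whose mean is closest to x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in sample:
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def bagging_fit(data, n_models=15, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the dataset, drawn with replacement.
        sample = rng.choices(data, k=len(data))
        models.append(fit_nearest_mean(sample))
    return models


def bagging_predict(models, x):
    """Aggregate by majority vote (use the mean instead for regression)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


data = [(1, 0), (2, 0), (3, 0), (4, 0), (7, 1), (8, 1), (9, 1), (10, 1)]
models = bagging_fit(data)
low, high = bagging_predict(models, 2), bagging_predict(models, 9)
print(low, high)
```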
what does bagging avoid?
explain the major steps of boosting
boosting: iteratively create models that focus on areas where mistakes are being made
1. build initial model on whole dataset
2. evaluate performance and form a new training set X2 which weights cases where the initial model made mistakes more heavily
3. repeat m times
4. now we have m models, from which we can aggregate our predictions
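sketch of these steps in the style of AdaBoost, which is one specific boosting algorithm; the notes describe boosting only in general terms. Assumptions: labels in {-1, +1}, decision stumps on 1-D data as base models, and a weighted vote as the aggregation. The toy data cannot be fitted by a single stump, but three boosted stumps classify it correctly.

```python
import math


def stump_predict(x, threshold, polarity):
    return polarity if x > threshold else -polarity


def fit_stump(data, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(data, m=3):
    weights = [1 / len(data)] * len(data)  # step 1: all cases equally important
    ensemble = []
    for _ in range(m):  # step 3: repeat m times
        err, t, polarity = fit_stump(data, weights)
        err = max(err, 1e-10)  # guard against log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # step 2: misclassified cases (y * prediction = -1) get heavier weights
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Step 4: aggregate the m models with a weighted vote."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1


data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
single = adaboost(data, m=1)
boosted = adaboost(data, m=3)
single_errors = sum(predict(single, x) != y for x, y in data)
boosted_errors = sum(predict(boosted, x) != y for x, y in data)
print(single_errors, boosted_errors)  # 2 0
```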
data stream mining: two different solutions
1. data based solutions
> building models on a subset of the full dataset
2. task based solutions
> focus on changing the algorithm to make it more efficient
name 3 feature selection approaches
1. Pearson's correlation
2. forward selection
> start with empty set of attributes and iteratively add the attribute that increases performance most
3. backward selection
> start with set of all attributes and iteratively remove the attribute that decreases performance least
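sketch of forward selection, assuming a caller-supplied `evaluate` function that returns a performance score for a set of attributes (higher is better). In practice `evaluate` would train and validate a model; here it is a hypothetical toy stand-in with made-up attribute names and usefulness values.

```python
def forward_selection(attributes, evaluate):
    """Greedily add the attribute that improves the score most; stop when none helps."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(attributes)
    while remaining:
        score, attr = max((evaluate(selected + [a]), a) for a in remaining)
        if score <= best_score:
            break  # no remaining attribute increases performance
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score


# Hypothetical stand-in for "train a model and measure its performance".
USEFULNESS = {"heart_rate": 0.30, "acceleration": 0.25,
              "timestamp": 0.0, "noise": -0.05}


def evaluate(subset):
    # Small penalty per attribute so useless attributes do not pay off.
    return 0.5 + sum(USEFULNESS[a] for a in subset) - 0.01 * len(subset)


selected, score = forward_selection(USEFULNESS, evaluate)
print(selected)  # ['heart_rate', 'acceleration']
```

backward selection follows the same pattern in reverse: start from all attributes and repeatedly drop the one whose removal lowers the score least.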