what is the goal in individual level models?

what is the goal in population level models?

individual level:

> generate model that predicts unseen data of that individual well

population level:

1. predict unseen user

2. predict unseen data of known user

what is the limit of a perceptron?

it can only represent linearly separable cases for classification

> classes need to be separable using a hyperplane in the p-dimensional input space
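A minimal sketch of this limit, assuming the classic perceptron update rule with a step activation (the function name `train_perceptron` and the toy datasets are illustrative, not from the notes). AND is linearly separable, so training reaches zero errors; XOR is not, so every epoch still contains at least one mistake.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Train a 2-input perceptron; returns (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in data:
            # Step activation on the weighted sum
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            if pred != y:
                errors += 1
                # Move the hyperplane towards the misclassified point
                w[0] += lr * (y - pred) * x1
                w[1] += lr * (y - pred) * x2
                b += lr * (y - pred)
        if errors == 0:
            return w, b, True  # a separating hyperplane was found
    return w, b, False  # no error-free epoch within the budget


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)[2]
xor_converged = train_perceptron(XOR)[2]
```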

multilayer perceptron: can we always find a combination of hyperplanes that completely separate the training data without errors?

> why/why not?

yes, with one requirement:

> no two identical input vectors may have different labels

how do convolutional neural networks work?

like multilayer neural networks, but with additional layers preceding the conventional layers; these additional layers identify features

two types of layers:

1. convolutional layers

> contain filters that extract features from receptive field

2. pooling layers

> summarize values in a certain region of space and represent output as single neuron
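A small pure-Python sketch of both layer types, assuming a single 2x2 filter, stride 1, no padding, and 2x2 max pooling (all of these choices, the function names, and the toy image are assumptions for illustration). As in most CNN libraries, the "convolution" here is technically a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the filter over the image; each output value covers one receptive field."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Summarize each non-overlapping size x size region by its maximum."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# Toy image with a vertical edge between column 1 and column 2
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked filter that responds to dark-to-bright vertical edges
edge_filter = [[-1, 1],
               [-1, 1]]

feature_map = convolve2d(image, edge_filter)  # 4x4, non-zero only at the edge
pooled = max_pool(feature_map)                # 2x2 summary
```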

difference in class separation between neural nets and SVM?

neural nets: aim to find any hyperplane that separates the classes

SVM: aim to find the hyperplane that maximizes the margin, i.e. the distance to the closest points of both classes

what are support vectors?

SVM describes 3 hyperplanes:

1. the separating hyperplane that maximizes the distance between the classes

2. and 3. two parallel copies of this hyperplane, one shifted towards each class until the first points of that class lie on it

> the points positioned on those two shifted hyperplanes are called support vectors

how does SVM handle the linear separability problem?

using kernel functions:

> map inputs to a higher dimensional feature space

>>> in that space the problem is (ideally) linearly separable
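A sketch of the mapping idea on XOR, which is not linearly separable in 2 dimensions. It assumes an explicit feature map that adds the product `x1 * x2` as a third dimension and uses hand-picked weights; a real SVM would learn the hyperplane and apply the kernel implicitly rather than computing the mapped vectors.

```python
def feature_map(x1, x2):
    """Explicit map from 2-D input space to a 3-D feature space (illustrative choice)."""
    return (x1, x2, x1 * x2)


def linear_classifier(z, w=(1.0, 1.0, -2.0), b=-0.5):
    """A single hyperplane in feature space; the weights are hand-picked, not learned."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# One hyperplane in the 3-D space reproduces all four XOR labels
predictions = {x: linear_classifier(feature_map(*x)) for x in XOR}
```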

why is k-nearest neighbour also called the lazy learner?

KNN only starts computing when it encounters a new case

what are the general steps in KNN?

KNN:

1. for the new datapoint, consider the k closest points

2. assign a target value based on the target value of the k neighbours

> classification: majority class

> regression: average

3. done
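The steps above can be sketched as follows, assuming Euclidean distance and training data stored as `(point, target)` pairs (the function name `knn_predict` and the toy data are illustrative). Note that there is no training step at all, which is why KNN is called lazy.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3, task="classification"):
    """Predict the target of `query` from its k closest training points."""
    # Step 1: consider the k closest points
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    targets = [target for _, target in neighbours]
    # Step 2: aggregate the neighbours' targets
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]  # majority class
    return sum(targets) / len(targets)                # average


labelled = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((6, 6), "b"), ((7, 6), "b")]
numeric = [((1,), 1.0), ((2,), 2.0), ((3,), 3.0), ((10,), 10.0)]

label = knn_predict(labelled, (1.5, 1.5), k=3)
value = knn_predict(numeric, (2,), k=3, task="regression")
```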

what is the basic idea of distance weighted nearest neighbour?

plain KNN does not consider how far away each of the k neighbours is

> intuitively we want to weigh closer points as more important as they are more similar
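A sketch of one common weighting scheme, assuming each neighbour votes with weight 1/distance (other schemes exist; the function name, the small `eps` guard against division by zero, and the toy data are assumptions). In the example a plain majority vote over k=3 would pick "b", while the weighted vote picks the much closer "a".

```python
import math
from collections import defaultdict


def weighted_knn_classify(train, query, k=3, eps=1e-9):
    """Classify `query`; each of the k neighbours votes with weight 1 / distance."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        # Closer points get larger weights; eps avoids dividing by zero
        votes[label] += 1.0 / (math.dist(point, query) + eps)
    return max(votes, key=votes.get)


train = [((0.0,), "a"), ((3.0,), "b"), ((4.0,), "b")]
prediction = weighted_knn_classify(train, (0.5,), k=3)
```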

what are the general steps of decision trees?

decision tree:

1. start with empty tree

2. select most important attribute, and create node

3. create branches for each possible value of that attribute

(for each branch a new decision tree is formed based on the subset of the training set that contains the associated attribute-value combination)

4. recursively repeat until stop condition is met
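A compact sketch of the recursion, assuming categorical attributes, rows stored as dicts, and "most important" measured by the lowest weighted entropy after the split (an ID3-style choice; the notes do not fix the criterion at this point). The stop condition used here is "all labels equal or no attributes left". All names and the toy data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def remainder(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)


def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Stop condition: pure subset or nothing left to split on -> leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)      # step 2: create node
    rest = [a for a in attributes if a != attr]
    branches = {}
    for value in set(r[attr] for r in rows):              # step 3: one branch per value
        subset = [r for r in rows if r[attr] == value]
        branches[value] = build_tree(subset, rest, target)  # step 4: recurse
    return {attr: branches}


rows = [
    {"sunny": True, "windy": True, "play": "yes"},
    {"sunny": True, "windy": False, "play": "yes"},
    {"sunny": False, "windy": True, "play": "no"},
    {"sunny": False, "windy": False, "play": "no"},
]
tree = build_tree(rows, ["windy", "sunny"], "play")
```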

decision trees: explain entropy

entropy:

> the amount of information we need to communicate to describe a series, i.e.

> if all instances are the same, we only need to send minimal information >>> entropy 0

> if instances are evenly distributed between two classes, we need to send maximum information (half of the instances) in order to describe the whole set >>> entropy 1
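A short sketch of the two extremes, assuming the usual Shannon entropy with base-2 logarithm, H = -sum(p_g * log2(p_g)) over the class proportions p_g (the function name `entropy` is illustrative).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


pure = entropy(["a", "a", "a", "a"])    # all instances the same
even = entropy(["a", "a", "b", "b"])    # evenly split over two classes
mixed = entropy(["a", "a", "a", "b"])   # somewhere in between
```

Here `pure` equals 0 and `even` equals 1, matching the two cases above; `mixed` lies strictly between them.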

how to make split decisions in decision trees?

the goal: leaves that cover a set of instances of the same class (low entropy)

1. we start with whole dataset, which has high entropy

2. we want to select attributes for our nodes that split the data into subsets with the lowest possible entropy

3. decide based on information gain:

> compare original entropy before split with weighted entropy of the subsets after split
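A sketch of this comparison, assuming rows stored as dicts with categorical attributes and Shannon entropy in bits (function names and the toy data are illustrative). The gain is the entropy before the split minus the size-weighted entropy of the subsets after it.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus weighted entropy of the subsets after it."""
    before = entropy([r[target] for r in rows])
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attribute]].append(r[target])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return before - after


rows = [
    {"sunny": True, "windy": True, "play": "yes"},
    {"sunny": True, "windy": False, "play": "yes"},
    {"sunny": False, "windy": True, "play": "no"},
    {"sunny": False, "windy": False, "play": "no"},
]
gain_sunny = information_gain(rows, "sunny", "play")  # splits into two pure subsets
gain_windy = information_gain(rows, "windy", "play")  # subsets are as mixed as before
```

In this toy set `sunny` would be chosen for the node, since it removes all entropy while `windy` removes none.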

NB: what is the prior probability?

prior probability:

> probability of observing class g within dataset

> easy: number of observations of class g divided by N

what is the naive assumption of naive bayes?

NB: in order to calculate the probability of observing x given class g we multiply the conditional probabilities of individual attributes

> this assumes conditional independence between attributes

> does not always hold, as attributes might be correlated
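A minimal sketch for categorical attributes, assuming rows stored as dicts and plain frequency counts without any smoothing (so an unseen attribute value gives probability 0 for that class). The names `train_nb` and `predict_nb` and the toy data are illustrative. The score per class is the prior times the product of the per-attribute conditional probabilities, which is exactly where the independence assumption enters.

```python
from collections import Counter, defaultdict


def train_nb(rows, target):
    """Estimate priors P(g) and per-attribute counts for P(value | g)."""
    n = len(rows)
    class_counts = Counter(r[target] for r in rows)
    priors = {g: c / n for g, c in class_counts.items()}  # count of class g / N
    cond = defaultdict(Counter)  # (class, attribute) -> counts of values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond[(r[target], attr)][value] += 1
    return priors, cond, class_counts


def predict_nb(model, x):
    priors, cond, class_counts = model
    scores = {}
    for g, prior in priors.items():
        score = prior
        for attr, value in x.items():
            # Naive step: multiply per-attribute probabilities as if independent
            score *= cond[(g, attr)][value] / class_counts[g]
        scores[g] = score
    return max(scores, key=scores.get), scores


rows = [
    {"sunny": True, "windy": False, "play": "yes"},
    {"sunny": True, "windy": True, "play": "yes"},
    {"sunny": True, "windy": False, "play": "yes"},
    {"sunny": False, "windy": True, "play": "no"},
    {"sunny": False, "windy": False, "play": "no"},
]
model = train_nb(rows, "play")
label, scores = predict_nb(model, {"sunny": True, "windy": True})
```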

what's an advantage of NB?

advantage:

> missing values can simply be ignored: the factor of an attribute without an observed value is left out of the product, so a missing value does not force the class probability to zero

what are two main causes for weak prediction performance?

solution?

1. expressive power of the models might be insufficient

2. training data is too limited

>>> solution: ensembles!

what are the two main ensemble methods?

bagging: aim at reducing variance

boosting: aim at reducing bias

explain how bagging works

bagging: bootstrap aggregation

> draw multiple samples from the dataset (with replacement)

> generate a model for each sample

> aggregate output over all models, e.g. majority vote (mean in case of regression)
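A sketch of these three steps for classification, assuming a 1-nearest-neighbour base model on 1-D inputs (any base learner could be swapped in) and a fixed random seed for reproducibility. The function names and the toy data are illustrative.

```python
import random
from collections import Counter


def fit_1nn(sample):
    """Base model: predict the label of the closest point in the sample."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict


def bagging(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement
        sample = rng.choices(data, k=len(data))
        models.append(fit_1nn(sample))

    def predict(x):
        # Aggregate: majority vote over all models
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict


# Toy data: label 0 for x < 5, label 1 otherwise
data = [(x, 0 if x < 5 else 1) for x in range(10)]
ensemble = bagging(data)
low, high = ensemble(1), ensemble(8)
```

For regression the vote would be replaced by the mean of the model outputs.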

what does bagging avoid?

overfitting

explain the major steps of boosting

boosting: iteratively create models that focus on areas where mistakes are being made

1. build initial model on whole dataset

2. evaluate performance and form a new training set X2 which weights cases where the initial model made mistakes more heavily

3. repeat m times

4. now we have m models, from which we can aggregate our predictions
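A sketch of these steps using an AdaBoost-style scheme, which is one specific boosting algorithm and an assumption here (the notes describe boosting generically). Base models are threshold stumps on 1-D inputs, labels are -1/+1, and the reweighted training set is represented by per-instance weights. All names and the toy data are illustrative.

```python
import math


def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best threshold stump."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, m=3):
    weights = [1.0 / len(xs)] * len(xs)  # step 1: whole dataset, equal weights
    ensemble = []
    for _ in range(m):                   # step 3: repeat m times
        err, thr, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Step 2: weight misclassified cases more heavily, then renormalize
        weights = [w * math.exp(-alpha * y * (sign if x >= thr else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    # Step 4: aggregate the m models with a weighted vote
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]  # no single stump can fit this pattern
model = adaboost(xs, ys, m=3)
predictions = [predict(model, x) for x in xs]
```

A single stump misclassifies at least two of the six points here, while the three combined stumps reproduce all labels, which illustrates the bias reduction.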

data stream mining: two different solutions

1. data based solutions

> building models on a subset of the full dataset

2. task based solutions

> focus on changing the algorithm to make it more efficient

name 3 feature selection approaches

1. Pearson's correlation

2. forward selection

> start with empty set of attributes and iteratively add the attribute that increases performance most

3. backward selection

> start with set of all attributes and iteratively remove the attribute that decreases performance least
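A sketch of forward selection, assuming a caller-supplied `score` function that evaluates a candidate attribute subset (in practice this would be a model's validation performance). The toy score, the attribute names, and the stop rule "stop when no attribute improves the score" are hypothetical. Backward selection is the mirror image: start from the full set and repeatedly remove the attribute whose removal hurts the score least.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that increases the score most."""
    selected = []
    best_score = float("-inf")
    remaining = list(attributes)
    while remaining:
        # Evaluate every remaining attribute as the next addition
        candidate, cand_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best_score:
            break  # no attribute improves performance any further
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected, best_score


# Hypothetical stand-in for model performance: usefulness minus a complexity penalty
usefulness = {"hr": 0.5, "steps": 0.3, "noise": 0.01}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.05 * len(subset)


selected, final_score = forward_selection(usefulness, toy_score)
```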