Final Exam Past Exams Flashcards
(99 cards)
How is Occam’s Razor applied to Machine Learning?
If you have two machine learning models with comparable performance, the simpler one is the better choice.
How many parameters does this model have?

d + 1
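Presumably the missing figure showed a linear model; under that assumption the count is one weight per feature plus a bias:

```latex
\hat{y} = w_1 x_1 + \dots + w_d x_d + b
\quad\Rightarrow\quad
d \text{ weights} + 1 \text{ bias} = d + 1 \text{ parameters}
```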
What is the difference between feature selection and feature extraction?
Feature selection is the selection of a subset of the original features for building a model.
Feature extraction creates new features from functions of the original features, whereas feature selection only returns a subset of them.
Describe PCA
PCA projects the data onto a lower-dimensional space spanned by the principal components. The principal components are the directions along which the data has the highest share of the variance.
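A minimal numpy sketch of PCA via eigendecomposition of the covariance matrix (illustrative, not tied to any exam dataset):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort directions by variance, descending
    components = eigvecs[:, order[:k]]      # top-k principal directions
    return Xc @ components                  # projected data (n x k)
```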
What dimensionality reduction technique works the best?

LDA (red) works best because it produces the best class separability.
How does the k-means clustering algorithm work and what is the “solution” that it produces?
k-means works by randomly initialising k centroids. Each data point is assigned to its nearest centroid, then each centroid is updated to the mean of the data points assigned to it. These two steps repeat until there is convergence. The “solution” it produces is the final assignment of every data point to a cluster, together with the k centroid locations.
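A minimal numpy sketch of that loop (it assumes no cluster ever empties out):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initialisation
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stop moving
            break
        centroids = new_centroids
    return labels, centroids  # the "solution": assignments + centroid locations
```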
How would you assess if k-means clustering has worked properly?
If the centroids no longer move between iterations.
How would you assess if k-means has converged?
If all the data points keep the same cluster assignment in successive iterations, there’s convergence.
How do you decide how many base learners when using bagging?
Enough base learners to reduce the variance of the ensemble: once adding more learners no longer reduces the variance, that number is the optimum.
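A toy numpy illustration (all numbers invented): the variance of the bagged average falls as learners are added, then plateaus once only the error shared between learners remains; the point where it stops falling is a sensible number of base learners.

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.standard_normal(10_000)                 # error component every learner shares
for n_learners in (1, 5, 25, 100, 400):
    own = rng.standard_normal((10_000, n_learners))  # each learner's individual error
    preds = 0.5 * shared[:, None] + 0.5 * own        # correlated base learners
    bagged = preds.mean(axis=1)                      # bagging = average the learners
    print(n_learners, round(bagged.var(), 3))        # falls, then plateaus near 0.25
```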
What is the misclassification error of this dataset?

The sum of all off-diagonal elements divided by the total number of data points.
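With a hypothetical confusion matrix C (rows = true class, columns = predicted; not the matrix from the exam figure):

```python
import numpy as np

C = np.array([[4, 1, 0],    # hypothetical confusion matrix,
              [0, 3, 1],    # NOT the one from the exam figure
              [1, 1, 13]])

error = (C.sum() - np.trace(C)) / C.sum()  # off-diagonal total / grand total
print(error)                               # 4 / 24 ≈ 0.167
```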
Explain these models in terms of overfitting/underfitting.
Top left - degree 1
Top right - degree 2
Bottom left - degree 10
Bottom right - degree 25

The top-left model is underfitting because no matter how much training data is added, its performance doesn’t improve.
The bottom models are overfitting because their test error is significantly higher than their training error.
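A sketch of the same experiment with numpy’s polynomial fitting on synthetic data (the degrees are the ones listed above; everything else is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(30)

for degree in (1, 2, 10, 25):
    coeffs = np.polyfit(x_train, y_train, degree)  # may warn: high degrees are ill-conditioned
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)  # overfit degrees: tiny train error, large test error
```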
What’s the purpose of the validation set?
The validation set is a set of examples, held out from training, used to tune the hyperparameters of a classifier.
One commonly used learning algorithm for linear discriminant models and MLP is Gradient Descent. What’s the basic idea behind gradient descent?
Find the function parameters (coefficients) that minimize a cost function as far as possible, by repeatedly stepping the parameters in the direction of the negative gradient of the cost.
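A minimal sketch for a least-squares cost; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=1000):
    """Minimise a mean-squared-error cost by stepping against its gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE cost
        w -= lr * grad                         # move opposite the gradient
    return w
```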
In MLP, why are sigmoid functions used instead of hard-step functions?
Hard-step functions aren’t continuous (and hence not differentiable), whereas sigmoids are smooth, which gradient-based training requires.
The sigmoid is especially useful for models where we have to predict a probability as an output: since a probability only exists in the range 0 to 1, it is used instead of a hard-step function.
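For reference, the logistic sigmoid and its derivative; the smoothness is exactly what backpropagation needs:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)
```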
In MLP what is the role of weight and bias?
A weight represents the strength of the connection between units: it decides how much influence an input node has on the output.
A bias lets a node produce a non-zero activation even when its weighted input is zero, which makes for a more flexible MLP model.
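A one-line sketch of both roles in a single node, using the sigmoid above (names are illustrative):

```python
import numpy as np

def node_output(x, w, b):
    # w scales each input's influence on this node; b shifts the activation,
    # so the node can still fire even when the weighted input sum is zero
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))  # sigmoid(w·x + b)
```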
Bayesian inference is a general alternative to maximum likelihood estimation that can be used to train a variety of models given data. Explain the main idea of Bayesian inference and compare with MLE. Your answer should mention the prior and posterior distributions over model parameters.
Bayesian inference places a prior distribution over the model parameters and combines it with the likelihood of the data, via Bayes’ rule, to obtain a posterior distribution over the parameters; estimates and predictions are then based on that posterior.
MLE just estimates the single parameter value that maximizes the likelihood function, ignoring any prior.
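In symbols, with parameters θ and data D (standard Bayes’ rule, not exam-specific notation):

```latex
\underbrace{p(\theta \mid D)}_{\text{posterior}}
  = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}}\;
          \overbrace{p(\theta)}^{\text{prior}}}{p(D)},
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)
```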
If d = 10, how many parameters would a sixth-degree polynomial have compared to the linear model?

61
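Assuming one term per feature per degree and no cross terms (which matches the d + 1 count for the linear case), the arithmetic is:

```latex
k \cdot d + 1 \;=\; 6 \cdot 10 + 1 \;=\; 61,
\qquad\text{vs.}\qquad
d + 1 = 11 \ \text{for the linear model.}
```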
What is a hyper parameter in the context of Bayesian inference? Give an example.
The parameters of the prior distribution are the hyperparameters in the context of Bayesian inference; for example, the mean and variance of a Gaussian prior over the model weights.
In machine learning, what is known as “generalization”?
Generalization is how accurately a trained model classifies new, unseen data. An overfit model doesn’t generalize well.
You are given a 5-dimensional dataset. After doing PCA, you discover that the 4th and 5th features have zero eigenvalues. What should you do?
They can be removed as they don’t contribute to the variance.
What’s an expression for the percentage of the variance captured by the first principal component, where the eigenvalues of the covariance matrix of the data are lambda1 and lambda2?
lambda1 / (lambda1 + lambda2), multiplied by 100 to express it as a percentage
What is the total number of data points in this training set? How?

Sum every entry of the matrix (across all rows and columns).
24
How many data points do we have in each class?

Sum the rows:
A: 5
B: 4
C: 15
What is the sum of the diagonal values in a confusion matrix?
The number of correctly classified data points.
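For completeness, the last three cards on the hypothetical matrix from the misclassification example above (its row sums were chosen to match the A/B/C counts 5, 4, 15):

```python
import numpy as np

C = np.array([[4, 1, 0],    # same hypothetical matrix as above;
              [0, 3, 1],    # rows = true classes A, B, C
              [1, 1, 13]])

print(C.sum())         # total data points: 24
print(C.sum(axis=1))   # per-class counts (row sums): [ 5  4 15]
print(np.trace(C))     # sum of the diagonal = correctly classified points: 20
```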