Final Exam Past Exams Flashcards
(99 cards)
How is Occam’s Razor applied to Machine Learning?
If you have two machine learning models with comparable performance, the simpler one is the better choice.
How many parameters does this model have?

d + 1
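Presumably the missing figure showed a linear model; under that assumption the count is one weight per feature plus a bias:

```latex
\hat{y} = w_1 x_1 + \dots + w_d x_d + b
\quad\Rightarrow\quad
d \text{ weights} + 1 \text{ bias} = d + 1 \text{ parameters}
```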
What is the difference between feature selection and feature extraction?
Feature selection is the selection of a subset of the original features for building a model.
Feature extraction creates new features from functions of the original features, whereas feature selection only returns a subset of them.
Describe PCA
PCA projects the data onto a lower-dimensional space spanned by the principal components. The principal components are the directions along which the data has the highest share of the variance.
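A minimal numpy sketch of PCA via eigendecomposition of the covariance matrix (illustrative, not tied to any exam dataset):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort directions by variance, descending
    components = eigvecs[:, order[:k]]      # top-k principal directions
    return Xc @ components                  # projected data (n x k)
```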
What dimensionality reduction technique works the best?

LDA (red) works best because it produces the best class separability.
How does the k-means clustering algorithm work and what is the “solution” that it produces?
k-means works by randomly initialising k centroids. Each data point is assigned to its nearest centroid, then each centroid is updated to the mean of the data points assigned to it. These two steps repeat until there is convergence. The “solution” it produces is the final assignment of every data point to a cluster, together with the k centroid locations.
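A minimal numpy sketch of that loop (it assumes no cluster ever empties out):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initialisation
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stop moving
            break
        centroids = new_centroids
    return labels, centroids  # the "solution": assignments + centroid locations
```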
How would you assess if k-means clustering has worked properly?
If the centroids no longer move between iterations.
How would you assess if k-means has converged?
If all the data points keep the same cluster assignment in successive iterations, there’s convergence.
How do you decide how many base learners when using bagging?
Enough base learners to reduce the variance of the ensemble: once adding more learners no longer reduces the variance, that number is the optimum.
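A toy numpy illustration (all numbers invented): the variance of the bagged average falls as learners are added, then plateaus once only the error shared between learners remains; the point where it stops falling is a sensible number of base learners.

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.standard_normal(10_000)                 # error component every learner shares
for n_learners in (1, 5, 25, 100, 400):
    own = rng.standard_normal((10_000, n_learners))  # each learner's individual error
    preds = 0.5 * shared[:, None] + 0.5 * own        # correlated base learners
    bagged = preds.mean(axis=1)                      # bagging = average the learners
    print(n_learners, round(bagged.var(), 3))        # falls, then plateaus near 0.25
```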
What is the misclassification error of this dataset?

The sum of all off-diagonal elements divided by the total number of data points.
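With a hypothetical confusion matrix C (rows = true class, columns = predicted; not the matrix from the exam figure):

```python
import numpy as np

C = np.array([[4, 1, 0],    # hypothetical confusion matrix,
              [0, 3, 1],    # NOT the one from the exam figure
              [1, 1, 13]])

error = (C.sum() - np.trace(C)) / C.sum()  # off-diagonal total / grand total
print(error)                               # 4 / 24 ≈ 0.167
```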
Explain these models in terms of overfitting/underfitting.
Top left - degree 1
Top right - degree 2
Bottom left - degree 10
Bottom right - degree 25

The top-left model is underfitting because no matter how much training data is added, its performance doesn’t improve.
The bottom models are overfitting because their test error is significantly higher than their training error.
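A sketch of the same experiment with numpy’s polynomial fitting on synthetic data (the degrees are the ones listed above; everything else is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(30)

for degree in (1, 2, 10, 25):
    coeffs = np.polyfit(x_train, y_train, degree)  # may warn: high degrees are ill-conditioned
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)  # overfit degrees: tiny train error, large test error
```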
What’s the purpose of the validation set?
The validation set is a set of examples, held out from training, used to tune the hyperparameters of a classifier.
One commonly used learning algorithm for linear discriminant models and MLP is Gradient Descent. What’s the basic idea behind gradient descent?
Find the function parameters (coefficients) that minimize a cost function as far as possible, by repeatedly stepping the parameters in the direction of the negative gradient of the cost.
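A minimal sketch for a least-squares cost; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=1000):
    """Minimise a mean-squared-error cost by stepping against its gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE cost
        w -= lr * grad                         # move opposite the gradient
    return w
```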
In MLP, why are sigmoid functions used instead of hard-step functions?
Hard-step functions aren’t continuous (and hence not differentiable), whereas sigmoids are smooth, which gradient-based training requires.
The sigmoid is especially useful for models where we have to predict a probability as an output: since a probability only exists in the range 0 to 1, it is used instead of a hard-step function.
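For reference, the logistic sigmoid and its derivative; the smoothness is exactly what backpropagation needs:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)
```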
In MLP what is the role of weight and bias?
A weight represents the strength of the connection between units: it decides how much influence an input node has on the output.
A bias lets a node produce a non-zero activation even when its weighted input is zero, which makes for a more flexible MLP model.
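A one-line sketch of both roles in a single node, using the sigmoid above (names are illustrative):

```python
import numpy as np

def node_output(x, w, b):
    # w scales each input's influence on this node; b shifts the activation,
    # so the node can still fire even when the weighted input sum is zero
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))  # sigmoid(w·x + b)
```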
Bayesian inference is a general alternative to maximum likelihood estimation that can be used to train a variety of models given data. Explain the main idea of Bayesian inference and compare with MLE. Your answer should mention the prior and posterior distributions over model parameters.
Bayesian inference places a prior distribution over the model parameters and combines it with the likelihood of the data, via Bayes’ rule, to obtain a posterior distribution over the parameters; estimates and predictions are then based on that posterior.
MLE just estimates the single parameter value that maximizes the likelihood function, ignoring any prior.
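In symbols, with parameters θ and data D (standard Bayes’ rule, not exam-specific notation):

```latex
\underbrace{p(\theta \mid D)}_{\text{posterior}}
  = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}}\;
          \overbrace{p(\theta)}^{\text{prior}}}{p(D)},
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)
```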
If d = 10, how many parameters would a sixth-degree polynomial have compared to the linear model?

61
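Assuming one term per feature per degree and no cross terms (which matches the d + 1 count for the linear case), the arithmetic is:

```latex
k \cdot d + 1 \;=\; 6 \cdot 10 + 1 \;=\; 61,
\qquad\text{vs.}\qquad
d + 1 = 11 \ \text{for the linear model.}
```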
What is a hyper parameter in the context of Bayesian inference? Give an example.
The parameters of the prior distribution are the hyperparameters in the context of Bayesian inference; for example, the mean and variance of a Gaussian prior over the model weights.
In machine learning, what is known as “generalization”?
Generalization is how accurately a trained model classifies new, unseen data. An overfit model doesn’t generalize well.
You are given a 5-dimensional dataset. After doing PCA, you discover that the 4th and 5th features have zero eigenvalues. What should you do?
They can be removed as they don’t contribute to the variance.
What’s an expression for the percentage of the variance captured by the first principal component, where the eigenvalues of the covariance matrix of the data are lambda1 and lambda2?
lambda1 / (lambda1 + lambda2), multiplied by 100 to express it as a percentage
What is the total number of data points in this training set? How?

Sum every entry of the matrix (across all rows and columns).
24
How many data points do we have in each class?

Sum the rows:
A: 5
B: 4
C: 15
What is the sum of the diagonal values in a confusion matrix?
The number of correctly classified data points.
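For completeness, the last three cards on the hypothetical matrix from the misclassification example above (its row sums were chosen to match the A/B/C counts 5, 4, 15):

```python
import numpy as np

C = np.array([[4, 1, 0],    # same hypothetical matrix as above;
              [0, 3, 1],    # rows = true classes A, B, C
              [1, 1, 13]])

print(C.sum())         # total data points: 24
print(C.sum(axis=1))   # per-class counts (row sums): [ 5  4 15]
print(np.trace(C))     # sum of the diagonal = correctly classified points: 20
```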