Intro Flashcards

(86 cards)

1
Q

Why is machine learning popular?

A

-Lots of data is available
-Current control theory methods struggle to solve large-scale, complex problems

2
Q

What are the types of supervised learning?

A

regression and classification

3
Q

what are the types of unsupervised learning?

A

clustering and dimensionality reduction

4
Q

what are the types of reinforcement learning?

A

Value iteration and policy iteration

5
Q

What is supervised learning?

A

Learning a function that maps an input to an output from labelled example input-output pairs

6
Q

what is unsupervised learning?

A

An algorithm that learns patterns from unlabelled data

7
Q

What is the key difference between regression and classification?

A

In regression the output (target) data is continuous, whereas in classification it is discrete

8
Q

How does regression work?

A

find a function that minimises a cost function (most often mean squared error)
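
As a minimal sketch of what "minimising a cost function" means, the snippet below scores two candidate models by mean squared error and keeps the one with the lower cost. The data and the two candidate lines are made up for illustration; a real regression searches over all parameter values rather than two fixed candidates.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # illustrative data generated from y = 1 + 2x

# Two hypothetical candidate models; regression keeps the one with the lower cost.
candidates = {
    "y = 1 + 2x": lambda x: 1.0 + 2.0 * x,
    "y = 3x": lambda x: 3.0 * x,
}
costs = {
    name: mean_squared_error(ys, [model(x) for x in xs])
    for name, model in candidates.items()
}
best = min(costs, key=costs.get)
print(costs, best)
```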

9
Q

Describe a nearest neighbour model

A

Each individual data point is grouped according to its proximity to the known data points
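
A rough sketch of the idea, assuming one-dimensional inputs and the simplest variant (a single nearest neighbour). The function name and the data are illustrative only.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    distances = [abs(x - xi) for xi in train_x]
    return train_y[distances.index(min(distances))]


# Illustrative training data: two groups on the number line.
train_x = [1.0, 2.0, 8.0, 9.0]
train_y = ["low", "low", "high", "high"]

print(nearest_neighbour_predict(train_x, train_y, 2.5))
print(nearest_neighbour_predict(train_x, train_y, 7.0))
```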

10
Q

Describe a piecewise linear model

A

A model in which the data follows a different linear trend over each region of the data

11
Q

What are some model types?

A

Linear, low order polynomial, high order polynomial, piecewise linear, nearest neighbour

12
Q

When does overfitting occur?

A
  • when a model fits the data set too well and is unable to generalise
  • low density of data
13
Q

What is a characteristic of overfitting?

A

oversensitivity to measurement noise

14
Q

How can overfitting be avoided?

A

do not use a model that is more complicated than required (Occam’s razor)

15
Q

What is a white box model?

A

-increased system information
-low model uncertainty

16
Q

What is a black box model?

A

-decreased system information
-high model uncertainty

17
Q

what is inference?

A

The process in which a prediction is made

18
Q

What does the expected mean square error of the prediction depend on?

A

bias and variance
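
For reference, the standard decomposition behind this answer can be written as below. The notation is an assumption, not taken from the cards: the data is taken to follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the fitted model. The σ² term is the irreducible noise, which no choice of model removes.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```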

19
Q

what is meant by high bias?

A

model fails to capture the underlying structure of the data (underfitting)

20
Q

what is meant by high variance?

A

model is sensitive to small fluctuations in the data (overfitting)

21
Q

when is variance high?

A

in complex models

22
Q

what is the bias-variance trade-off?

A

If bias is increased then variance decreases, and vice versa. Both cannot be minimised at once, so the model must balance bias and variance to minimise the total error.

23
Q

what is meant by error?

A

The difference between the true value and the predicted value

24
Q

what happens in simple linear regression?

A

identify a line of best fit y=a0+a1x+err, where a0 and a1 need to be determined

25
How do you find the a0 and a1 that minimise the sum of squared residuals?
- Find the stationary point by taking partial derivatives of the sum of squared residuals with respect to a0 and a1
- Set them to zero and solve the resulting optimization problem
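
Setting the two partial derivatives to zero gives a closed-form solution for the line of best fit. The sketch below uses that standard result (a1 = S_xy / S_xx, a0 = mean(y) - a1 * mean(x)); the function name and the data are illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a0 + a1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Both sums come from setting the partial derivatives of the
    # sum of squared residuals to zero.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a1 = s_xy / s_xx
    a0 = y_mean - a1 * x_mean
    return a0, a1


# Illustrative noisy data scattered around y = 1 + 2x.
a0, a1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
print(a0, a1)
```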
26
What is the ordinary least squares (OLS) method?
Appropriately select X, Y and theta, then solve theta = (X^T X)^-1 X^T Y
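
A minimal pure-Python sketch of the OLS solution, fitting a quadratic (a model that is linear in its parameters). It solves the system (X^T X) theta = X^T Y by Gaussian elimination instead of forming the inverse explicitly, which gives the same theta. The helper names and the data are illustrative.

```python
def solve(a, b):
    """Solve a @ x = b for square a by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (not invertible)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    columns = list(zip(*X))  # columns of X are the rows of X^T
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[1.0, x, x * x] for x in xs]  # design matrix for y = t0 + t1*x + t2*x^2
y = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]  # illustrative noise-free data
theta = ols(X, y)
print(theta)
```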
27
What is the X matrix known as?
The design matrix or regressor matrix
28
When does the OLS method not work for linear regression?
- If X^T X is not invertible, the problem cannot be solved
- X^T X is not invertible if the OLS problem has non-unique solutions
29
What is meant by collinearity?
Two sequences of data are said to be collinear if there exists k ≠ 0 such that x1i = k x2i for all i
30
What occurs in the OLS if there is a pair of feature data sequences that are collinear?
The associated OLS problem has infinitely many optimal solutions
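
To see why collinear features break OLS, the sketch below computes the determinant of the 2x2 matrix X^T X for a design matrix with two feature columns. With collinear columns the determinant is exactly zero, so X^T X cannot be inverted. The function name and data are illustrative.

```python
def gram_determinant(col_a, col_b):
    """Determinant of X^T X for a design matrix X with columns col_a and col_b."""
    s_aa = sum(a * a for a in col_a)
    s_ab = sum(a * b for a, b in zip(col_a, col_b))
    s_bb = sum(b * b for b in col_b)
    return s_aa * s_bb - s_ab * s_ab


x1 = [1, 2, 3, 4]
x2 = [2 * v for v in x1]  # collinear with x1 (k = 2)
x3 = [1, 0, 1, 0]         # not collinear with x1

print(gram_determinant(x1, x2))  # zero: X^T X is not invertible
print(gram_determinant(x1, x3))  # non-zero: X^T X is invertible
```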
31
When does collinearity occur?
When two feature variables are highly correlated and therefore provide redundant information
32
How can you deal with collinearity in data?
- Increase the amount of training data
- Find and remove highly correlated data
33
What are some issues with the OLS method?
- Computing the inverse of X^T X can be computationally expensive
- If the data is close to being collinear, the OLS solution becomes very sensitive to small changes in the training data set
34
Which models can be fit using the OLS method?
those linear in parameters
35
How can the goodness of fit of a regression model be assessed?
Using R^2 coefficient
36
What does an R^2 value approximately equal to 1 indicate?
The sum of squared errors is small, so the model is a good fit
37
What does a negative R^2 value mean?
A very bad model: it fits the data worse than simply predicting the mean of the data
38
What does a small positive R^2 value indicate?
A bad model: it is only slightly better than predicting the mean of the data
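
The three R^2 cases above can be checked numerically. This sketch assumes the usual definition R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the data; the data values are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])     # predictions match the data
mean_only = r_squared(y, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])       # worse than predicting the mean
print(perfect, mean_only, worse)
```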
39
What is the Weierstrass Approximation Theorem?
For any continuous function f on a closed interval [a, b] and any E > 0, there exists a polynomial p such that sup over x in [a, b] of |f(x) - p(x)| < E
40
What is regularization used for?
- Prevent overfitting to the training data
- Remove user choice from a model
41
Why do we use regularization?
- Increasing model parameters will fit the training data more accurately, but unnecessary terms can cause overfitting.
42
What is classification?
Supervised learning where the output data is discrete (class labels)
43
What does a Bayes classifier do differently?
It constructs a probability distribution instead of a model
44
What is a perceptron?
An algorithm that describes the classification rules for hyperplanes in supervised learning of binary classifiers. The simplest Neural Network
45
What issues arise in non-linear regressions?
There is no unique solution, so we settle for approximate solutions (e.g. found with the Newton-Raphson method)
46
What does gradient descent do?
Iteratively identifies a local minimum of the cost function; it can be used to solve the OLS problem
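
A minimal sketch of gradient descent applied to the line-fitting problem y = a0 + a1*x with a mean squared error cost. The learning rate, step count, starting point and data are illustrative assumptions; too large a learning rate would make the iteration diverge.

```python
def gradient_descent_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = a0 + a1 * x by gradient descent on the mean squared error."""
    a0, a1 = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        residuals = [a0 + a1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to a0 and a1.
        grad_a0 = 2.0 * sum(residuals) / n
        grad_a1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step against the gradient.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


# Illustrative data generated from y = 1 + 2x.
a0, a1 = gradient_descent_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a0, a1)
```

Because the MSE of a linear model is convex, the local minimum found here is also the OLS solution.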
47
What are the issues with high dimension feature space?
-Hard to visualise data in large dimensions
-OLS fails
48
How do you find Principal Components?
1. Compute the centred data matrix X~
2. Compute X~^T X~
3. Find the orthonormal eigenvector of X~^T X~ with the largest eigenvalue
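
The three steps can be sketched in pure Python as follows. The eigenvector step uses power iteration, which is a choice made here for illustration and not something the cards specify; it assumes the starting vector is not orthogonal to the leading eigenvector. The function name and data are illustrative.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the unit eigenvector of X~^T X~ with the largest eigenvalue."""
    n = len(points)
    dim = len(points[0])
    # Step 1: centre the data by subtracting the mean of each feature.
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    centred = [[p[d] - means[d] for d in range(dim)] for p in points]
    # Step 2: compute X~^T X~.
    scatter = [
        [sum(row[i] * row[j] for row in centred) for j in range(dim)]
        for i in range(dim)
    ]
    # Step 3: power iteration converges to the leading eigenvector.
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iterations):
        w = [sum(scatter[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Illustrative data lying along the line y = x.
component = first_principal_component([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(component)
```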
49
What are the principal components?
- The orthonormal vector directions with the largest variation in the data
- The orthonormal vectors that define a linear manifold giving minimal reconstruction error
- The orthonormal eigenvectors of X~^T X~ with the largest eigenvalues
- The q columns of W corresponding to the q largest squared singular values, where the singular value decomposition of the regressor matrix is X = UΣW^T
50
What is clustering?
A class of unsupervised learning methods that separates data into groups by similarity
51
What does the Weierstrass Theorem do? (in simpler terms)
provides a guarantee that certain functions can be approximated to arbitrarily high accuracy by a finite degree polynomial provided the function is defined over a finite interval
52
When is a matrix non-invertible?
When its determinant is equal to zero
53
What are the disadvantages of using polynomials in modelling?
- Many coefficients and parameters in high-degree polynomials
- No guarantee of approximating discontinuous functions, e.g. tan(x)
- Slow convergence rates
- Polynomials tend towards infinity, which is unnatural system behaviour
54
How does regularization prevent overfitting?
It penalises unnecessary non-zero parameters to help prevent the model becoming oversensitive to noise in the training data
55
How do we select lambda in regularization?
Randomly sample candidate values to find the optimal lambda; performance is typically evaluated through cross-validation
56
What is the perceptron equation?
f(x)=sgn(w^Tx)
57
What is the structure of the perceptron?
Inputs x, weights w, a sum of the weighted inputs, a step function, then the output
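
The perceptron equation and structure can be sketched as below. The weights are hypothetical hand-picked values (not learned) that make the perceptron behave like a logical AND, with the first input fixed at 1 to act as a bias term. Treating sgn(0) as +1 is a convention assumed here.

```python
def sgn(value):
    """Step function; sgn(0) is taken as +1 by assumption."""
    return 1 if value >= 0 else -1


def perceptron(x, w):
    """f(x) = sgn(w^T x): weight the inputs, sum them, apply the step."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return sgn(weighted_sum)


# Hypothetical weights: bias weight first, then one weight per input.
w = [-1.5, 1.0, 1.0]
outputs = {(a, b): perceptron([1.0, a, b], w) for a in (0, 1) for b in (0, 1)}
print(outputs)
```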
58
How do you form a non-linear decision boundary?
Add more basis functions
59
What is the equation for a support vector machine?
f(x; θ) = θ0 + Σ_{i∈S} θi K(x, xi), where S = {indices of support vectors} and K: ℝ^n × ℝ^n → ℝ is a kernel function
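
A small sketch of evaluating this decision function. The Gaussian kernel, the support vectors and the theta values are all assumptions chosen for illustration; in practice they come from training the support vector machine, which is not shown here.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision(x, theta0, thetas, support_vectors, kernel=gaussian_kernel):
    """f(x; theta) = theta0 + sum over support vectors of theta_i * K(x, x_i)."""
    return theta0 + sum(t * kernel(x, sv) for t, sv in zip(thetas, support_vectors))


# Hypothetical trained values: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
thetas = [-1.0, 1.0]
theta0 = 0.0

near_negative = svm_decision((0.0, 0.0), theta0, thetas, support_vectors)
near_positive = svm_decision((2.0, 2.0), theta0, thetas, support_vectors)
on_boundary = svm_decision((1.0, 1.0), theta0, thetas, support_vectors)
print(near_negative, near_positive, on_boundary)
```

The sign of f(x; θ) gives the predicted class, and points where it is zero lie on the decision boundary.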
60
What are the advantages of linear/logistic regression over support vector machines?
- Can adjust the threshold to shape the true positive rate (TPR) and false positive rate (FPR)
- Gives a probabilistic interpretation
61
What are the advantages of support vector machines over logistic regression?
- Good against noise far away from the true decision boundary
- Perfectly separates the data when possible
62
Why is gradient descent a useful algorithm?
(X^T X)^-1 does not have to be computed, so it is computationally cheaper
63
What do convex functions have?
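A single global minimum value: any local minimum is also a global minimum (the minimising point is unique if the function is strictly convex)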
64
What is the equation for calculating Term Frequency?
TF= number of times the term appears in text / Total number of terms in text
65
What is the equation for Inverse Document Frequency?
IDF = log10(number of documents / number of documents containing the term)
66
What is the equation for TFIDF?
TFIDF = TF x IDF
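
The three formulas translate directly into code. This sketch assumes each document is already split into a list of terms and that the term appears in at least one document (otherwise the IDF divides by zero); the tiny corpus is made up.

```python
import math


def term_frequency(term, text):
    """TF = times the term appears in the text / total number of terms in the text."""
    return text.count(term) / len(text)


def inverse_document_frequency(term, documents):
    """IDF = log10(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)


def tf_idf(term, text, documents):
    return term_frequency(term, text) * inverse_document_frequency(term, documents)


# Illustrative corpus of three tokenised documents.
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

common_score = tf_idf("the", documents[0], documents)  # "the" is in every document
rare_score = tf_idf("sat", documents[0], documents)    # "sat" is in one document only
print(common_score, rare_score)
```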
67
What does a low TFIDF indicate?
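A term that is common across the documents (or infrequent in the text), so it carries little distinguishing information; rare, distinctive words score high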
68
Why do we use PCA?
Because it is computationally expensive to solve OLS for large data sets, so PCA is used to reduce the number of dimensions first
69
What is the issue with large dimension data?
There are many optimal models with MSE = 0. X^T X will typically be non-invertible, so OLS cannot be used
70
Why do we use unsupervised learning?
Most data sets are unlabelled, and it is costly to label data sets
71
What is the Singular Value decomposition of X?
X = UΣW^T, where:
- U is a unitary matrix with U^T U = I
- W is a unitary matrix with W^T W = I
- Σ is a diagonal matrix with non-negative elements ordered from largest to smallest
72
What are the columns of W in SVD?
The eigenvectors of the X^T X matrix
73
What is XTX as a SVD?
X^T X = W Σ^T U^T U Σ W^T = W Σ^T Σ W^T
74
If a data point is equidistant from two cluster centres, how do you choose which one to assign it to?
The one with the lowest index, by convention
75
What is the average dissimilarity in a cluster equation?
(1 / number of elements in the cluster) x the sum of the squared L2 (Euclidean) norms of the differences between each data point and the cluster centre
76
What is the K-Means algorithm?
1. Randomly assign a number from 1 to K to each data point
2. Iterate until the cluster assignments stop changing:
- Compute the centroid of each cluster
- Update each cluster assignment to the closest cluster centre
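
A minimal pure-Python sketch of these steps. Several details are assumptions made for illustration: clusters are numbered from 0, an optional `labels` argument lets the caller fix the initial assignment so the run is repeatable, an empty cluster is re-seeded with a random data point, and ties go to the lowest cluster index.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, labels=None, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Step 1: random initial cluster number for every point (unless supplied).
    labels = [rng.randrange(k) for _ in points] if labels is None else list(labels)
    for _ in range(max_iter):
        # Step 2a: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                members = [rng.choice(points)]  # assumption: re-seed an empty cluster
            dim = len(members[0])
            centroids.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
            )
        # Step 2b: reassign each point to the closest centroid (lowest index on ties).
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Illustrative data: two well-separated groups, deliberately mixed at the start.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, labels=[0, 1, 0, 1, 0, 1])
print(labels, centroids)
```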
77
What are the advantages of the K-Means algorithm?
More computationally efficient than brute force method
78
What are the disadvantages of the K-Means algorithm?
- Must select the number of clusters
- Doesn't necessarily converge to the optimal clusters
- Cannot handle non-convex clusters
79
In ARX models, how do you obtain an unbiased estimate from the least squares solution?
The estimate is unbiased if Psi does not contain any noise
80
How is an AR model displayed?
AR(ny)
81
How is an ARX model displayed?
ARX(ny, nu)
82
How is an ARMAX model displayed?
ARMAX(ny, ne, nu)
83
What does the Moving Average mean in an ARMAX model?
The model is dependent on delayed error/noise.
84
What would applying OLS to an ARMAX model result in?
A biased estimation
85
How would you show ARX model is unbiased?
Take the expectation of the OLS solution, E(theta) = E((X^T X)^-1 X^T Y), and substitute Y = X theta* + e; the result ends up equal to theta*
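
Written out, the substitution goes as follows. The notation is an assumption: θ* is the true parameter vector, θ̂ is the OLS estimate, and the last step assumes the noise e has zero mean and is independent of X.

```latex
\begin{aligned}
\hat{\theta} &= (X^T X)^{-1} X^T Y
              = (X^T X)^{-1} X^T (X\theta^* + e)
              = \theta^* + (X^T X)^{-1} X^T e \\
\mathbb{E}[\hat{\theta}] &= \theta^* + \mathbb{E}\big[(X^T X)^{-1} X^T e\big] = \theta^*
\end{aligned}
```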
86
What is the OLS solution with L2 regularization?
theta = (X^T X + lambda I)^-1 X^T Y
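
For a single feature with no intercept, X^T X and X^T Y are both scalars, so the regularized solution reduces to a simple ratio. The sketch below uses that special case to show how a larger lambda shrinks the parameter towards zero; the function name and data are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """theta = (X^T X + lambda I)^-1 X^T Y for one feature and no intercept."""
    xtx = sum(x * x for x in xs)
    xty = sum(x * y for x, y in zip(xs, ys))
    return xty / (xtx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # illustrative data generated from y = 2x

unregularized = ridge_slope(xs, ys, 0.0)   # lambda = 0 recovers plain OLS
regularized = ridge_slope(xs, ys, 14.0)    # a large lambda shrinks theta
print(unregularized, regularized)
```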