Topic 1: Loss functions Flashcards

1
Q

What is a target

A

The output y of a function

2
Q

What is the basic model of a sampled data point

A

y = f_true(x) + ϵ

ϵ = noise, modelled by a Gaussian distribution
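A minimal sketch of this model, assuming numpy; the choice of f_true and the noise scale are illustrative, not from the card:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Illustrative ground-truth function; the model works for any f_true.
    return np.sin(2 * np.pi * x)

n = 50
x = rng.uniform(0, 1, size=n)        # inputs
eps = rng.normal(0.0, 0.1, size=n)   # Gaussian noise ϵ ~ N(0, 0.1^2)
y = f_true(x) + eps                  # targets y = f_true(x) + ϵ
```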

3
Q

What is k-nn regression

A

When a new input x is observed, predict y by taking the k nearest neighbours to x and averaging their y values
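A minimal 1-D sketch, assuming numpy; for d-dimensional x you would swap in e.g. Euclidean distance:

```python
import numpy as np

def knn_regress(x_new, X_train, y_train, k=3):
    dists = np.abs(X_train - x_new)   # distance to every training point (1-D)
    nearest = np.argsort(dists)[:k]   # indices of the k nearest neighbours
    return y_train[nearest].mean()    # average their y values

X_train = np.array([0.1, 0.4, 0.5, 0.9])
y_train = np.array([1.0, 2.0, 2.2, 3.5])
print(knn_regress(0.45, X_train, y_train, k=2))  # averages y at 0.4 and 0.5 -> 2.1
```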

4
Q

What is an instance-based algorithm

A

An algorithm that stores the training samples and uses them directly at prediction time, rather than learning an explicit model, e.g. k-NN
Not good at generalising beyond the observed data

5
Q

What is “fine-tuned” the same as

A

Overfitting: a model that is fine-tuned to the training data is overly complex

6
Q

What are the 6 main types of function approximations

A

linear/polynomial regression
support vector machines
neural networks (CNNs, logistic regression)
naive Bayes (probabilistic models)
decision trees (for both regression and classification)
ensemble models

7
Q

How do the 6 main types of function approximations work

A

By minimising a loss function over the training data

8
Q

What are the properties of overfitting

A

high accuracy on training data
captures noise
high testing errors

9
Q

What are the properties of underfitting

A

too simple
high training and testing errors

10
Q

What does x ∈ R^d mean

A

x is a real-valued feature vector of length d

11
Q

What is the common loss function used for classification problems

A

Cross entropy loss

12
Q

What is the formula for binary CE loss

A

l(y, f(x)) = -[y ln f(x) + (1 - y) ln(1 - f(x))]

where y ∈ {0,1} and f(x) ∈ (0,1)
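A direct translation of the formula, assuming numpy; the clipping is a common numerical safeguard, not part of the formula on the card:

```python
import numpy as np

def binary_cross_entropy(y, p):
    # y ∈ {0,1}; p = f(x) ∈ (0,1)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # keep p away from 0 and 1 to avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # small loss: confident, correct prediction
print(binary_cross_entropy(1, 0.1))  # large loss: confident, wrong prediction
```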

13
Q

What is the common loss function used for regression problems

A

Squared loss function

14
Q

What is the formula for squared loss function

A

l(y, f(x)) = (y - f(x))^2

15
Q

Where does the true label y always go

A

First, before the model prediction f(x)
e.g. (y - f(x))

16
Q

What is the formula for squared loss training error

A

l_train(f) = (1/n) Σ_i (y_i - f(x_i))^2
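A minimal sketch combining the per-sample squared loss (card 14) with the training error above, assuming numpy; the sample values are illustrative:

```python
import numpy as np

def squared_loss(y, f_x):
    return (y - f_x) ** 2                 # per-sample loss: (y - f(x))^2

def l_train(y, f_x):
    return np.mean(squared_loss(y, f_x))  # average over the n training samples

y = np.array([1.0, 2.0, 3.0])
f_x = np.array([1.1, 1.8, 3.2])
print(l_train(y, f_x))  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```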

17
Q

What does an asterisk (*) denote

A

Optimal solution or best value

18
Q

What is the general equation for w*

A

w* = argmin_w [l_train(f)]

19
Q

What is a probability simplex

A

A geometric object that represents all possible probability distributions over a finite set of outcomes
E.g. for a classification problem with 3 classes,
a triangle where each vertex corresponds to one class
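One common way to produce a point on the simplex is a softmax over class scores; a sketch assuming numpy, with illustrative logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 0.5, -1.0]))
# p lies on the 3-class probability simplex:
print(p, p.sum())  # all entries in (0,1), summing to 1
```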

20
Q

What does ∈ (0,1) mean

A

The variable can take any value in this open interval, but not 0 or 1 themselves

21
Q

what does ∈ {0,1} mean

A

The variable takes only the values 0 or 1; {0,1} is a two-element set, not an interval

22
Q

What is the purpose of gradient descent

A

To optimise complex models by iteratively minimising the loss function

23
Q

Theoretically what parameter would we like to adjust to achieve global minimum

A

w_j
It is too computationally expensive to plot the loss surface manually,
so instead we use iterative methods

24
Q

What is the key principle of gradient descent when gradient is negative

A

increase the parameter w_j

25
Q

What is the key principle of gradient descent when gradient is positive

A

decrease the parameter w_j

26
Q

What does nabla ∇ denote

A

gradient

27
Q

What is the update rule

A

w ← w - η · ∇l(y, f(x))
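A single update step, sketched for a linear model f(x) = w·x under squared loss; the sample values, parameter, and learning rate are illustrative assumptions:

```python
# One gradient-descent step on l = (y - w*x)^2 for a linear model f(x) = w * x
x, y = 2.0, 3.0  # one training sample (illustrative)
w = 0.5          # current parameter
eta = 0.1        # learning rate η

grad = -2 * (y - w * x) * x  # dl/dw; here -8, i.e. negative
w = w - eta * grad           # w ← w - η · ∇l, so w increases to 1.3
```

Note how a negative gradient makes the parameter increase, matching cards 24 and 25.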

28
Q

What is η in the update rule

A

The learning rate (step size of algorithm)

29
Q

What is full batch gradient descent

A

The entire dataset is used to compute the gradient of the loss function
w ← w - η · (1/n) Σ_i [∇l(y_i, f(x_i))]

30
Q

What is mini batch gradient descent

A

Randomly pick a sample S of m datapoints
w ← w - η · (1/m) Σ_{i∈S} [∇l(y_i, f(x_i))]

31
Q

What is stochastic gradient descent

A

The extreme of mini-batch where m = 1
(the term is sometimes also used for small m > 1)
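All three variants differ only in how many samples the gradient is averaged over; a sketch assuming numpy and, for illustration, the per-sample gradient of squared loss for a linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_i(w, x, y):
    # Per-sample gradient of (y - w*x)^2 w.r.t. w (linear model, illustrative).
    return -2 * (y - w * x) * x

def gd_step(w, X, Y, eta, m=None):
    # m=None: full batch; m=1: stochastic (SGD); 1 < m < n: mini-batch.
    n = len(X)
    idx = np.arange(n) if m is None else rng.choice(n, size=m, replace=False)
    g = np.mean([grad_i(w, X[i], Y[i]) for i in idx])  # (1/m) Σ ∇l
    return w - eta * g
```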

32
Q

What does SGD help with

A

The noise in the single-sample gradient tends to help with escaping local minima

33
Q

What are SGD, GD and mini-batch GD all examples of

A

first-order gradient descent algorithms

34
Q

What are first order vs second order gd algorithms

A

First-order algorithms use only the gradient; second-order algorithms also use information about the curvature of the loss function (the second derivative)
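A 1-D illustration of the difference, using an illustrative quadratic loss l(w) = (w - 3)^2:

```python
def l_prime(w):         # first derivative (gradient)
    return 2 * (w - 3)

def l_double_prime(w):  # second derivative (curvature)
    return 2.0

w, eta = 0.0, 0.1
w_first = w - eta * l_prime(w)                 # first-order step -> 0.6
w_second = w - l_prime(w) / l_double_prime(w)  # second-order (Newton) step -> 3.0
# On this quadratic, the curvature-aware step jumps straight to the minimum.
```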

35
Q

When and why are decision trees useful

A

Fast to train and deploy
good for tabular data (anything that fits in a spreadsheet)
not for images, speech, or video

36
Q

How do decision trees work

A

Recursively split the data into subsets based on the values of the input features
Then fit a simple model (constant label / linear regression) to each subset
When a new input x is observed, traverse the tree to find the right prediction
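A minimal sketch, assuming scikit-learn is available; the toy data and max_depth value are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 4))     # toy tabular data
y = X[:, 0] + 0.1 * rng.normal(size=100)

# max_depth is the main complexity parameter (see card 38)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

x_new = np.array([[0.9, 0.2, 0.5, 0.1]])
print(tree.predict(x_new))  # traverses the tree to a leaf and returns its mean y
```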

37
Q

How do classification and regression trees compare

A

They use the same branching,
e.g. X1 > 0.83 (yes or no)
then X4 < 0.3 … etc

Except regression trees predict a continuous value at the leaf node, usually the mean of the target variable within that node;
for classification the leaf node predicts a class, e.g. class 3

38
Q

What is the main parameter of decision trees

A

Depth
like k in k-NN, it is the main complexity-controlling hyperparameter
increased depth -> increased complexity

39
Q

What is the update rule updating/optimising

A

The parameters w
It determines how the parameters should be adjusted in order to minimise the loss function