Chapter 3 - Linear Models Flashcards

(56 cards)

1
Q

what is a decision stump

A

a classifier that uses a single feature

its parameter is the threshold t at which the decision switches from 0 to 1

2
Q

what is the decision boundary in a decision stump

A

the point at which the decision switches,

the threshold, t

3
Q

what is the learning algorithm for a decision stump

A

for t varied between min(x) and max(x):
    count the classification errors at this threshold
    if the error count is less than minErr, store it as minErr and keep this t
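
A minimal runnable sketch of this search in Python; the function name, the NumPy usage, and the 100-step grid are illustrative assumptions rather than anything specified on the card:

```python
import numpy as np

def fit_stump(x, y, n_steps=100):
    """Try thresholds t between min(x) and max(x), keeping the one
    with the fewest misclassifications."""
    best_t, min_err = x.min(), np.inf
    for t in np.linspace(x.min(), x.max(), n_steps):
        yhat = (x > t).astype(int)   # predict 1 above the threshold
        errors = np.sum(yhat != y)   # count errors at this t
        if errors < min_err:         # store the best threshold so far
            min_err, best_t = errors, t
    return best_t, min_err
```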

4
Q

what does linearly separable mean?

A

we can fit a linear model (i.e. draw a linear decision boundary) and perfectly separate the classes

5
Q

what is the limitation of a decision stump?

A

it works only on a single feature

6
Q

what is the discriminant function f(x)=?

A

(sum for all features: wjxj) - t
or in matrix notation:
wTx - t

7
Q

what does the discriminant function describe, geometrically?

A

the equation of a (hyper)plane: a line in 2D, a plane in 3D

8
Q

what is the gradient and y intercept of the decision boundary from the discriminant function in two dimensions?

A

set f(x) equal to zero and rearrange for x2:
m = -(w1/w2)
c = t/w2
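
The algebra behind these values: set the 2-D discriminant from card 6 to zero and rearrange for x2.

```latex
w_1 x_1 + w_2 x_2 - t = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 + \frac{t}{w_2}
\qquad\text{so } m = -\frac{w_1}{w_2},\; c = \frac{t}{w_2}
```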

9
Q

what is the perceptron decision rule?

A

if f(x) > 0 then yhat=1 else 0

10
Q

what is the perceptron parameter update rule, with sigmoid error?

A

wj = wj - (lrate)(yhat - y)(xj)

11
Q

what is the perceptron learning algorithm?

A

repeat:
    for each training sample:
        update each weight: wj = wj - lrate(yhat - y)(xj)
        update the threshold: t = t + lrate(yhat - y)
until the changes to the parameters are zero
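
A sketch of this loop in Python using the update rules above; the zero initialisation and the max_epochs safety cap are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, lrate=0.1, max_epochs=100):
    """Sweep the data, nudging w and t after each misclassified sample,
    until a full pass makes no changes."""
    w, t = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            yhat = 1 if xi @ w - t > 0 else 0     # decision rule from card 9
            if yhat != yi:
                w = w - lrate * (yhat - yi) * xi  # weight update
                t = t + lrate * (yhat - yi)       # threshold update
                changed = True
        if not changed:  # changes to parameters are zero: converged
            break
    return w, t
```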

12
Q

what is learning rate?

A

the step size of the update

13
Q

what is the limitation of the perceptron algorithm?

A

can only solve linearly separable problems

14
Q

if …. the perceptron algorithm is guaranteed to solve the problem

A

the data is linearly separable

15
Q

what is the perceptron convergence theorem?

A

If a dataset is linearly separable, the perceptron learning algorithm will converge to a perfect classification within a finite number of training steps

16
Q

a logistic regression model has the output f(x) = ?

A

f(x) = 1 / (1 + e^-z)

where z = wTx - t

17
Q

what is the name of the function that logistic regression uses?

A

sigmoid

18
Q

what is the decision rule for logistic regression?

A

if f(x) > 0.5 then yhat = 1 else 0

19
Q

what is loss?

A

the cost incurred by a model for a prediction it makes

20
Q

what loss function does logistic regression use?

A

log loss, or cross-entropy

21
Q

what is the equation for log loss (cross entropy), L(f(x),y) = ?

A

L(f(x), y) = -{y log f(x) + (1 - y) log(1 - f(x))}
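
A one-function Python sketch of this loss; the eps clipping is an added numerical safeguard, not part of the card's formula:

```python
import numpy as np

def log_loss(fx, y, eps=1e-12):
    """Cross-entropy for a prediction fx in (0, 1) and a label y in {0, 1}."""
    fx = np.clip(fx, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(fx) + (1 - y) * np.log(1 - fx))
```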

22
Q

what is an error function?

A

when the loss function is summed or averaged over all data points

23
Q

what is the error function (summed log loss) for logistic regression E=?

A
E = -(sum for each i) {yi log(f(xi)) + (1 - yi) log(1 - f(xi))}

24
Q

what are the names the error function for logistic regression is known by?

A

cross entropy error

negative log likelihood

25

Q

what is the rule of gradient descent, in words?

A

in order to decrease the error, we should update the parameters in the direction of the negative gradient

26

Q

what is the partial derivative of the cross entropy error function with the logistic regression model, with respect to parameter wj, dE/dwj = ?

A

dE/dwj = dE/df(x) × df(x)/dz × dz/dwj = (sum over i) (f(xi) - yi)xij

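The same chain rule typeset for readability, with z = wTx - t as defined on card 16:

```latex
\frac{\partial E}{\partial w_j}
= \frac{\partial E}{\partial f}\cdot
  \frac{\partial f}{\partial z}\cdot
  \frac{\partial z}{\partial w_j}
= \sum_i \bigl(f(x_i) - y_i\bigr)\,x_{ij}
```
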
27

Q

what is the algorithm for gradient descent?

A

repeat:
    for each parameter j:
        wj = wj - lrate × dE/dwj
until the termination criteria are met

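A generic Python sketch of this loop; grad_E(w) returning the gradient vector, plus the tol and max_iters termination criteria, are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_E, w0, lrate=0.1, tol=1e-6, max_iters=1000):
    """Step every parameter against its gradient until updates are tiny."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(max_iters):
        step = lrate * grad_E(w)        # lrate x dE/dwj for every j at once
        w = w - step
        if np.max(np.abs(step)) < tol:  # termination criterion met
            break
    return w
```
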
28

Q

what is stochastic gradient descent?

A

compute the gradient for each example one by one, modifying the parameters after each

29

Q

why is stochastic gradient descent often applied?

A

it works more effectively on very large datasets

30

Q

what is the algorithm for logistic regression?

A

```
t = random
w = random vector
set max epochs
lrate = 0.1
for each epoch:
    for each training example x:
        for each parameter j:
            wj = wj - lrate(f(x) - y)(xj)
        t = t + lrate(f(x) - y)
```

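The same algorithm as runnable Python; the random seed and NumPy details are illustrative assumptions, while the updates follow the pseudocode above:

```python
import numpy as np

def train_logreg_sgd(X, y, lrate=0.1, max_epochs=100, seed=0):
    """SGD for logistic regression with f(x) = sigmoid(w.x - t)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])  # w = random vector
    t = rng.normal()                 # t = random
    for _ in range(max_epochs):
        for xi, yi in zip(X, y):
            fx = 1.0 / (1.0 + np.exp(-(xi @ w - t)))  # sigmoid (card 16)
            w = w - lrate * (fx - yi) * xi            # wj update
            t = t + lrate * (fx - yi)                 # t update
    return w, t
```
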
31

Q

what loss function does the perceptron use?

A

hinge loss

32

Q

give the equation for hinge loss

A

sum: -y(wx + b) = sum: -y(yhat)

summing over ONLY the misclassified samples (those where y(yhat) is negative)

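The same equation typeset, where M (a notational assumption) is the set of misclassified samples:

```latex
L = \sum_{i \in \mathcal{M}} -\,y_i\,(\mathbf{w}^\top\mathbf{x}_i + b)
```
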
33

Q

stochastic gradient descent is also known as

A

mini-batch

34

Q

gradient based optimisation is possible when the loss function is

A

differentiable

35

Q

what are the 4 steps of gradient based minimisation?

A

1. test for convergence
2. compute the search direction
3. compute the step length
4. update the variables

36

Q

when we perform minibatch SGD, what do we multiply sum: dL/dW by to scale it?

A

n / |S|, where n is the number of samples and |S| is the batch size

37

Q

modern machine learning has given rise to what kind of programming?

A

differentiable programming

38

Q

what is differentiable programming?

A

if the performance of a computer program can be represented by a loss function, we can seek to optimise that program via its parameters using a gradient-based approach

39

Q

the perceptron algorithm is a ... ... classification algorithm

A

deterministic, binary

40

Q

what is a generative process?

A

describes the way in which data is generated

41

Q

what is the perceptron weight update, with hinge loss?

A

wj = wj - lrate(-yhat × y)(xj)

or, applied only to the misclassified samples:

wj = wj - lrate(-y)(xj) = wj + lrate(y)(xj)

42

Q

we make the iid assumption for logistic regression, this is that

A

our data are independent and identically distributed (iid)

43

Q

the iid assumption means that the outputs ...

A

the outputs do not depend on multiple inputs nor on other outputs

44

Q

the iid assumption we make for logistic regression means

A

we can perform maximum likelihood estimation, i.e. we can work out the best parameters from the data by maximising the likelihood p(D | w) = (product over all i) p(yi | xi, w)

45

Q

what is the loss function (negative log-likelihood) for SGD for logistic regression?

A

E = -(1/n) (sum for i = 1..n) [yi log f(xi) + (1 - yi) log(1 - f(xi))]

the same as the summed log loss, but with the 1/n factor to rescale by sample size

46

Q

we can use logistic regression to work out p(y=1 | x, w) =

A

f(x) = 1 / (1 + e^-z)

47

Q

the decision boundary for logistic regression is given by

A

d = 1 / (1 + e^-z)

wx + b = log(d / (1 - d))

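Solving the sigmoid for z makes the boundary explicit; plugging in the standard threshold d = 0.5 from card 18 recovers wx + b = 0:

```latex
d = \frac{1}{1 + e^{-z}},\quad z = \mathbf{w}^\top\mathbf{x} + b
\quad\Longrightarrow\quad
\mathbf{w}^\top\mathbf{x} + b = \log\frac{d}{1-d}
\qquad (d = 0.5 \Rightarrow \mathbf{w}^\top\mathbf{x} + b = 0)
```
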
48

Q

what are the 3 data properties that will cause practical challenges for a logistic regression model?

A

* imbalanced data - anything using MLE will try to fit the dominant class
* multicollinearity - two or more predictor variables are highly linearly related
* completely separated training data

49

Q

what step can we take to minimise the impact of multicollinearity in logistic regression?

A

feature selection

50

Q

benefits of logistic regression (5)

A

* Efficient and straightforward
* Doesn't require large computation
* Easy to implement, easily interpretable
* Used widely by data analysts and scientists
* Provides a probability for predictions and observations

51

Q

limitations of logistic regression (2 general, 3 data properties)

A

* Linear decision boundaries
* Inability to handle complex inputs (e.g. an image)
* Multicollinearity (correlated inputs)
* Sparseness (lots of zero or identical inputs)
* Complete separation (it is not a probabilistic problem!)

52

Q

limitations of perceptron (4)

A

* challenges with high-dimensional, multiple correlated input features
* linear
* convergence can be tricky depending on the variant of perceptron used
* deterministic

53

Q

which algorithm, perceptron or logistic regression, doesn't converge?

A

logistic regression

54

Q

why does logistic regression never converge?

A

we can never reach the true decision boundary: we are trying to fit an s-shaped curve to a straight, hard boundary. Eventually we get w1 = infinity; this is the closest we will get.

55

Q

what property of the logarithm means we can take the log of the likelihood?

A

the logarithm is a monotonically increasing function, so it doesn't affect where our max/min is

56

Q

what is a monotonically increasing function, what does it mean?

A

if the value on the x-axis increases, the value on the y-axis also increases