Chapter 3 - Linear Models Flashcards

(56 cards)

1
Q

what is a decision stump

A

a classifier that uses a single feature

its parameter is the threshold t at which the decision switches from 0 to 1

2
Q

what is the decision boundary in a decision stump

A

the point at which the decision switches,

the threshold, t

3
Q

what is the learning algorithm for a decision stump

A

for t varied between min(x) and max(x):
    count the classification errors at this threshold
    if the error count is less than minErr, store it as minErr and keep this t
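
A minimal runnable sketch of this search in Python; the function name, the NumPy usage, and the 100-step grid are illustrative assumptions rather than anything specified on the card:

```python
import numpy as np

def fit_stump(x, y, n_steps=100):
    """Try thresholds t between min(x) and max(x), keeping the one
    with the fewest misclassifications."""
    best_t, min_err = x.min(), np.inf
    for t in np.linspace(x.min(), x.max(), n_steps):
        yhat = (x > t).astype(int)   # predict 1 above the threshold
        errors = np.sum(yhat != y)   # count errors at this t
        if errors < min_err:         # store the best threshold so far
            min_err, best_t = errors, t
    return best_t, min_err
```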

4
Q

what does linearly separable mean?

A

we can fit a linear model (i.e. draw a linear decision boundary) and perfectly separate the classes

5
Q

what is the limitation of a decision stump?

A

it works only on a single feature

6
Q

what is the discriminant function f(x)=?

A

(sum for all features: wjxj) - t
or in matrix notation:
wTx - t

7
Q

what does the discriminant function describe, geometrically?

A

the equation of a (hyper)plane: a line in 2D, a plane in 3D

8
Q

what is the gradient and y intercept of the decision boundary from the discriminant function in two dimensions?

A

set f(x) equal to zero and rearrange for x2:
m = -(w1/w2)
c = t/w2
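
The algebra behind these values: set the 2-D discriminant from card 6 to zero and rearrange for x2.

```latex
w_1 x_1 + w_2 x_2 - t = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 + \frac{t}{w_2}
\qquad\text{so } m = -\frac{w_1}{w_2},\; c = \frac{t}{w_2}
```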

9
Q

what is the perceptron decision rule?

A

if f(x) > 0 then yhat=1 else 0

10
Q

what is the perceptron parameter update rule, with sigmoid error?

A

wj = wj - (lrate)(yhat - y)(xj)

11
Q

what is the perceptron learning algorithm?

A

repeat:
    for each training sample:
        update each weight: wj = wj - lrate(yhat - y)(xj)
        update the threshold: t = t + lrate(yhat - y)
until the changes to the parameters are zero
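
A sketch of this loop in Python using the update rules above; the zero initialisation and the max_epochs safety cap are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, lrate=0.1, max_epochs=100):
    """Sweep the data, nudging w and t after each misclassified sample,
    until a full pass makes no changes."""
    w, t = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            yhat = 1 if xi @ w - t > 0 else 0     # decision rule from card 9
            if yhat != yi:
                w = w - lrate * (yhat - yi) * xi  # weight update
                t = t + lrate * (yhat - yi)       # threshold update
                changed = True
        if not changed:  # changes to parameters are zero: converged
            break
    return w, t
```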

12
Q

what is learning rate?

A

the step size of the update

13
Q

what is the limitation of the perceptron algorithm?

A

can only solve linearly separable problems

14
Q

if …. the perceptron algorithm is guaranteed to solve the problem

A

the data is linearly separable

15
Q

what is the perceptron convergence theorem?

A

If a dataset is linearly separable, the perceptron learning algorithm will converge to a perfect classification within a finite number of training steps

16
Q

a logistic regression model has the output f(x) = ?

A

f(x) = 1 / (1 + e^-z)

where z = wTx - t

17
Q

what is the name of the function that logistic regression uses?

A

sigmoid

18
Q

what is the decision rule for logistic regression?

A

if f(x) > 0.5 then yhat = 1 else 0

19
Q

what is loss?

A

the cost incurred by a model for a prediction it makes

20
Q

what loss function does logistic regression use?

A

log loss, or cross-entropy

21
Q

what is the equation for log loss (cross entropy), L(f(x),y) = ?

A

L(f(x), y) = -{y log f(x) + (1 - y) log(1 - f(x))}
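
A one-function Python sketch of this loss; the eps clipping is an added numerical safeguard, not part of the card's formula:

```python
import numpy as np

def log_loss(fx, y, eps=1e-12):
    """Cross-entropy for a prediction fx in (0, 1) and a label y in {0, 1}."""
    fx = np.clip(fx, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(fx) + (1 - y) * np.log(1 - fx))
```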

22
Q

what is an error function?

A

when the loss function is summed or averaged over all data points

23
Q

what is the error function (summed log loss) for logistic regression E=?

A
E = -(sum for each i) {yi log(f(xi)) + (1 - yi) log(1 - f(xi))}

24
Q

what are the names the error function for logistic regression is known by?

A

cross entropy error

negative log likelihood

25

Q

what is the rule of gradient descent, in words?

A

in order to decrease the error, we should update the parameters in the direction of the negative gradient

26

Q

what is the partial derivative of the cross entropy error function with the logistic regression model, with respect to parameter wj, dE/dwj = ?

A

dE/dwj = dE/df(x) × df(x)/dz × dz/dwj = (sum over i) (f(xi) - yi)xij

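The same chain rule typeset for readability, with z = wTx - t as defined on card 16:

```latex
\frac{\partial E}{\partial w_j}
= \frac{\partial E}{\partial f}\cdot
  \frac{\partial f}{\partial z}\cdot
  \frac{\partial z}{\partial w_j}
= \sum_i \bigl(f(x_i) - y_i\bigr)\,x_{ij}
```
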
27

Q

what is the algorithm for gradient descent?

A

repeat:
    for each parameter j:
        wj = wj - lrate × dE/dwj
until the termination criteria are met

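A generic Python sketch of this loop; grad_E(w) returning the gradient vector, plus the tol and max_iters termination criteria, are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_E, w0, lrate=0.1, tol=1e-6, max_iters=1000):
    """Step every parameter against its gradient until updates are tiny."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(max_iters):
        step = lrate * grad_E(w)        # lrate x dE/dwj for every j at once
        w = w - step
        if np.max(np.abs(step)) < tol:  # termination criterion met
            break
    return w
```
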
28

Q

what is stochastic gradient descent?

A

compute the gradient for each example one by one, modifying the parameters after each

29

Q

why is stochastic gradient descent often applied?

A

it works more effectively on very large datasets

30

Q

what is the algorithm for logistic regression?

A

```
t = random
w = random vector
set max epochs
lrate = 0.1
for each epoch:
    for each training example x:
        for each parameter j:
            wj = wj - lrate(f(x) - y)(xj)
        t = t + lrate(f(x) - y)
```

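The same algorithm as runnable Python; the random seed and NumPy details are illustrative assumptions, while the updates follow the pseudocode above:

```python
import numpy as np

def train_logreg_sgd(X, y, lrate=0.1, max_epochs=100, seed=0):
    """SGD for logistic regression with f(x) = sigmoid(w.x - t)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])  # w = random vector
    t = rng.normal()                 # t = random
    for _ in range(max_epochs):
        for xi, yi in zip(X, y):
            fx = 1.0 / (1.0 + np.exp(-(xi @ w - t)))  # sigmoid (card 16)
            w = w - lrate * (fx - yi) * xi            # wj update
            t = t + lrate * (fx - yi)                 # t update
    return w, t
```
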
31

Q

what loss function does the perceptron use?

A

hinge loss

32

Q

give the equation for hinge loss

A

sum: -y(wx + b) = sum: -y(yhat)

summing over ONLY the misclassified samples (those where y(yhat) is negative)

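The same equation typeset, where M (a notational assumption) is the set of misclassified samples:

```latex
L = \sum_{i \in \mathcal{M}} -\,y_i\,(\mathbf{w}^\top\mathbf{x}_i + b)
```
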
33

Q

stochastic gradient descent is also known as

A

mini-batch

34

Q

gradient based optimisation is possible when the loss function is

A

differentiable

35

Q

what are the 4 steps of gradient based minimisation?

A

1. test for convergence
2. compute the search direction
3. compute the step length
4. update the variables

36

Q

when we perform minibatch SGD, what do we multiply sum: dL/dW by to scale it?

A

n / |S|, where n is the number of samples and |S| is the batch size

37

Q

modern machine learning has given rise to what kind of programming?

A

differentiable programming

38

Q

what is differentiable programming?

A

if the performance of a computer program can be represented by a loss function, we can seek to optimise that program via its parameters using a gradient-based approach

39

Q

the perceptron algorithm is a ... ... classification algorithm

A

deterministic, binary

40

Q

what is a generative process?

A

describes the way in which data is generated

41

Q

what is the perceptron weight update, with hinge loss?

A

wj = wj - lrate(-yhat × y)(xj)

or, applied only to the misclassified samples:

wj = wj - lrate(-y)(xj) = wj + lrate(y)(xj)

42

Q

we make the iid assumption for logistic regression, this is that

A

our data are independent and identically distributed (iid)

43

Q

the iid assumption means that the outputs ...

A

the outputs do not depend on multiple inputs nor on other outputs

44

Q

the iid assumption we make for logistic regression means

A

we can perform maximum likelihood estimation, i.e. we can work out the best parameters from the data by maximising the likelihood p(D | w) = (product over all i) p(yi | xi, w)

45

Q

what is the loss function (negative log-likelihood) for SGD for logistic regression?

A

E = -(1/n) (sum for i = 1..n) [yi log f(xi) + (1 - yi) log(1 - f(xi))]

the same as the summed log loss, but with the 1/n factor to rescale by sample size

46

Q

we can use logistic regression to work out p(y=1 | x, w) =

A

f(x) = 1 / (1 + e^-z)

47

Q

the decision boundary for logistic regression is given by

A

d = 1 / (1 + e^-z)

wx + b = log(d / (1 - d))

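Solving the sigmoid for z makes the boundary explicit; plugging in the standard threshold d = 0.5 from card 18 recovers wx + b = 0:

```latex
d = \frac{1}{1 + e^{-z}},\quad z = \mathbf{w}^\top\mathbf{x} + b
\quad\Longrightarrow\quad
\mathbf{w}^\top\mathbf{x} + b = \log\frac{d}{1-d}
\qquad (d = 0.5 \Rightarrow \mathbf{w}^\top\mathbf{x} + b = 0)
```
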
48

Q

what are the 3 data properties that will cause practical challenges for a logistic regression model?

A

* imbalanced data - anything using MLE will try to fit the dominant class
* multicollinearity - two or more predictor variables are highly linearly related
* completely separated training data

49

Q

what step can we take to minimise the impact of multicollinearity in logistic regression?

A

feature selection

50

Q

benefits of logistic regression (5)

A

* Efficient and straightforward
* Doesn't require large computation
* Easy to implement, easily interpretable
* Used widely by data analysts and scientists
* Provides a probability for predictions and observations

51

Q

limitations of logistic regression (2 general, 3 data properties)

A

* Linear decision boundaries
* Inability to handle complex inputs (e.g. an image)
* Multicollinearity (correlated inputs)
* Sparseness (lots of zero or identical inputs)
* Complete separation (it is not a probabilistic problem!)

52

Q

limitations of perceptron (4)

A

* challenges with high-dimensional, multiple correlated input features
* linear
* convergence can be tricky depending on the variant of perceptron used
* deterministic

53

Q

which algorithm, perceptron or logistic regression, doesn't converge?

A

logistic regression

54

Q

why does logistic regression never converge?

A

we can never reach the true decision boundary: we are trying to fit an s-shaped curve to a straight, hard boundary. Eventually we get w1 = infinity; this is the closest we will get.

55

Q

what property of the logarithm means we can take the log of the likelihood?

A

the logarithm is a monotonically increasing function, so it doesn't affect where our max/min is

56

Q

what is a monotonically increasing function, what does it mean?

A

if the value on the x-axis increases, the value on the y-axis also increases