Linear Regression Flashcards

(39 cards)

1
Q

What is gradient descent?

A

a generic optimization algorithm that iteratively tweaks the model's parameters in order to minimize a cost function

2
Q

What is batch gradient descent?

A

it computes the gradients over the whole training set at every step, updating all the model parameters in one go

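To make cards 1-3 concrete, here is a minimal batch gradient descent sketch for linear regression in NumPy; the toy data, learning rate, and epoch count are assumed values for illustration only:

```python
import numpy as np

# toy data: y ≈ 4 + 3x plus noise (assumed, for illustration)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]  # prepend bias feature x0 = 1
eta = 0.1                          # learning rate
m = len(X_b)

theta = rng.normal(size=(2, 1))    # random initialization
for epoch in range(1000):
    # gradients computed over the WHOLE training set at every step
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients       # one tweak of all parameters
```
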
3
Q

what are the downsides to batch gradient descent?

A

because it uses the whole training set at each step, it is very slow on large training sets

4
Q

what happens if the learning rate is too high in gradient descent?

A

the algorithm can overshoot the minimum and diverge, jumping across the valley and failing to find a good solution

5
Q

what happens if the learning rate is too low in gradient descent?

A

the algorithm needs many iterations to converge, so training takes a long time

6
Q

how does gradient descent perform when features have different scales?

A

the cost function becomes an elongated bowl, so the algorithm bounces around and takes much longer to reach the minimum, making it slower

7
Q

how does gradient descent perform when features have similar scales?

A

the cost function is closer to a round bowl, so the algorithm heads more or less straight toward the minimum without bouncing around, making it faster

8
Q

which is better for Linear Regression on a large dataset: Gradient Descent or the Normal Equation? why?

A

Gradient Descent, because it scales much better; the Normal Equation's matrix computations become very slow as the data grows

9
Q

out of gradient descent and the normal equation, which is faster, and why?

A

gradient descent, because (in its stochastic form) it handles instances one at a time instead of performing expensive matrix computations over the whole set

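For contrast with the gradient descent sketch above, here is the Normal Equation's closed-form solve on the same kind of assumed toy data; the matrix computations here are what gets expensive as the problem grows:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # prepend bias feature x0 = 1

# closed form: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
# np.linalg.pinv(X_b) @ y is a numerically safer alternative
```
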
10
Q

what is a cost function?

A

a function that measures how badly the model fits the training data (for linear regression, typically the mean squared error); training works by minimizing it

11
Q

what is stochastic gradient descent?

A

as opposed to batch GD, which uses the whole training set at each step, SGD picks a random instance at each step and computes the gradients on that single instance alone, making it much faster and better suited to big training sets

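A minimal SGD sketch, assuming the same toy data as the earlier batch GD example; the learning-schedule constants t0 and t1 are illustrative choices, not fixed values:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

t0, t1 = 5, 50  # learning-schedule hyperparameters (assumed)

def learning_schedule(t):
    return t0 / (t + t1)  # gradually shrink the learning rate

theta = rng.normal(size=(2, 1))
for epoch in range(50):
    for i in range(m):
        # pick ONE random instance and compute gradients on it alone
        idx = rng.integers(m)
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta -= learning_schedule(epoch * m + i) * gradients
```
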
12
Q

which GD algorithm is better for large training sets?

A

SGD, because it handles instances one at a time, instead of using the whole set at each step like batch GD

13
Q

what happens if you reduce SGD's learning rate too slowly?

A

the algorithm keeps jumping around the minimum for ages and may end up with a suboptimal solution

14
Q

what happens if you reduce SGD's learning rate too quickly?

A

it may get stuck in a local minimum, or end up frozen halfway to the minimum

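Cards 13-14 are about how fast that learning schedule decays. A tiny illustration with made-up constants:

```python
# t0 / (t + t1): a large t1 decays slowly (keeps jumping around for ages),
# a tiny t1 decays fast (steps shrink early, risking a frozen solution)
steps = range(0, 5000, 1000)
print([round(5 / (t + 500), 4) for t in steps])  # slow decay
print([round(5 / (t + 5), 4) for t in steps])    # fast decay
```
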
15
Q

what is mini-batch gradient descent?

A

it computes the gradients on small random sets of instances called mini-batches; a middle ground between batch GD (whole set) and SGD (single instance)

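A mini-batch GD sketch under the same assumed toy setup; batch_size is an illustrative value:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

batch_size = 20  # assumed size
theta = rng.normal(size=(2, 1))
eta = 0.05

for epoch in range(50):
    for start in range(0, m, batch_size):
        batch = rng.permutation(m)[:batch_size]  # small random subset
        xb, yb = X_b[batch], y[batch]
        gradients = (2 / batch_size) * xb.T @ (xb @ theta - yb)
        theta -= eta * gradients
```
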
16
Q

what is polynomial regression?

A

when the data is too complex for a straight line, you can add powers of each feature as new features, then train a linear regression model on this extended set of features

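A sketch of card 16's recipe with scikit-learn, assuming quadratic toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

# add powers of each feature as new features...
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x, x^2

# ...then fit ordinary linear regression on the extended features
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
```
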
17
Q

what is good about polynomial regression?

A

it can find relationships between features, and it can be used when a straight line won't fit the data

18
Q

what are some regularized linear regression models?

A

ridge, lasso, elastic net

19
Q

what is ridge regression?

A

a regularized version of linear regression that forces the learning algorithm to fit the data while also keeping the model weights as small as possible

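A minimal ridge sketch with scikit-learn; alpha=1.0 is an arbitrary illustrative strength (sklearn's alpha parameter is the α these cards refer to):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = 2 * rng.random((50, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=50)

ridge_reg = Ridge(alpha=1.0)  # alpha is the regularization strength
ridge_reg.fit(X, y)
print(ridge_reg.intercept_, ridge_reg.coef_)
```
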
20
Q

what happens if ridge's α = 0?

A

it is just plain linear regression

21
Q

what happens if ridge's α is very large?

A

all weights end up very close to zero, resulting in a flat line through the data's mean

22
Q

what is an important step before using ridge regression?

A

scaling the data (for example with a StandardScaler), because ridge regression is sensitive to the scale of the input features
23
Q

what should you do before ridge regression?

A

scale the features; like most regularized models, ridge is sensitive to the scale of the input features

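One way to do that scaling step in scikit-learn, with made-up data whose two features sit on wildly different scales:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.c_[rng.random(50) * 1000, rng.random(50)]  # very different scales
y = 0.001 * X[:, 0] + X[:, 1] + rng.normal(size=50)

# scale the features before they reach the regularized model
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
```
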
24
Q

what is lasso regression?

A

a regularized version of linear regression that tends to eliminate the weights of the least important features, automatically performing feature selection and outputting a sparse model

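A lasso sketch showing the sparse output the card describes; the data is synthetic, with only one genuinely useful feature:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(size=100)  # only feature 0 matters

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)  # weights of unimportant features pushed to 0
```
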
25
Q

what is good about lasso?

A

it automatically performs feature selection

26
Q

what is elastic net?

A

a middle ground between ridge and lasso; its regularization term is a mix of both, controlled by the mix ratio r

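In scikit-learn, the mix ratio r from cards 27-28 is the l1_ratio parameter; a sketch with assumed values:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(size=100)

# l1_ratio = r: 0 behaves like ridge, 1 behaves like lasso
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
```
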
27
Q

what happens if elastic net's r = 0?

A

it is equivalent to ridge regression

28
Q

what happens if elastic net's r = 1?

A

it is equivalent to lasso regression

29
Q

when should you use ridge?

A

as a default for linear regression

30
Q

when should you use lasso or elastic net?

A

when you suspect only a few features are useful, because both tend to reduce the weights of useless features

31
Q

when should you choose elastic net over lasso?

A

when the number of features is greater than the number of instances, or when several features are strongly correlated, because lasso may behave erratically in those cases

32
Q

why would you want to use ridge regression instead of linear regression?

A

a model with some regularization generally performs better than one without, so ridge is a good default over plain linear regression

33
Q

why would you want to use lasso instead of ridge regression?

A

lasso leads to a sparse model, which automatically performs feature selection; that is useful if you suspect that only a few features actually matter. when you are not sure, prefer ridge regression

34
Q

why would you want to use elastic net instead of lasso?

A

elastic net is generally preferred over lasso, since lasso may behave erratically when features outnumber instances or when several features are strongly correlated; if you want behaviour close to lasso's without that risk, use elastic net with the mix ratio r set close to 1

35
Q

when would you choose two logistic regression classifiers over one softmax regression classifier, and vice versa?

A

use separate logistic regression classifiers when the classes are not mutually exclusive; use softmax regression when they are

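A sketch of the softmax side of card 35 using scikit-learn's iris data (whose classes are mutually exclusive); for non-exclusive labels you would instead train one independent binary classifier per label:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# with 3+ mutually exclusive classes and the default lbfgs solver,
# LogisticRegression fits one softmax (multinomial) model
softmax_reg = LogisticRegression(max_iter=1000)
softmax_reg.fit(X, y)
print(softmax_reg.predict_proba(X[:1]))  # probabilities sum to 1
```
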
36
Q

what suffers from features having different scales?

A

Gradient Descent algorithms

37
Q

can GD get stuck in a local minimum when training a logistic regression model?

A

no, because its cost function is convex

38
Q

what does it mean when there is a large gap between the training error and the validation error?

A

if the validation error is much higher than the training error, the model is overfitting

39
Q

what does it mean when your validation error is much higher than your training error?

A

the model is overfitting