L4 Flashcards

(23 cards)

1
Q

What is the primary objective of Linear Regression?

A

Linear Regression models the relationship between features and a continuous target

Minimize the difference between the predicted ŷ and the actual y

This is achieved through the adjustment of weights and bias.

2
Q

What are the parameters of a Linear Regression model?

A

Weights (w) — show how much the prediction changes per unit of a feature → positive w (more of this feature, higher predicted y) / negative w (more of this feature, lower predicted y). The weights control the slope of the fitted line.

Bias (b) — sets the line's intercept: the predicted value when every feature is 0; changing b shifts the whole line up or down.

(In logistic regression the same parameters play analogous roles: w controls the tilt/steepness of the S-shaped curve, and b sets where the 50% probability point (decision boundary) lies: ŷ = 0.5 where wx + b = 0. Increase b (make it less negative) → the curve slides left → a smaller x reaches 50%. Decrease b (more negative) → the curve slides right → a larger x is needed.)

Weights indicate the importance of features, while bias adjusts the line's intercept.
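A minimal sketch of how w and b produce a prediction (assuming NumPy; the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical example: 2 features, hand-picked weights and bias
w = np.array([3.0, -1.5])   # positive weight raises y_hat, negative lowers it
b = 2.0                     # intercept: the prediction when all features are 0

x = np.array([1.0, 4.0])    # one sample
y_hat = np.dot(w, x) + b    # linear prediction: w·x + b
print(y_hat)                # 3.0*1.0 - 1.5*4.0 + 2.0 = -1.0
```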

3
Q

Define hyperparameters in the context of Linear Regression.

A

Variables whose values are set before the training process begins — they are not learned from the data.

Examples include the regularization strength and the learning rate (in other models, e.g. k-NN, the number of neighbors).

4
Q

What is a loss function?

A

Measures how far off a prediction is for a single training example.

It is what you are trying to minimize for a single training example to achieve your objective.

Example: squared loss, (ŷ − y)² for one example; averaging it over all examples gives the cost function.

5
Q

What is the cost function in Linear Regression?

A

Average of the loss over the entire training set, e.g. Mean Squared Error (MSE): MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²

The average squared difference between what we predicted and what it really was.

It quantifies the overall prediction error.
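A small sketch of MSE as code (assuming NumPy; the values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the per-example squared losses."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))  # (0.25 + 0 + 4) / 3 ≈ 1.417
```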

6
Q

What is an objective function?

A

Any function that you optimize during training (e.g. maximum likelihood, divergence between classes)
Note: a loss function is a part of a cost function, which is in turn a type of objective function.

Examples include maximum likelihood and divergence between classes.

7
Q

How does gradient descent work in Linear Regression?

A

Adjusts weights to minimize the cost function (optimization)

The “slope” (gradient) tells us how much we should adjust w and b to make the loss smaller.

It uses the slope to determine adjustments needed.
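A minimal sketch of gradient descent for single-feature linear regression (assuming NumPy; the MSE gradients ∂J/∂w = (2/n)Σ(ŷ−y)x and ∂J/∂b = (2/n)Σ(ŷ−y) are standard):

```python
import numpy as np

def gradient_step(w, b, X, y, lr=0.01):
    """One gradient-descent update for y = w*x + b under MSE."""
    error = (w * X + b) - y
    dw = 2 * np.mean(error * X)   # slope of the cost with respect to w
    db = 2 * np.mean(error)       # slope of the cost with respect to b
    return w - lr * dw, b - lr * db

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])     # true relationship: y = 2x
w, b = 0.0, 0.0
for _ in range(1000):
    w, b = gradient_step(w, b, X, y)
print(w, b)                       # w approaches 2, b approaches 0
```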

8
Q

What does the learning rate control in gradient descent?

A

The size of the steps taken along the gradient.

Controls how big each step is (too big = overshoot the minimum, too small = very slow convergence).

Affects the speed and stability of convergence.

9
Q

What is the purpose of logistic regression?

A

To model categorical outcomes, especially in binary classification

Linear regression cannot model categorical outcomes well (e.g., pass/fail) → logistic regression solves this using the sigmoid (logistic) function, which squashes the prediction into the range (0, 1).
- It assumes a particular functional form: a sigmoid applied to a linear function of the data.

It uses a sigmoid function to output probabilities.

10
Q

What is the output range of the logistic regression model?

A

(0, 1) → interpretable as a probability.

When z is large and positive → output close to 1 (yes)
When z is large and negative → output close to 0 (no)
When z = 0 → output = 0.5 (unsure)

Note: logistic regression can handle both continuous and discrete features, and can be extended to multi-class problems.

This range is interpretable as probabilities.

11
Q

What is the decision boundary in logistic regression?

A

The point where the predicted probability = 0.5, i.e. where wx + b = 0

It indicates the threshold for classifying outcomes.

12
Q

State the formula for the logistic function.

A

y = 1 / (1 + e^(-z))

where z = wᵀx + b is the linear combination of the inputs.
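The formula as code (assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ≈ [0.0000454, 0.5, 0.9999546]
```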

13
Q

What does cross-entropy loss measure in logistic regression?

A

The penalty for incorrect predictions, which is largest when the model is confidently wrong. For one example: L = −[y log ŷ + (1 − y) log(1 − ŷ)]

It emphasizes the cost of being wrong with high certainty.
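A sketch of that single-example loss as code (assuming NumPy; the probabilities are illustrative):

```python
import numpy as np

def cross_entropy(y_true, y_hat):
    """Binary cross-entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print(cross_entropy(1, 0.99))  # ≈ 0.01: confident and right -> tiny loss
print(cross_entropy(1, 0.01))  # ≈ 4.61: confident and wrong -> huge loss
```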

14
Q

Define entropy in the context of probability distributions.

A

Measures the uncertainty in a distribution: H(p) = −Σᵢ pᵢ log pᵢ (log base 2 gives entropy in bits)

Max entropy = most uncertain (e.g., [0.5, 0.5])
Min entropy = most certain (e.g., [1, 0])

Max entropy indicates maximum uncertainty, while min entropy indicates certainty.
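A quick check of the two extremes (assuming NumPy; log base 2, with 0·log 0 taken as 0):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum(p * log2(p))."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                  # skip zeros to avoid log(0)
    return -np.sum(nz * np.log2(nz))

print(entropy([0.5, 0.5]))  # 1.0 bit: maximum uncertainty
print(entropy([1.0, 0.0]))  # 0.0 bits: complete certainty
```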

15
Q

What is regularization used for?

A

To prevent overfitting, especially when training data is limited

Regularization is applied by adding a penalty term to the loss function → this penalty makes it “costly” for the model to have very large weights.

  • Prevents the model from being too sensitive to training data (prevents overfitting).
  • Encourages simpler models that are more likely to perform well on new data.
  • Helps especially when you have many features or limited data.

It adds a penalty term to the loss function to control weight sizes.
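A sketch of the penalty being added to the cost (L2 shown; the λ value and names are illustrative, assuming NumPy):

```python
import numpy as np

def ridge_cost(w, b, X, y, lam=0.1):
    """MSE plus an L2 penalty: large weights now make the cost 'expensive'."""
    y_hat = X @ w + b
    mse = np.mean((y_hat - y) ** 2)
    penalty = lam * np.sum(w ** 2)   # grows quadratically with weight size
    return mse + penalty

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(ridge_cost(np.array([0.5, 0.5]), 0.0, X, y))  # 1.25 (MSE) + 0.05 (penalty)
```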

16
Q

What are the two types of regularization mentioned?

A

L1 (Lasso) and L2 (Ridge)

L1 encourages sparsity; L2 discourages large weights.

17
Q

What does L1 regularization do?

A

Encourages some weights to become exactly zero.

  • Can be used for feature selection — less important features are “removed” by driving their weights to zero.

By contrast, L2:
  • Penalizes large weights, encouraging all weights to be small but not exactly zero.
  • Helps smooth the model and prevent it from fitting noise.

This can be useful for feature selection.
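A sketch contrasting the two on the same synthetic data (assuming scikit-learn's Lasso and Ridge; the α value is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1
ridge = Ridge(alpha=0.1).fit(X, y)   # L2

print(np.sum(lasso.coef_ == 0))  # irrelevant coefficients driven exactly to 0
print(np.sum(ridge.coef_ == 0))  # typically 0: weights shrink but stay nonzero
```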

18
Q

Describe the One-vs-Rest (OvR) approach in multi-class logistic regression.

A

Train a classifier for each class vs. all the others.

How it works:
For K classes, train K separate binary classifiers.
Classifier k predicts whether the sample is in class k or not.
Prediction:
For a new input, run all K classifiers and choose the class with the highest predicted probability.

Example:
For classes A, B, C:
Classifier 1: A vs. not A
Classifier 2: B vs. not B
Classifier 3: C vs. not C

Each classifier predicts membership in its respective class.
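A sketch using scikit-learn's OneVsRestClassifier around logistic regression (the iris dataset is a stand-in with K = 3 classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes -> 3 binary classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))         # 3: one "class k vs. rest" model each
print(ovr.predict(X[:2]))           # class with the highest probability wins
```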

19
Q

What is the One-vs-One (OvO) approach?

A

Train a classifier for each pair of classes.

How it works:
For K classes, train a separate classifier for each pair of classes (K(K−1)/2 classifiers in total).
Each classifier distinguishes between just two classes at a time.
Prediction:
For a new input, let all the classifiers vote, and pick the class that wins the most votes.
Less common for logistic regression; more common with SVMs.
E.g., for 4 classes: 1v2, 1v3, 1v4, 2v3, 2v4, 3v4 → 4·3/2 = 6 binary classifiers, each trained on only the fraction of the data belonging to its two classes.

Total classifiers = K(K−1)/2, where K is the number of classes.
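A quick sketch of the pair enumeration and the K(K−1)/2 count (standard library only):

```python
from itertools import combinations

classes = ["1", "2", "3", "4"]
pairs = list(combinations(classes, 2))   # every unordered pair of classes
print(pairs)       # [('1', '2'), ('1', '3'), ('1', '4'), ('2', '3'), ('2', '4'), ('3', '4')]
print(len(pairs))  # 6 == 4 * 3 / 2
```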

20
Q

List the pros of Logistic Regression.

A
  • Quick to train
  • Extendable to multi-class problems
  • Less prone to overfitting when regularization is used
  • Coefficients are interpretable

These features make it a popular choice for classification tasks.

21
Q

List the cons of Logistic Regression.

A
  • Assumes a linear decision boundary
  • Not suitable for very complex decision surfaces

These limitations can affect performance in certain scenarios.

22
Q

What is the goal of gradient descent?

A
  • Find weights w that minimize the cross-entropy loss
  • Use the chain rule to differentiate the sigmoid and the loss
  • Learning rate λ: controls the step size

If the true label is 1:
- If the prediction ŷ is close to 1, loss ≈ 0 (good!)
- If the prediction ŷ is close to 0, loss is very large (bad!)

If the true label is 0:
- If the prediction ŷ is close to 0, loss ≈ 0 (good!)
- If the prediction ŷ is close to 1, loss is very large (bad!)

Interpretation:
The further your predicted probability is from the actual class, the more you’re penalized.
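Putting the pieces together — a minimal logistic-regression training loop (assuming NumPy; the data and learning rate are illustrative). Via the chain rule, the gradient of the cross-entropy through the sigmoid collapses to (ŷ − y)·x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-feature data: label is 1 roughly when x > 2 (hypothetical)
X = np.array([[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = np.zeros(X.shape[1]), 0.0
lr = 0.5                              # learning rate (the card's lambda)

for _ in range(2000):
    y_hat = sigmoid(X @ w + b)
    dw = X.T @ (y_hat - y) / len(y)   # chain rule: sigmoid + cross-entropy
    db = np.mean(y_hat - y)
    w -= lr * dw
    b -= lr * db

X_new = np.array([[1.0], [3.0]])
print(sigmoid(X_new @ w + b))         # low probability for x=1, high for x=3
```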

23
Q

What is the softmax approach?

A
  • Multinomial logistic regression: a single model for all classes, rather than many binary classifiers
  • Directly computes a probability for each class via the softmax function: softmax(z)_k = e^(z_k) / Σ_j e^(z_j)
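A sketch of softmax with the usual max-subtraction for numerical stability (assuming NumPy):

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    z = z - np.max(z)   # stability trick: does not change the result
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```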