Week 2: Regression & Classification (Linear & Nonlinear Models) Flashcards

1
Q

Linear Classification Models

A

Linear classification models draw linear decision boundaries that divide the feature space into regions, with each region assigned to a class.

2
Q

Error Rate for Classification Models

A

Error rate = 1 - \frac{1}{m}\sum_{i=1}^m \text{score}_i, where \text{score}_i = 0 for a misclassified instance and 1 for a correctly classified instance.

3
Q

Error Rate for Regression Models

A

Error rate = \sum_{i=1}^m (y_i - \hat{y}_i)^2, the sum of squared differences between the actual and predicted values.

4
Q

Training Models

A

This applies to both classification and regression models.

Training:
1. Select the training set
2. Initialise model parameters
3. Apply the model to all training set instances
4. Compute the error rate
5. Adjust the parameters to obtain a model with a lower error
6. Repeat from step 3 until the desired error rate is reached
7. Output the training error

Evaluation:
1. Select the test set
2. Apply the model to all test set instances
3. Compute the error rate
4. Output the evaluation error
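
A minimal sketch of this loop for a linear model with a squared-error measure (the function names, update step, and stopping threshold below are illustrative assumptions, not part of the card):

import numpy as np

def train(X, y, alpha=0.01, target_error=1e-3, max_iters=1000):
    """Fit weights by repeatedly applying the model and adjusting the parameters."""
    w = np.zeros(X.shape[1])                      # 2. initialise model parameters
    error = np.inf
    for _ in range(max_iters):
        y_hat = X @ w                             # 3. apply the model to all training instances
        error = np.mean((y - y_hat) ** 2)         # 4. compute the error rate
        if error < target_error:                  # 6. stop once the error is acceptable
            break
        w += alpha * X.T @ (y - y_hat) / len(y)   # 5. adjust the parameters to lower the error
    return w, error                               # 7. output the training error

def evaluate(w, X_test, y_test):
    """Apply the trained model to the test set and output the evaluation error."""
    return np.mean((y_test - X_test @ w) ** 2)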

5
Q

Cross-validation

A
  1. Split the dataset into N approximately equal-sized folds.
  2. Perform N repetitions, each using one fold for testing and the remaining folds for training.
  3. Compute the error rate for each repetition and average the N results to yield the overall error rate.
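
A minimal sketch of N-fold cross-validation, assuming NumPy arrays and a generic fit/predict pair (the helper names are illustrative):

import numpy as np

def cross_validate(X, y, fit, predict, n_folds=5):
    """Average the test error over n_folds train/test splits."""
    indices = np.arange(len(y))
    folds = np.array_split(indices, n_folds)      # 1. split into ~equal-sized folds
    errors = []
    for test_idx in folds:                        # 2. each fold takes a turn as the test set
        train_idx = np.setdiff1d(indices, test_idx)
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        errors.append(np.mean((y[test_idx] - y_hat) ** 2))   # per-fold error
    return np.mean(errors)                        # 3. average to get the overall error rate
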
6
Q

Spurious Correlations

A

A correlation between two variables does not, by itself, imply a causal relationship between them.

7
Q

Linear Regression

A

\hat{y}_i = x_i w = w_0 + \sum_{j=1}^n w_j x_{i,j}
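
A minimal NumPy sketch of this prediction, assuming the intercept w_0 is stored separately from the remaining weights (illustrative names and values):

import numpy as np

def predict_linear(X, w0, w):
    """y_hat_i = w0 + sum_j w_j * x_{i,j}, computed for every row of X at once."""
    return w0 + X @ w

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                 # two instances, two features
print(predict_linear(X, w0=0.5, w=np.array([0.1, -0.2])))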

8
Q

Nonlinear Regression

A

These can include interaction terms and polynomial terms. Ex. \hat{y}_i = w_0 + w_1 \cdot x_i + w_2 \cdot x_i^2
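
A minimal sketch of fitting the quadratic example: the model is still linear in the weights, so ordinary least squares applies (np.polyfit is used here purely for illustration, and the data are made up):

import numpy as np

x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + np.random.normal(scale=0.1, size=x.shape)

# np.polyfit fits y = w0 + w1*x + w2*x^2 by least squares
# and returns the coefficients highest degree first: [w2, w1, w0].
w2, w1, w0 = np.polyfit(x, y, deg=2)
print(w0, w1, w2)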

9
Q

Underfitting

A

When the model is too simple to capture the pattern in the data, so it does not predict even the training data well.

10
Q

Overfitting

A

When the model fits the training data relatively well, but fails to generalise to unseen data.

11
Q

Mean of Squared Errors

A

S(w) = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2. Linear regression models are trained to minimise the mean of squared errors.
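
A one-line NumPy sketch of this cost (the example values are made up):

import numpy as np

def mean_squared_error(y, y_hat):
    """S(w) = (1/m) * sum_i (y_i - y_hat_i)^2"""
    return np.mean((y - y_hat) ** 2)

print(mean_squared_error(np.array([1.0, 2.0, 3.0]),
                         np.array([1.1, 1.9, 3.2])))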

12
Q

Gradient Descent

A

Initialise weights to 0 or to random values.

Until convergence is achieved:
for i \in {1,…,m}
for j \in {1,…,n}
w_j \leftarrow w_j + \alpha (y_i - \hat{y}_i) x_{i,j}

Termination criterion: \left\lvert S(w^k) - S(w^{k+1}) \right\rvert < \epsilon
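
A minimal sketch of this procedure for linear regression, assuming X already contains a leading column of ones for the intercept (the step size and tolerance are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.01, eps=1e-6, max_epochs=10_000):
    """Repeat the per-example weight updates until S(w) stops improving."""
    w = np.zeros(X.shape[1])                       # initialise weights to 0
    prev_cost = np.inf
    for _ in range(max_epochs):                    # until convergence is achieved
        for i in range(len(y)):                    # i in {1, ..., m}
            y_hat_i = X[i] @ w
            w += alpha * (y[i] - y_hat_i) * X[i]   # w_j += alpha * (y_i - y_hat_i) * x_{i,j} for all j
        cost = np.mean((y - X @ w) ** 2)           # S(w)
        if abs(prev_cost - cost) < eps:            # |S(w^k) - S(w^{k+1})| < eps
            break
        prev_cost = cost
    return w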

13
Q

Single-scan/On-line Algorithm

A

for i \in {1,…,m}:
repeat:
for j \in {1,…,n}:
w_j \leftarrow w_j + \alpha (y_i - \hat{y}_i) x_{i,j}
until S(w) isn’t significantly changed

This method updates the weights after each individual example. It is also known as online approximation or stochastic gradient descent.
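
A minimal sketch of the single-scan variant, again assuming a leading column of ones in X (the step size and tolerance are illustrative):

import numpy as np

def online_update(X, y, alpha=0.01, eps=1e-6, max_inner=100):
    """Visit each example once and refine the weights on that example alone."""
    w = np.zeros(X.shape[1])
    for i in range(len(y)):                        # single scan over the examples
        for _ in range(max_inner):                 # repeat ...
            cost_before = np.mean((y - X @ w) ** 2)
            w += alpha * (y[i] - X[i] @ w) * X[i]  # update all w_j on example i
            if abs(cost_before - np.mean((y - X @ w) ** 2)) < eps:
                break                              # ... until S(w) barely changes
    return w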

14
Q

Logistic Regression

A

This method assumes binary classification. \hat{y} = \frac{1}{1 + \exp[-(w_0 + w_1 x_1 + … + w_n x_n)]}. If \hat{y} \le 0.5, predict 0; if \hat{y} > 0.5, predict 1. Equivalently, if w^T x > 0, return 1; if w^T x \le 0, return 0.
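
A minimal sketch of this prediction (the weights and input below are made up for illustration):

import numpy as np

def predict_logistic(x, w0, w):
    """Sigmoid of the linear score, thresholded at 0.5 for the class label."""
    y_hat = 1.0 / (1.0 + np.exp(-(w0 + x @ w)))
    return 1 if y_hat > 0.5 else 0

print(predict_logistic(np.array([0.5, -1.2]), w0=0.1, w=np.array([2.0, 0.3])))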

15
Q

Perceptrons

A

This method learns a hyperplane separating two classes. Perceptrons form the building blocks of neural networks, such as single-layer feed-forward neural networks. They use the perceptron learning rule.

16
Q

Squared Error

A

The perceptron learning rule uses this metric. squared_error = \sum_{i=1}^m (y_i - \hat{y}_i)^2

17
Q

Perceptron Learning Rule

A

It utilises gradient descent. For each i \in {1,…,m} and j \in {1,…,n}, w_j \leftarrow w_j + \alpha (y_i - \hat{y}_i) x_{i,j}
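
A minimal perceptron training sketch built on this rule, assuming a step-function prediction and a bias input folded into the weights (the learning rate and epoch count are illustrative):

import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=20):
    """Apply w_j <- w_j + alpha * (y_i - y_hat_i) * x_{i,j} over repeated passes."""
    Xb = np.hstack([np.ones((len(y), 1)), X])      # prepend a constant bias input of 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in range(len(y)):                    # i in {1, ..., m}
            y_hat_i = 1 if Xb[i] @ w > 0 else 0    # threshold activation
            w += alpha * (y[i] - y_hat_i) * Xb[i]  # update every w_j at once
    return w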

18
Q

Feed-Forward Multilayer Neural Network

A

These have inputs, hidden units, and outputs. Activations flow in only one direction, from the inputs towards the outputs.

19
Q

Neural Network Unit

A

Each node has input links and output links; the values arriving on the input links are combined by an input function, passed through an activation function, and then through an output function to produce the node's output.

20
Q

Sigmoid Activation Function

A

f(x) = \frac{1}{1+e^{-x}}

21
Q

Hyperbolic Tangent (Tanh) Function

A

tanh(x) = \frac{2}{1 + e^{-2x}} - 1

22
Q

Rectified Linear Unit (ReLU)

A

f(x) = 0 for x < 0, x for x \ge 0
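
A minimal NumPy sketch of the three activation functions from the previous cards, applied elementwise:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)                     # 0 for x < 0, x for x >= 0

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))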

23
Q

Activation Functions

A

Ideal properties include nonlinearity (so the network can generalise well), differentiability (so weights can be updated during training), and monotonicity (for fast convergence).

24
Q

Neural Networks

A

Generally, there’s a specific cost function that is minimised when training neural networks.

Important factors to consider include:

  1. Number of layers
  2. Number of nodes per layer
  3. Number of incoming links per node
  4. Activation Function

Pros:
1. Great at nonlinear transformations of input.
2. Highly parameterised and can model even small function irregularities.
3. Even a small number of hidden layers can sufficiently model any continuous function.
4. Can be optimised to reduce overfitting.

Cons:
1. Explainability is challenging, making it difficult to infer causal relationships.
2. Computationally expensive.
3. Needs a very large dataset to work properly.

25
Q

Recurrent Network

A

This type of neural network feeds outputs back to its own inputs. Network activation levels form a dynamic system that may reach a steady state or show oscillations and potentially chaotic behaviour.

26
Q

Backpropagation

A

Given a specific weight w and cost function J, each weight is moved in the direction that decreases the cost:

w \leftarrow w + \Delta w = w - \eta \frac{\partial J}{\partial w}

where \eta is the learning rate and the partial derivatives are obtained by propagating the error backwards from the output layer.
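
A minimal sketch of this update for a tiny one-hidden-layer network with sigmoid activations and a squared-error cost J (the shapes, learning rate, and initialisation are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, W1, W2, eta=0.1):
    """One step of w <- w - eta * dJ/dw for every weight in the network."""
    # Forward pass
    h = sigmoid(W1 @ x)                               # hidden activations
    y_hat = sigmoid(W2 @ h)                           # output activation
    # Backward pass for J = 0.5 * ||y_hat - y||^2
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)     # error signal at the output layer
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)   # error propagated back to the hidden layer
    # Gradient-descent updates
    W2 -= eta * np.outer(delta_out, h)                # dJ/dW2
    W1 -= eta * np.outer(delta_hidden, x)             # dJ/dW1
    return W1, W2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([0.5, -0.2]), np.array([1.0]), W1, W2)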

27
Q

Lasso Regression / L1 Regularisation

A

It adds an L1 penalty, \lambda \sum_j \lvert w_j \rvert, to the cost function. Geometrically this corresponds to a diamond-shaped constraint region, which tends to drive the coefficients of irrelevant variables exactly to 0, eliminating them from the model.

28
Q

Ridge Regression / L2 Regularisation

A

It adds an L2 penalty, \lambda \sum_j w_j^2, to the cost function. Geometrically this corresponds to a disk-shaped constraint region, which shrinks the coefficients of irrelevant predictors towards 0, although they never become exactly 0.