Text Classification 2: Deep Neural Networks Flashcards

Lecture 4 (18 cards)

1
Q

How do we decide how to change the parameters in our model? Should we move them left or right?

A

It is based on the loss function. We want to minimize the loss function (a measure of the difference between the ground truth and the predicted value).

2
Q

What is a loss function?

A

It is a function that takes the ground truth and the predicted value and outputs a value called the loss, which tells us how similar those two values are (how close the model is to predicting the correct answer).

3
Q

What are some loss functions used for binary classification?

A
  • we can use the absolute value of the difference between y and y_hat, but the derivative of this function is always either 1 or -1, which doesn’t tell us anything about how far off we are
  • we can use the logistic loss (binary cross-entropy loss). It works better than most other losses for binary classification (see formula)
4
Q

What is the logistic loss? Give formula

A

Logistic loss, or binary cross-entropy loss, is a loss function used for binary classification.
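
The standard formula (stated from general knowledge rather than copied from the slides), with ground-truth label y ∈ {0, 1} and predicted probability \hat{y}:

    L(y, \hat{y}) = -[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,]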

5
Q

What is Binary cross-entropy? Give formula

A

Logistic loss, or binary cross-entropy loss, is a loss function used for binary classification.

6
Q

What is (online) stochastic gradient descent? What are the steps? Write pseudo-code.

A

Given a model, input data, ground-truth data, and a loss function, we iterate over the training examples one at a time: for each example we compute the loss and update the parameters using the gradient (theta = theta - learning_rate * gradient of the loss w.r.t. theta).
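
A minimal pseudo-code sketch (my own illustrative code; the function names are assumptions, not the lecture's exact pseudocode):

    def online_sgd(theta, data, grad, learning_rate=0.1, epochs=1):
        # grad(theta, x, y) returns the gradient of the loss on one example
        for _ in range(epochs):
            for x, y in data:                      # one training example at a time
                g = grad(theta, x, y)              # gradient of the loss w.r.t. theta
                theta = theta - learning_rate * g  # step against the gradient
        return theta

    # Toy usage: squared loss on a one-parameter linear model y_hat = theta * x.
    grad = lambda theta, x, y: 2 * (theta * x - y) * x
    theta = online_sgd(0.0, [(1.0, 2.0), (2.0, 4.0)], grad, epochs=20)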

7
Q

What is mini-batch stochastic gradient descent? What are the steps? Write pseudo-code.

A

Similar to online SGD, but instead of computing the gradient and updating the parameters on each training example, we compute an average gradient over a small batch of size m and only then update. This is useful since the data can be noisy and the per-example gradients might jump all over the place, so the gradient might not converge smoothly; averaging smooths this out.

An advantage is that the per-example loss and gradient computations can be parallelized: compute the loss of each training example separately on individual threads and combine the results at the end (see the sketch below).
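
A minimal sketch (my own illustrative code, analogous to the online version above):

    def minibatch_sgd(theta, data, grad, learning_rate=0.1, batch_size=4, epochs=1):
        for _ in range(epochs):
            for i in range(0, len(data), batch_size):
                batch = data[i:i + batch_size]
                # Average the per-example gradients over the mini-batch;
                # these per-example computations are what can run in parallel.
                g = sum(grad(theta, x, y) for x, y in batch) / len(batch)
                theta = theta - learning_rate * g  # one update per batch
        return theta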

8
Q

What is the difference between online and minibatch gradient descent?

A

Online SGD is basically mini-batch SGD with batch size m = 1.

9
Q

What happens when we go from binary classification (using regression) to multi-class?

A
  • Instead of one output (a scalar), we get a vector of values indicating how confident the model is in predicting the n-th class
  • W is now a matrix, and is multiplied with the input vector
  • the bias b is also a vector
  • the loss is now the categorical cross-entropy (negative log likelihood); see the sketch below
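
Written out (my notation, not necessarily the slides'): with input \mathbf{x} \in \mathbb{R}^{d} and K classes,

    \hat{\mathbf{y}} = \mathrm{softmax}(W\mathbf{x} + \mathbf{b}), \qquad W \in \mathbb{R}^{K \times d}, \; \mathbf{b}, \hat{\mathbf{y}} \in \mathbb{R}^{K}
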
10
Q

What can we read from the weight matrix W in multi-class classification (regression)? Assume we are predicting a language based on some text.

A

If the rows are our input tokens (the vocabulary, in a bag-of-words representation) and the columns are the different languages, then each column is a representation of a language (the frequencies of the tokens/words for that language). Using the columns, we can cluster languages based on similarity (similar languages will have similar frequencies of certain tokens in their texts).

If we look at each row, we get a vector of how often that token occurs in each language.

11
Q

What is CBOW? Derive it

A

Continuous Bag of Words (CBOW) is a way to represent sentences, similar to the average bag of words (BoW).

We still represent tokens as one-hot vectors, sum them up and average them, but now we also multiply the result by an embedding matrix W that is trainable.

When deriving the last step: multiplying a one-hot vector with a matrix simply selects the i-th row of the matrix (where i is the position of the 1 and the rest are 0s).
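
In symbols (my own notation for the derivation): with one-hot token vectors x_1, ..., x_n and embedding matrix W,

    \mathrm{CBOW}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i^{\top} W = \frac{1}{n} \sum_{i=1}^{n} W_{[w_i]}

where W_{[w_i]} is the row of W for token w_i, i.e. its embedding.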

12
Q

What is softmax? What is its formula?

A

It is a function that takes a vector of numbers and transforms it into a probability distribution vector (the entries sum up to 1 and each lies in [0, 1]).
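
The standard formula, for a vector z with components z_i:

    \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}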

13
Q

What is temperature in softmax? How does it fit in the formula?

A

It is a way to smooth the distribution (high temperature) or make it more peaked/deterministic (low temperature); the logits are divided by the temperature T before applying softmax.
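
The usual way it enters the formula (standard definition, not copied from the slides):

    \mathrm{softmax}_T(\mathbf{z})_i = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}

With T > 1 the distribution becomes smoother/flatter; as T \to 0 it approaches a one-hot (deterministic) choice.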

14
Q

What is negative log likelihood? What is its formula?

A

Negative log likelihood, or categorical cross-entropy loss, is a loss function for multi-class classification; it is the more general case of binary cross-entropy (which is the special case with 2 categories).
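
A standard formulation (my notation): for a one-hot ground truth y and a predicted distribution \hat{y} over K classes,

    L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c

where c is the index of the correct class.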

15
Q

What is categorical cross-entropy? What is its formula?

A

Negative log likelihood, or categorical cross-entropy loss, is a loss function for multi-class classification; it is the more general case of binary cross-entropy (which is the special case with 2 categories).

16
Q

If we stack more linear transformations on top of each other, can the model become non-linear?

A

No, since a linear transformation of a linear transformation is still a linear transformation. We need some non-linear function in between them to make the model non-linear.
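
A one-line check (standard algebra, not from the slides): composing two linear (affine) layers collapses into a single one,

    W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)

which is again just one linear (affine) transformation.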

17
Q

How would a multi-class classification model look in the end?
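
A

(The answer text is missing from this card; the following is a sketch assembled from the surrounding cards.) In the end we stack linear layers with a non-linear function (e.g. ReLU) between them and put a softmax on top, trained with the categorical cross-entropy loss:

    \hat{\mathbf{y}} = \mathrm{softmax}\big(W_2 \, \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\big)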

18
Q

What is ReLU? What is its formula?

A

It is a non-linear function (the rectified linear unit) used in between two linear transformations (layers of a NN) to make the model non-linear.
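
Its formula (standard definition):

    \mathrm{ReLU}(x) = \max(0, x)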