Text Classification 2: Deep Neural Networks Flashcards

Lecture 4 (18 cards)

1
Q

How do we decide how to change the parameters in our model? Should we move them left or right?

A

It is based on the loss function. We want to minimize the loss function (a measure of the difference between the ground truth and the predicted value).

2
Q

What is a loss function?

A

It is a function that takes the ground truth and the predicted value and outputs a value called the loss, which tells us how similar those two values are (how close the model is to predicting the correct answer).

3
Q

What are some loss functions used for binary classification?

A
  • we can use the absolute value of the difference between y and y_hat, but the derivative of this function is always either 1 or -1, which doesn’t tell us anything about how far off we are
  • we can use the logistic loss (binary cross-entropy loss). It works better than most other losses for binary classification (see formula)
4
Q

What is the logistic loss? Give formula

A

Logistic loss, or binary cross-entropy loss, is a loss function used for binary classification.
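
The standard formula (stated from general knowledge rather than copied from the slides), with ground-truth label y ∈ {0, 1} and predicted probability \hat{y}:

    L(y, \hat{y}) = -[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,]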

5
Q

What is Binary cross-entropy? Give formula

A

Logistic loss, or binary cross-entropy loss, is a loss function used for binary classification.

6
Q

What is (online) stochastic gradient descent? What are the steps? Write pseudo-code.

A

Given a model, input data, ground-truth data, and a loss function, we iterate over the training examples one at a time: for each example we compute the loss and update the parameters using the gradient (theta = theta - learning_rate * gradient of the loss w.r.t. theta).
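
A minimal pseudo-code sketch (my own illustrative code; the function names are assumptions, not the lecture's exact pseudocode):

    def online_sgd(theta, data, grad, learning_rate=0.1, epochs=1):
        # grad(theta, x, y) returns the gradient of the loss on one example
        for _ in range(epochs):
            for x, y in data:                      # one training example at a time
                g = grad(theta, x, y)              # gradient of the loss w.r.t. theta
                theta = theta - learning_rate * g  # step against the gradient
        return theta

    # Toy usage: squared loss on a one-parameter linear model y_hat = theta * x.
    grad = lambda theta, x, y: 2 * (theta * x - y) * x
    theta = online_sgd(0.0, [(1.0, 2.0), (2.0, 4.0)], grad, epochs=20)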

7
Q

What is mini-batch stochastic gradient descent? What are the steps? Write pseudo-code.

A

Similar to online SGD, but instead of computing the gradient and updating the parameters on each training example, we compute an average gradient over a small batch of size m and only then update. This is useful since the data can be noisy and the per-example gradients might jump all over the place, so the gradient might not converge smoothly; averaging smooths this out.

An advantage is that the per-example loss and gradient computations can be parallelized: compute the loss of each training example separately on individual threads and combine the results at the end (see the sketch below).
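
A minimal sketch (my own illustrative code, analogous to the online version above):

    def minibatch_sgd(theta, data, grad, learning_rate=0.1, batch_size=4, epochs=1):
        for _ in range(epochs):
            for i in range(0, len(data), batch_size):
                batch = data[i:i + batch_size]
                # Average the per-example gradients over the mini-batch;
                # these per-example computations are what can run in parallel.
                g = sum(grad(theta, x, y) for x, y in batch) / len(batch)
                theta = theta - learning_rate * g  # one update per batch
        return theta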

8
Q

What is the difference between online and minibatch gradient descent?

A

Online SGD is basically mini-batch SGD with batch size m = 1.

9
Q

What happens when we go from binary classification (using regression) to multi-class?

A
  • Instead of one output (a scalar), we get a vector of values indicating how confident the model is in predicting the n-th class
  • W is now a matrix, and is multiplied with the input vector
  • the bias b is also a vector
  • the loss is now the categorical cross-entropy (negative log likelihood); see the sketch below
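
Written out (my notation, not necessarily the slides'): with input \mathbf{x} \in \mathbb{R}^{d} and K classes,

    \hat{\mathbf{y}} = \mathrm{softmax}(W\mathbf{x} + \mathbf{b}), \qquad W \in \mathbb{R}^{K \times d}, \; \mathbf{b}, \hat{\mathbf{y}} \in \mathbb{R}^{K}
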
10
Q

What can we read from the weight matrix W in multi-class classification (regression)? Assume we are predicting a language based on some text.

A

If the rows are our input tokens (the vocabulary, in a bag-of-words representation) and the columns are the different languages, then each column is a representation of a language (the frequencies of the tokens/words for that language). Using the columns, we can cluster languages based on similarity (similar languages will have similar frequencies of certain tokens in their texts).

If we look at each row, we get a vector of how often that token occurs in each language.

11
Q

What is CBOW? Derive it

A

Continuous Bag of Words (CBOW) is a way to represent sentences, similar to the average bag of words (BoW).

We still represent tokens as one-hot vectors, sum them up and average them, but now we also multiply the result by an embedding matrix W that is trainable.

When deriving the last step: multiplying a one-hot vector with a matrix simply selects the i-th row of the matrix (where i is the position of the 1 and the rest are 0s).
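
In symbols (my own notation for the derivation): with one-hot token vectors x_1, ..., x_n and embedding matrix W,

    \mathrm{CBOW}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i^{\top} W = \frac{1}{n} \sum_{i=1}^{n} W_{[w_i]}

where W_{[w_i]} is the row of W for token w_i, i.e. its embedding.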

12
Q

What is softmax? What is its formula?

A

It is a function that takes a vector of numbers and transforms it into a probability distribution vector (the entries sum up to 1 and each lies in [0, 1]).
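
The standard formula, for a vector z with components z_i:

    \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}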

13
Q

What is temperature in softmax? How does it fit in the formula?

A

It is a way to smooth the distribution (high temperature) or make it more peaked/deterministic (low temperature); the logits are divided by the temperature T before applying softmax.
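
The usual way it enters the formula (standard definition, not copied from the slides):

    \mathrm{softmax}_T(\mathbf{z})_i = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}

With T > 1 the distribution becomes smoother/flatter; as T \to 0 it approaches a one-hot (deterministic) choice.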

14
Q

What is negative log likelihood? What is its formula?

A

Negative log likelihood, or categorical cross-entropy loss, is a loss function for multi-class classification; it is the more general case of binary cross-entropy (which is the special case with 2 categories).
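
A standard formulation (my notation): for a one-hot ground truth y and a predicted distribution \hat{y} over K classes,

    L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c

where c is the index of the correct class.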

15
Q

What is categorical cross-entropy? What is its formula?

A

Negative log likelihood, or categorical cross-entropy loss, is a loss function for multi-class classification; it is the more general case of binary cross-entropy (which is the special case with 2 categories).

16
Q

If we stack more linear transformations on top of each other, can the model become non-linear?

A

No, since a linear transformation of a linear transformation is still a linear transformation. We need some non-linear function in between them to make the model non-linear.
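
A one-line check (standard algebra, not from the slides): composing two linear (affine) layers collapses into a single one,

    W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)

which is again just one linear (affine) transformation.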

17
Q

How would a multi-class classification model look in the end?
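
A

(The answer text is missing from this card; the following is a sketch assembled from the surrounding cards.) In the end we stack linear layers with a non-linear function (e.g. ReLU) between them and put a softmax on top, trained with the categorical cross-entropy loss:

    \hat{\mathbf{y}} = \mathrm{softmax}\big(W_2 \, \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\big)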

18
Q

What is ReLU? What is its formula?

A

It is a non-linear function (the rectified linear unit) used in between two linear transformations (layers of a NN) to make the model non-linear.
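
Its formula (standard definition):

    \mathrm{ReLU}(x) = \max(0, x)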