Text classification 1: Log-linear models Flashcards

Lecture 3 (17 cards)

1
Q

Compute the gradient of this function:

A
2
Q

What are the challenges of dealing with human language?

A
  • It is ambiguous ("I ate pizza with friends" - did we eat together, or are the friends pizza toppings?)
  • Different sentences can have the same meaning
  • Language is discrete: it consists of letters, letters form words, etc. The word "pizza" doesn’t mean anything to a machine, whereas we can visualize the object when we read it
  • We can’t easily do math to go from one word to another word (unlike with images, where we can transform pixels)
  • Words might have different meanings depending on the surrounding context
3
Q

What is the inductive bias?

A

It is a restriction of our hypothesis space: a set of assumptions we make about the desired solution

4
Q

What is tokenization?

A

It is the process of splitting text into a sequence of tokens (words/subwords/characters etc.) according to some algorithm (whitespace splitting, BPE, …)
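
A minimal sketch of the simplest case, whitespace tokenization (the example sentence is made up; real systems usually use a learned subword algorithm such as BPE):

```python
# A minimal whitespace-tokenization sketch; real systems typically use a
# learned subword algorithm such as BPE instead of a plain split.
text = "I ate pizza with friends"
tokens = text.split()
print(tokens)  # ['I', 'ate', 'pizza', 'with', 'friends']
```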

5
Q

What is a vocabulary?

A

The set of all possible tokens/words that can appear in the text

6
Q

What is one-hot encoding?

A

It is a way to encode tokens/words as numerical vectors. Each vector has the size of the vocabulary and is all zeros, except for a 1 at the index of that token in the vocabulary
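
A minimal sketch in plain Python, using a made-up toy vocabulary:

```python
# A minimal one-hot encoding sketch; the toy vocabulary is illustrative.
vocab = ["i", "ate", "pizza", "with", "friends"]

def one_hot(token, vocab):
    # Zero vector of vocabulary size with a 1 at the token's index.
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

print(one_hot("pizza", vocab))  # [0, 0, 1, 0, 0]
```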

7
Q

What are some problems with one-hot encoding?

A
  • very sparse
  • no way to contextualize based on the surrounding words
  • no info about word order
  • no word similarity
8
Q

How to get the one-hot encoding of the whole sentence? What are problems?

A

By using (averaged) Bag of Words: combine all one-hot vectors into one by summing them, then divide by the number of tokens in the sentence (see the sketch after the list below).

  • vector doesn’t change if we change word order
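
A minimal sketch of the averaged Bag of Words, assuming the same toy vocabulary and sentence as in the one-hot card:

```python
import numpy as np

# A minimal averaged bag-of-words sketch; vocabulary and sentence are
# made up for illustration.
vocab = ["i", "ate", "pizza", "with", "friends"]
tokens = ["i", "ate", "pizza"]

# One-hot vector per token, summed and divided by the number of tokens.
one_hots = np.eye(len(vocab))[[vocab.index(t) for t in tokens]]
avg_bow = one_hots.sum(axis=0) / len(tokens)

print(avg_bow)  # [0.33 0.33 0.33 0. 0.] - the same for any word order
```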
9
Q

How can we measure how close two one-hot vectors are?

A

Using Hamming distance, which simply counts in how many positions the two vectors differ. The distance between two distinct one-hot vectors is always 2
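
A minimal sketch with two toy one-hot vectors:

```python
# A minimal Hamming-distance sketch for equal-length vectors.
def hamming(a, b):
    # Count the positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

# Two distinct one-hot vectors always differ in exactly two positions.
print(hamming([1, 0, 0], [0, 0, 1]))  # 2
```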

10
Q

What is OOV?

A

Out-of-vocabulary token - a token that is not in the vocabulary

11
Q

How to deal with OOVs?

A

Replace them with a special UNK (unknown) token
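
A minimal sketch; the vocabulary, sentence, and the "<UNK>" spelling of the special token are illustrative assumptions:

```python
# A minimal OOV-handling sketch: map anything outside the vocabulary to UNK.
vocab = {"i", "ate", "pizza", "<UNK>"}
tokens = ["i", "ate", "sushi"]
mapped = [t if t in vocab else "<UNK>" for t in tokens]
print(mapped)  # ['i', 'ate', '<UNK>']
```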

12
Q

How to determine the size of the vocabulary?

A

Using Zipf’s law, which tells us that some words are much more frequent than others, so word frequencies form a long-tailed distribution.

We can determine the cutoff point based on this
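
One common way to apply this is a frequency threshold; a minimal sketch with a made-up corpus and cutoff:

```python
from collections import Counter

# A minimal frequency-cutoff sketch; the corpus and the min_count
# threshold are illustrative.
corpus = ["pizza", "pizza", "pizza", "friends", "friends", "anchovy"]
min_count = 2

counts = Counter(corpus)
vocab = {tok for tok, c in counts.items() if c >= min_count}
print(vocab)  # {'pizza', 'friends'}
```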

13
Q

Explain Byte-pair encoding

A
  • start with individual characters as tokens
  • merge the most frequent pair of adjacent tokens and add the merged pair to the vocabulary as a new token
  • repeat for a fixed number of merges (see the sketch below)
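
A minimal sketch of the merge loop, assuming a toy corpus of symbol sequences with made-up counts (a real implementation would also record the learned merges so they can be applied to new text):

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with an illustrative count.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

def most_frequent_pair(corpus):
    # Count every adjacent pair of symbols, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Replace every occurrence of the chosen pair with one merged token.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # new token joins the vocab
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # repeat for a fixed number of merges
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)
```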
14
Q

Explain SentencePiece

A

The most widely used encoding nowadays.

It is a variant of BPE that replaces spaces with an underscore character and treats it like any other character. In plain BPE we have to keep track of whitespace separately in order to reconstruct the sentence, while SentencePiece does this automatically by simply concatenating the tokens.
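
A tiny illustration of the whitespace handling; the subword segmentation shown is hypothetical:

```python
# SentencePiece-style whitespace trick: spaces become a normal symbol,
# so detokenization is plain concatenation.
text = "I ate pizza"
marked = text.replace(" ", "▁")
print(marked)                              # I▁ate▁pizza
tokens = ["I", "▁ate", "▁pi", "zza"]       # hypothetical subword split
print("".join(tokens).replace("▁", " "))   # I ate pizza
```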

15
Q

How do we squeeze numbers from (-∞, ∞) into (0, 1)?

A

Using the sigmoid function

16
Q

What is the sigmoid function? What is its formula? Where is it used?

A

A function that squeezes any real number into the range (0, 1): σ(x) = 1 / (1 + e^(-x)). It is used, for example, at the end of a binary linear classifier (logistic regression) so that the output lies between 0 and 1, which makes it easy to decide whether the predicted class is 0 or 1.
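
A minimal NumPy sketch of the formula above:

```python
import numpy as np

# Sigmoid: squashes any real number into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```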

17
Q

What would a simple binary classification task look like using one set of neurons (a single linear layer)?

A

x - input embedding vector (e.g. a BoW vector)
W - weight matrix, trainable
b - bias (shifts the regression line), trainable
y - ground truth label, compared with the predicted value ŷ = σ(Wx + b) to obtain the loss (see the sketch below)
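
A minimal NumPy sketch of this setup, ŷ = σ(Wx + b); the shapes, values, and the loss shown (binary cross-entropy) are illustrative, and the training step is not shown:

```python
import numpy as np

# Single-layer binary classifier: y_hat = sigmoid(W @ x + b).
rng = np.random.default_rng(0)
vocab_size = 5

x = np.array([1 / 3, 1 / 3, 1 / 3, 0.0, 0.0])  # e.g. an averaged BoW vector
W = rng.normal(size=(1, vocab_size))            # trainable weight matrix
b = np.zeros(1)                                 # trainable bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat = sigmoid(W @ x + b)   # predicted probability of class 1
y = 1                        # ground-truth label
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy
print(y_hat, loss)
```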