Text classification 1: Log-linear models Flashcards

Lecture 3 (17 cards)

1
Q

Compute the gradient of this function:

A
2
Q

What are the challenges of dealing with human language?

A
  • It is ambiguous ("I ate pizza with friends" - did we eat together, or are the friends pizza toppings?)
  • Different sentences can have the same meaning
  • Language is discrete: it consists of letters, letters form words, etc. The word "pizza" doesn’t mean anything to a machine, whereas we can visualize the object when we read it
  • We can’t easily do math to go from one word to another word (unlike with images, where we can transform pixels)
  • Words might have different meanings depending on the surrounding context
3
Q

What is the inductive bias?

A

It is a restriction of our hypothesis space: a set of assumptions we make about the desired solution

4
Q

What is tokenization?

A

It is the process of splitting text into a sequence of tokens (words/subwords/characters etc.) according to some algorithm (whitespace splitting, BPE, …)
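
A minimal sketch of the simplest case, whitespace tokenization (the example sentence is made up; real systems usually use a learned subword algorithm such as BPE):

```python
# A minimal whitespace-tokenization sketch; real systems typically use a
# learned subword algorithm such as BPE instead of a plain split.
text = "I ate pizza with friends"
tokens = text.split()
print(tokens)  # ['I', 'ate', 'pizza', 'with', 'friends']
```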

5
Q

What is a vocabulary?

A

The set of all possible tokens/words that can appear in the text

6
Q

What is one-hot encoding?

A

It is a way to encode tokens/words as numerical vectors. Each vector has the size of the vocabulary and is all zeros, except for a 1 at the index of that token in the vocabulary
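
A minimal sketch in plain Python, using a made-up toy vocabulary:

```python
# A minimal one-hot encoding sketch; the toy vocabulary is illustrative.
vocab = ["i", "ate", "pizza", "with", "friends"]

def one_hot(token, vocab):
    # Zero vector of vocabulary size with a 1 at the token's index.
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

print(one_hot("pizza", vocab))  # [0, 0, 1, 0, 0]
```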

7
Q

What are some problems with one-hot encoding?

A
  • very sparse
  • no way to contextualize based on the surrounding words
  • no info about word order
  • no word similarity
8
Q

How to get the one-hot encoding of the whole sentence? What are problems?

A

By using (averaged) Bag of Words: combine all one-hot vectors into one by summing them, then divide by the number of tokens in the sentence (see the sketch after the list below).

  • vector doesn’t change if we change word order
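
A minimal sketch of the averaged Bag of Words, assuming the same toy vocabulary and sentence as in the one-hot card:

```python
import numpy as np

# A minimal averaged bag-of-words sketch; vocabulary and sentence are
# made up for illustration.
vocab = ["i", "ate", "pizza", "with", "friends"]
tokens = ["i", "ate", "pizza"]

# One-hot vector per token, summed and divided by the number of tokens.
one_hots = np.eye(len(vocab))[[vocab.index(t) for t in tokens]]
avg_bow = one_hots.sum(axis=0) / len(tokens)

print(avg_bow)  # [0.33 0.33 0.33 0. 0.] - the same for any word order
```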
9
Q

How can we measure how close two one-hot vectors are?

A

Using Hamming distance, which simply counts in how many positions the two vectors differ. The distance between two distinct one-hot vectors is always 2
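
A minimal sketch with two toy one-hot vectors:

```python
# A minimal Hamming-distance sketch for equal-length vectors.
def hamming(a, b):
    # Count the positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

# Two distinct one-hot vectors always differ in exactly two positions.
print(hamming([1, 0, 0], [0, 0, 1]))  # 2
```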

10
Q

What is OOV?

A

Out-of-vocabulary token - a token that is not in the vocabulary

11
Q

How to deal with OOVs?

A

Replace them with a special UNK (unknown) token
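
A minimal sketch; the vocabulary, sentence, and the "<UNK>" spelling of the special token are illustrative assumptions:

```python
# A minimal OOV-handling sketch: map anything outside the vocabulary to UNK.
vocab = {"i", "ate", "pizza", "<UNK>"}
tokens = ["i", "ate", "sushi"]
mapped = [t if t in vocab else "<UNK>" for t in tokens]
print(mapped)  # ['i', 'ate', '<UNK>']
```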

12
Q

How to determine the size of the vocabulary?

A

Using Zipf’s law, which tells us that some words are much more frequent than others, so word frequencies form a long-tailed distribution.

We can determine the cutoff point based on this
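
One common way to apply this is a frequency threshold; a minimal sketch with a made-up corpus and cutoff:

```python
from collections import Counter

# A minimal frequency-cutoff sketch; the corpus and the min_count
# threshold are illustrative.
corpus = ["pizza", "pizza", "pizza", "friends", "friends", "anchovy"]
min_count = 2

counts = Counter(corpus)
vocab = {tok for tok, c in counts.items() if c >= min_count}
print(vocab)  # {'pizza', 'friends'}
```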

13
Q

Explain Byte-pair encoding

A
  • start with individual characters as tokens
  • merge the most frequent pair of adjacent tokens and add the merged pair to the vocabulary as a new token
  • repeat for a fixed number of merges (see the sketch below)
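
A minimal sketch of the merge loop, assuming a toy corpus of symbol sequences with made-up counts (a real implementation would also record the learned merges so they can be applied to new text):

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with an illustrative count.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

def most_frequent_pair(corpus):
    # Count every adjacent pair of symbols, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Replace every occurrence of the chosen pair with one merged token.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # new token joins the vocab
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # repeat for a fixed number of merges
    corpus = merge(corpus, most_frequent_pair(corpus))
print(corpus)
```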
14
Q

Explain SentencePiece

A

The most widely used encoding nowadays.

It is a variant of BPE that replaces spaces with an underscore character and treats it like any other character. In plain BPE we have to keep track of whitespace separately in order to reconstruct the sentence, while SentencePiece does this automatically by simply concatenating the tokens.
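
A tiny illustration of the whitespace handling; the subword segmentation shown is hypothetical:

```python
# SentencePiece-style whitespace trick: spaces become a normal symbol,
# so detokenization is plain concatenation.
text = "I ate pizza"
marked = text.replace(" ", "▁")
print(marked)                              # I▁ate▁pizza
tokens = ["I", "▁ate", "▁pi", "zza"]       # hypothetical subword split
print("".join(tokens).replace("▁", " "))   # I ate pizza
```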

15
Q

How do we squeeze numbers from (-∞, ∞) into (0, 1)?

A

Using the sigmoid function

16
Q

What is the sigmoid function? What is its formula? Where is it used?

A

A function that squeezes any real number into the range (0, 1): σ(x) = 1 / (1 + e^(-x)). It is used, for example, at the end of a binary linear classifier (logistic regression) so that the output lies between 0 and 1, which makes it easy to decide whether the predicted class is 0 or 1.
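
A minimal NumPy sketch of the formula above:

```python
import numpy as np

# Sigmoid: squashes any real number into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```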

17
Q

What would a simple binary classification task look like using one set of neurons (a single linear layer)?

A

x - input embedding vector (e.g. a BoW vector)
W - weight matrix, trainable
b - bias (shifts the regression line), trainable
y - ground truth label, compared with the predicted value ŷ = σ(Wx + b) to obtain the loss (see the sketch below)
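
A minimal NumPy sketch of this setup, ŷ = σ(Wx + b); the shapes, values, and the loss shown (binary cross-entropy) are illustrative, and the training step is not shown:

```python
import numpy as np

# Single-layer binary classifier: y_hat = sigmoid(W @ x + b).
rng = np.random.default_rng(0)
vocab_size = 5

x = np.array([1 / 3, 1 / 3, 1 / 3, 0.0, 0.0])  # e.g. an averaged BoW vector
W = rng.normal(size=(1, vocab_size))            # trainable weight matrix
b = np.zeros(1)                                 # trainable bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat = sigmoid(W @ x + b)   # predicted probability of class 1
y = 1                        # ground-truth label
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy
print(y_hat, loss)
```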