NLP_DL (Oxford) Flashcards

Flashcards in the NLP_DL (Oxford) deck (21):
1

Distributional hypothesis

Words that occur in the same contexts tend to have similar meanings.

2

Negative sampling

- from the word2vec paper (Mikolov et al., 2013):
instead of summing over the whole vocabulary, the softmax denominator is approximated using only a small random sample of 'negative' contexts, distinct from the positive context in the numerator (see the sketch below)
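A minimal NumPy sketch of the per-pair objective as it appears in the word2vec paper, where each positive (target, context) pair is scored against k sampled negatives with a logistic loss. Function and variable names are illustrative, not from the course.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, neg_vecs):
    """Negative-sampling loss for one (target, context) pair.

    target_vec : (d,) embedding of the target word
    context_vec: (d,) embedding of the observed (positive) context word
    neg_vecs   : (k, d) embeddings of k randomly sampled 'negative' contexts
    """
    pos_score = sigmoid(target_vec @ context_vec)    # pushed towards 1 for the true context
    neg_scores = sigmoid(-neg_vecs @ target_vec)     # pushed towards 1 for the sampled negatives
    return -(np.log(pos_score) + np.sum(np.log(neg_scores)))

rng = np.random.default_rng(0)
loss = negative_sampling_loss(rng.normal(size=50), rng.normal(size=50), rng.normal(size=(5, 50)))
```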

3

Being 'grounded in the task'

= what 'meaning' amounts to here: in the context of the course, a word's meaning is captured by task-specific features rather than by an abstract definition

4

CBoW steps (Mikolov, 2013)

- embed the context words as vectors
- sum the vectors
- project the result back to vocabulary size
- apply softmax (a sketch follows below)
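A minimal NumPy sketch of these four steps, with illustrative sizes and randomly initialized (untrained) weights:

```python
import numpy as np

# Hypothetical dimensions: vocabulary of V words, embeddings of size d.
V, d = 10_000, 128
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))   # input embedding matrix
W = rng.normal(scale=0.1, size=(d, V))   # projection back to vocabulary size

def cbow_forward(context_ids):
    """CBoW forward pass: embed context words, sum, project, softmax."""
    h = E[context_ids].sum(axis=0)                  # embed and sum the context vectors
    scores = h @ W                                  # project the sum to vocabulary size
    scores -= scores.max()                          # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    return probs                                    # predicted distribution of the centre word

p = cbow_forward(np.array([12, 845, 3021, 77]))     # IDs of the surrounding context words
```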

5

n-gram model ~ k-th order Markov model

- only the immediate history counts
- each word is conditioned only on the previous k words, so an n-gram model corresponds to k = n - 1 (see the formula below)
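Written out (with start-of-sequence padding for the first few positions), the Markov approximation with history length k is:

```latex
P(w_1, \dots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-k}, \dots, w_{i-1})
```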

6

linear interpolation of probability of a 3-gram

a linear combination of the trigram, bigram, and unigram estimates, with non-negative coefficients that sum to 1
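As a formula, writing P_ML for the maximum-likelihood n-gram estimates:

```latex
P_{\text{interp}}(w_3 \mid w_1, w_2) =
  \lambda_3\, P_{\text{ML}}(w_3 \mid w_1, w_2) + \lambda_2\, P_{\text{ML}}(w_3 \mid w_2) + \lambda_1\, P_{\text{ML}}(w_3),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\; \lambda_i \ge 0
```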

7

training objective for feedforward NN

- cross-entropy of the data under the model:
-(1/N) sum_n cost(w_n, p_n)
where the per-example cost is the log-probability assigned to the true word:
cost(a, b) = a^T log b
(a: one-hot vector of the true word w_n; b: the model's predicted distribution p_n)
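A small NumPy sketch of this objective, assuming one-hot targets (array names are illustrative):

```python
import numpy as np

def cross_entropy_loss(one_hot_targets, predicted_probs):
    """-(1/N) * sum_n cost(w_n, p_n), with cost(a, b) = a^T log b.

    one_hot_targets : (N, V) one-hot rows for the true next words
    predicted_probs : (N, V) softmax outputs of the network
    """
    log_probs = np.log(predicted_probs + 1e-12)                      # small constant for numerical safety
    per_example_cost = np.sum(one_hot_targets * log_probs, axis=1)   # a^T log b for each example
    return -per_example_cost.mean()
```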

8

NNLM comparison with N-Gram LM - good

- good generalization to unseen n-grams, but poorer performance on seen ones (a fix is to add direct n-gram features)
- smaller memory footprint

9

NNLM comparison with N-Gram LM - bad

- the number of parameters scales with the n-gram size
- doesn't take word frequencies into account
- the length of the dependencies captured is limited by the n-gram size

10

RNNLM comparison with N-Gram LM - good

- can represent unbounded dependencies
- compresses the history into a fixed-size vector
- the number of parameters grows only with the amount of information stored in the hidden layer

11

RNNLM comparison with N-Gram LM - bad

- difficult to learn
- memory and computational complexity increase quadratically with the size of the hidden layer
- doesn't take word frequencies into account

12

Statistical text classification (STC) - key questions

Two-step process for computing P(c|d):
- represent the text: how do we represent d?
- classify the document from that representation: how do we compute P(c|d)?

13

STC - generative models

- joint model P(c, d)
- model the distribution of individual classes
- place probabilities over both the hidden variables (classes) and the observed data

14

STC - discriminative models

- conditional model P(c|d)
- learn boundaries between classes
- treat the data as given and place probabilities only over the hidden variables (classes)

15

STC - Naive Bayes: pros and cons

Pros:
- simple
- fast
- uses BOW representation
- interpretable
Cons:
- the feature-independence assumption is too strong
- document structure and semantics are ignored
- requires smoothing to deal with zero probabilities

16

STC - logistic regression: pros and cons

Pros:
- interpretable
- relatively simple
- no independence assumptions between features
Cons:
- harder to learn
- relies on manually designed features
- harder to generalize well because of the hand-crafted features

17

RNNLM text representation

- the RNNLM is agnostic to the choice of recurrent function
- it reads input x_i, accumulates state h_i, and predicts output y_i
- for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, so it contains information about all of the text up to time-step i
- thus h_n is a representation of the whole input document and can be used as the data d (X = h_n) in logistic regression or any other classifier (see the sketch below)
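A minimal sketch of this, assuming a simple tanh recurrence (the course is agnostic to the recurrent function; names such as W_xh and W_hh are illustrative):

```python
import numpy as np

V, d = 5_000, 64                              # vocabulary and hidden size (illustrative)
rng = np.random.default_rng(0)
E    = rng.normal(scale=0.1, size=(V, d))     # word embeddings x_i
W_xh = rng.normal(scale=0.1, size=(d, d))     # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d, d))     # hidden-to-hidden weights

def encode(word_ids):
    """Accumulate the RNN state over the document; h_i depends on x_{0:i} and h_{0:i-1}."""
    h = np.zeros(d)
    for i in word_ids:
        h = np.tanh(E[i] @ W_xh + h @ W_hh)
    return h                                   # h_n: the text representation of the document

h_n = encode([12, 7, 431, 9])                  # word IDs of a toy document
```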

18

RNNLM + Logistic regression steps

No RNN output layer y is needed!
- take the RNN state as input: X = h_n
- compute a score for each class c: f_c = sum_i(beta_ci * X_i)
- apply a nonlinearity: m_c = sigma(f_c)
- compute p(c|d) with a softmax over the class scores: p(c|d) = softmax(m)_c
- loss function: the cross-entropy between the estimated class distribution p(c|d) and the true distribution:
L_i = - sum_c(y_c * log p(c|d_i))
(a sketch follows below)
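A minimal NumPy sketch of this classifier head, taking a precomputed final state h_n as input (beta and the sigmoid nonlinearity follow the card; other names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(h_n, beta):
    """Logistic-regression head on the final RNN state X = h_n (no RNN output layer y)."""
    f = beta @ h_n                        # f_c = sum_i beta_ci * X_i
    m = sigmoid(f)                        # m_c = sigma(f_c)
    m = m - m.max()
    return np.exp(m) / np.exp(m).sum()    # p(c|d): softmax over the class scores

def example_loss(h_n, beta, y_one_hot):
    p = classify(h_n, beta)
    return -np.sum(y_one_hot * np.log(p + 1e-12))   # L_i = -sum_c y_c * log p(c|d_i)

rng = np.random.default_rng(0)
C, d = 3, 64                                        # classes and state size (illustrative)
p = classify(rng.normal(size=d), rng.normal(scale=0.1, size=(C, d)))
```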

19

RNN mechanics

It compresses the entire history into a fixed-length vector in order to capture 'long-range' correlations.

20

LSTM gates

- provide a way to optionally let information through
- are composed of a sigmoid neural-net layer and a pointwise multiplication operation
- the LSTM cell uses three such gates (see the sketch below):
- 'forget gate layer': what to discard from the cell state
- 'input gate layer': what new information to write to the cell state
- 'output gate layer': what part of the cell state to expose as output
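A minimal NumPy sketch of one LSTM step, with hypothetical parameter names (W_f, W_i, W_o, W_c and their biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, p):
    """One LSTM step. Each gate is a sigmoid layer followed by a pointwise multiplication."""
    z = np.concatenate([h_prev, x])              # gates look at [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate: what to keep of the cell state
    i = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate: what new information to write
    o = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate: what of the cell state to expose
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate cell values
    c = f * c_prev + i * c_tilde                 # update the cell state (pointwise)
    h = o * np.tanh(c)                           # new hidden state / output
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
params = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for g in "fioc"}
params.update({f"b_{g}": np.zeros(d_h) for g in "fioc"})
h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
```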

21

GRU gates vs LSTM

GRU changes:
- merges the 'input' and 'forget' gates into a single 'update gate'
- merges the cell state and the hidden state (see the sketch below)
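A corresponding GRU sketch, again with hypothetical parameter names (W_z, W_r, W_h and biases):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, p):
    """One GRU step: a single update gate replaces the LSTM input/forget pair,
    and there is no separate cell state (it is merged with the hidden state)."""
    v = np.concatenate([h_prev, x])
    z = sigmoid(p["W_z"] @ v + p["b_z"])                                     # update gate
    r = sigmoid(p["W_r"] @ v + p["b_r"])                                     # reset gate
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x]) + p["b_h"]) # candidate state
    return (1 - z) * h_prev + z * h_tilde                                    # merged hidden state

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
params = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for g in "zrh"}
params.update({f"b_{g}": np.zeros(d_h) for g in "zrh"})
h = gru_cell(rng.normal(size=d_in), np.zeros(d_h), params)
```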