NLP_DL (Oxford) Flashcards


Flashcards in NLP_DL (Oxford) Deck (21):

Distributional hypothesis

Words that occur in the same contexts tend to have similar meanings.


Negative sampling

- from the word2vec paper: instead of normalizing with a sum over the whole vocabulary, the softmax denominator is approximated using only a small random sample of 'negative' contexts (distinct from the observed context in the numerator)
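
A minimal numpy sketch of a skip-gram style negative-sampling loss, assuming one positive context and k sampled negatives (function and variable names are illustrative, and the sampler here is uniform rather than word2vec's unigram^0.75 distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(word_vec, true_ctx_vec, ctx_matrix, k=5):
    """Loss with k sampled negative contexts instead of a full softmax denominator.

    word_vec     : (D,)  embedding of the centre word
    true_ctx_vec : (D,)  embedding of the observed ('positive') context word
    ctx_matrix   : (V, D) all context embeddings, used only to draw negatives
    """
    # positive term: push the observed (word, context) pair together
    loss = -np.log(sigmoid(true_ctx_vec @ word_vec))
    # negative term: push k randomly sampled contexts away
    neg_ids = rng.integers(0, ctx_matrix.shape[0], size=k)
    for j in neg_ids:
        loss += -np.log(sigmoid(-ctx_matrix[j] @ word_vec))
    return loss

# toy usage
V, D = 100, 16
ctx = rng.normal(size=(V, D))
w = rng.normal(size=D)
print(neg_sampling_loss(w, ctx[3], ctx, k=5))
```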


Being 'grounded in the task'

= a word's 'meaning' is defined with respect to the task at hand, i.e. through task-specific features rather than in the abstract


CBoW steps (Mikolov, 2013)

- embed words to vectors
- sum vectors
- project the result back to vocabulary size
- apply softmax
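
A minimal numpy sketch of those four steps for a single forward pass (the matrix names W_in / W_out and sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 64                          # vocabulary size, embedding size
W_in = rng.normal(size=(V, D)) * 0.01    # embedding matrix
W_out = rng.normal(size=(D, V)) * 0.01   # projection back to vocabulary size

def cbow_forward(context_ids):
    embedded = W_in[context_ids]          # 1) embed each context word
    summed = embedded.sum(axis=0)         # 2) sum the vectors
    scores = summed @ W_out               # 3) project back to vocabulary size
    exp = np.exp(scores - scores.max())   # 4) softmax over the vocabulary
    return exp / exp.sum()

probs = cbow_forward([5, 42, 7, 99])      # predict the centre word from its context
print(probs.shape, probs.sum())           # (1000,) 1.0
```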


n-gram model ~ (n-1)-th order Markov model

- only the immediate history counts (Markov assumption)
- the history is limited to the previous n-1 words
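
A count-based illustration of that assumption for n = 3 (the toy corpus and function name are made up for this sketch):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat sat on the hat".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w3, w1, w2):
    # P(w3 | w1, w2) ~ count(w1 w2 w3) / count(w1 w2): only the previous n-1 = 2 words matter
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("sat", "the", "cat"))   # 1.0 in this toy corpus
```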


linear interpolation of probability of a 3-gram

linear combination of the trigram, bigram, and unigram probabilities, with coefficients that sum to 1
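
A small sketch of the interpolated estimate (the lambda values below are illustrative, not prescribed):

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_interp(w3 | w1, w2) = l1*P(w3|w1,w2) + l2*P(w3|w2) + l3*P(w3)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # coefficients must sum to 1
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# an unseen trigram still gets probability mass from the bigram/unigram terms
print(interpolated_trigram(p_tri=0.0, p_bi=0.2, p_uni=0.05))   # 0.065
```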


training objective for feedforward NN

- cross-entropy of the data given the model:
F = -(1/N) sum_n cost(w_n, p_n)
where the cost is the log-probability of the true word under the predicted distribution:
cost(a, b) = a^T log b (a is the one-hot true word, b the predicted distribution)
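
A numpy sketch of that objective, with one-hot targets w_n and predicted distributions p_n (array names and the toy numbers are assumptions):

```python
import numpy as np

def cross_entropy(targets_onehot, predictions):
    """F = -(1/N) * sum_n cost(w_n, p_n), with cost(a, b) = a^T log b."""
    costs = np.sum(targets_onehot * np.log(predictions), axis=1)  # a^T log b per example
    return -costs.mean()

# toy example: 2 examples over a 3-word vocabulary
w = np.array([[0, 1, 0],
              [1, 0, 0]], dtype=float)          # true next words (one-hot)
p = np.array([[0.2, 0.7, 0.1],
              [0.5, 0.3, 0.2]])                 # model distributions
print(cross_entropy(w, p))                       # ~0.525
```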


NNLM comparison with N-Gram LM - good

- good generalization on unseen n-grams, but poorer on seen ones (solution: add direct n-gram features)
- smaller memory footprint


NNLM comparison with N-Gram LM - bad

- number of parameters scales with n-gram size
- doesn't take into account the frequencies of words
- the length of the dependencies captured is limited by the n-gram size


RNNLM comparison with N-Gram LM - good

- can represent unbounded dependencies
- compresses the history into a fixed-size vector
- the number of parameters grows only with the amount of information stored in the hidden layer


RNNLM comparison with N-Gram LM - bad

- difficult to train (gradients can vanish or explode over long sequences)
- memory and computation complexity increase quadratically with the size of the hidden layer
- doesn't take into account the frequencies of words


Statistical text classification (STC) - key questions

Two-step process to compute P(c|d):
- process text for representation: how to represent d
- classify the document using the text representation: how to calculate P(c|d)


STC - generative models

- joint model P(c, d)
- model the distribution of individual classes
- place probabilities over both hidden vars (classes) and observed data


STC - discriminative models

- conditional model P(c|d)
- learn boundaries between classes
- treating the data as given, place probabilities over the hidden variables (classes)


STC - Naive Bayes: pros and cons

- simple
- fast
- uses a BoW representation
- interpretable
- the feature-independence assumption is too strong
- document structure/semantics are ignored
- requires smoothing to deal with zero probabilities
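
A minimal sketch of BoW Naive Bayes with add-one (Laplace) smoothing, which handles the zero-probability issue mentioned above (the toy corpus and names are made up):

```python
import numpy as np
from collections import Counter

docs = [("good great fun", "pos"), ("bad boring bad", "neg"), ("great acting", "pos")]
vocab = sorted({w for text, _ in docs for w in text.split()})
classes = sorted({c for _, c in docs})

counts = {c: Counter() for c in classes}       # word counts per class
priors = Counter(c for _, c in docs)           # class counts
for text, c in docs:
    counts[c].update(text.split())

def predict(text, alpha=1.0):
    scores = {}
    for c in classes:
        total = sum(counts[c].values())
        # log P(c) + sum_w log P(w|c), with add-alpha smoothing so unseen words do not zero out the product
        score = np.log(priors[c] / len(docs))
        for w in text.split():
            score += np.log((counts[c][w] + alpha) / (total + alpha * len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("great fun"))    # 'pos'
print(predict("boring"))       # 'neg'
```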


STC - logistic regression: pros and cons

- interpretable
- relatively simple
- no assumption of independence between features
- harder to learn
- relies on manually designed features
- more difficult to generalize well because of the hand-crafted features


RNNLM text representation

- the RNNLM is agnostic to the choice of recurrent function
- it reads input x_i, accumulates state h_i, and predicts output y_i
- for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, so it contains information about all of the text up to time-step i
- thus h_n is a representation of the whole input document and can be used as the input (X = h_n) to logistic regression or any other classifier
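
A numpy sketch of reading a sequence and keeping the final state h_n as the document representation, assuming a plain tanh RNN cell (weight names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                            # embedding size, hidden size
W_xh = rng.normal(size=(D, H)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
b_h = np.zeros(H)

def encode(xs):
    """Run the RNN over the whole document and return h_n, the fixed-size representation."""
    h = np.zeros(H)
    for x in xs:                                     # x_i for i = 0..n
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)       # h_i depends on x_i and h_{i-1}
    return h

doc = rng.normal(size=(20, D))          # 20 embedded tokens
h_n = encode(doc)
print(h_n.shape)                        # (16,) -- usable as features X = h_n for a classifier
```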


RNNLM + Logistic regression steps

No RNN output layer y is needed!
- take the final RNN state as input: X = h_n
- compute per-class weights: f_c = sum_i(beta_ci * X_i)
- apply a nonlinearity: m_c = sigma(f_c)
- compute p(c|d) with a softmax over all classes m_{0:C}: p(c|d) = softmax(m)_c
- loss function: cross-entropy between the estimated class distribution p(c|d) and the true distribution:
L_i = -sum_c(y_c * log P(c|d_i))
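
A sketch of those steps on top of a final RNN state h_n (here a random vector stands in for the encoder output; the beta/bias names and tanh nonlinearity are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
H, C = 16, 3                            # hidden size, number of classes
beta = rng.normal(size=(H, C)) * 0.1    # class weights beta_ci
b = np.zeros(C)

def classify(h_n):
    f = h_n @ beta + b                  # f_c = sum_i beta_ci * X_i
    m = np.tanh(f)                      # nonlinearity m_c = sigma(f_c)
    exp = np.exp(m - m.max())
    return exp / exp.sum()              # p(c|d) via softmax over the classes

def loss(p_cd, true_class):
    y = np.zeros(C); y[true_class] = 1.0
    return -np.sum(y * np.log(p_cd))    # L_i = -sum_c y_c log P(c|d_i)

h_n = rng.normal(size=H)                # stand-in for the RNN document representation
p = classify(h_n)
print(p, loss(p, true_class=0))
```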


RNN mechanics

It compresses the entire history into a fixed-length vector in order to capture 'long-range' correlations.


LSTM gates

- provide a way to optionally let information through
- are composed of a sigmoid neural-net layer and a pointwise multiplication operation
- an LSTM cell uses three such gates:
- 'forget gate layer' (what to discard from the previous cell state)
- 'input gate layer' (what new input to write into the cell state)
- 'output gate layer' (what part of the cell state to expose as output)
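
A numpy sketch of a single LSTM step showing the three sigmoid gates and the pointwise updates (weight names and sizes are assumptions; biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16
# one weight matrix per gate plus one for the candidate cell update
W_f, W_i, W_o, W_c = (rng.normal(size=(D + H, H)) * 0.1 for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(z @ W_f)            # forget gate: what to drop from the cell state
    i = sigmoid(z @ W_i)            # input gate: what new information to write
    o = sigmoid(z @ W_o)            # output gate: what of the cell state to expose
    c_tilde = np.tanh(z @ W_c)      # candidate values
    c = f * c_prev + i * c_tilde    # gated, pointwise update of the cell state
    h = o * np.tanh(c)              # gated output / new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H))
print(h.shape, c.shape)             # (16,) (16,)
```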


GRU gates vs LSTM

GRU changes:
- merges 'input' and 'forget' gates into one 'update gate layer'
- merges the cell state and the hidden state into a single state vector
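
For comparison, a numpy sketch of one GRU step: a single update gate replaces the separate input/forget gates, and there is only one state vector (weight names are assumptions; biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16
W_z, W_r, W_h = (rng.normal(size=(D + H, H)) * 0.1 for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(np.concatenate([x, h_prev]) @ W_z)            # update gate (merged input + forget)
    r = sigmoid(np.concatenate([x, h_prev]) @ W_r)            # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h_prev]) @ W_h)  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # single state: no separate cell vector

h = gru_step(rng.normal(size=D), np.zeros(H))
print(h.shape)   # (16,)
```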