Flashcards in NLP_DL (Oxford) Deck (21):

1

## Distributional hypothesis

### words that occur in similar contexts tend to have similar meanings

2

## Negative sampling

###
- from the word2vec paper (Mikolov et al., 2013):

- in the softmax formula, the denominator (a sum over the whole vocabulary) is approximated using only a small random sample of 'negative' contexts, distinct from the observed context in the numerator (see the sketch below)
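
A minimal numpy sketch of the skip-gram negative-sampling objective, assuming one (center, context) pair and k sampled negatives; the vector names (`v_center`, `u_pos`, `u_negs`) are illustrative, not the course's notation.

```python
import numpy as np

def sgns_loss(v_center, u_pos, u_negs):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    v_center : embedding of the center word, shape (d,)
    u_pos    : output embedding of the observed 'positive' context, shape (d,)
    u_negs   : output embeddings of k sampled 'negative' contexts, shape (k, d)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # reward the true context ...
    pos_term = np.log(sigmoid(u_pos @ v_center))
    # ... and penalise the sampled negatives, instead of normalising
    # over the whole vocabulary
    neg_term = np.sum(np.log(sigmoid(-u_negs @ v_center)))
    return -(pos_term + neg_term)

# toy usage with random vectors
rng = np.random.default_rng(0)
d, k = 8, 5
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```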

3

## Being 'grounded in the task'

### = 'meaning' (of a word) in the context of the course's task-specific features

4

## CBoW steps (Mikolov, 2013)

###
- embed words to vectors

- sum vectors

- project the result back to vocabulary size

- apply softmax
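
A minimal numpy sketch of these CBoW steps on a toy vocabulary; the embedding matrix E, projection matrix W, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                     # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))      # word embedding matrix
W = rng.normal(size=(d, V))      # projection back to vocabulary size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_predict(context_ids):
    vectors = E[context_ids]     # 1. embed words to vectors
    h = vectors.sum(axis=0)      # 2. sum the vectors
    scores = h @ W               # 3. project back to vocabulary size
    return softmax(scores)       # 4. apply softmax

print(cbow_predict([2, 5, 7]))   # distribution over the V candidate target words
```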

5

## n-gram model ~ k-th order Markov model

###
- only the immediate history counts

- the history is limited to the previous k words (so an n-gram model corresponds to k = n-1; see the sketch below)
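
A minimal sketch of the resulting factorisation of a sentence probability, where the history passed to the model is truncated to the last n-1 words; `cond_logprob` is a hypothetical placeholder for the model's conditional estimates.

```python
import math

def sentence_logprob(words, cond_logprob, n=3):
    """Markov assumption: each word depends only on the previous n-1 words."""
    total = 0.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - (n - 1)):i])   # keep only the last n-1 words
        total += cond_logprob(w, history)               # log P(w | truncated history)
    return total

# toy usage: a dummy uniform model over a 1000-word vocabulary
print(sentence_logprob("the cat sat on the mat".split(),
                       lambda w, h: -math.log(1000)))
```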

6

## linear interpolation of probability of a 3-gram

### a linear combination of the trigram, bigram, and unigram estimates, with interpolation coefficients that sum to 1 (see the sketch below)
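
A minimal sketch of the interpolated estimate; the component probabilities and lambda values below are illustrative.

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1, w2) as a linear combination of trigram, bigram and unigram
    estimates; the coefficients must sum to 1 so the result stays a probability."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# toy example: the trigram was never observed, but the bigram and unigram were
print(interpolated_trigram(p_tri=0.0, p_bi=0.2, p_uni=0.05))   # 0.065
```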

7

## training objective for feedforward NN

###
- cross-entropy of the data given the model (see the sketch below):

F = -(1/N) sum_n cost(w_n, p_n)

with the cost being the log-probability of the true word under the predicted distribution (w_n is the one-hot target, p_n the model's output distribution):

cost(a, b) = a^T log b
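
A minimal numpy sketch of this objective, assuming one-hot targets and predicted distributions stacked as rows; the array names are illustrative.

```python
import numpy as np

def cross_entropy_objective(targets_onehot, predictions):
    """F = -(1/N) * sum_n cost(w_n, p_n), with cost(a, b) = a^T log b.

    targets_onehot : (N, V) one-hot rows for the true words
    predictions    : (N, V) model output distributions (rows sum to 1)
    """
    N = targets_onehot.shape[0]
    costs = np.sum(targets_onehot * np.log(predictions), axis=1)   # a^T log b per example
    return -costs.sum() / N

# toy example: N=2 examples over a vocabulary of V=3 words
w = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy_objective(w, p))   # (-log 0.7 - log 0.8) / 2
```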

8

## NNLM comparison with N-Gram LM - good

###
- better generalization on unseen n-grams, but poorer on seen ones (solution: also use direct n-gram features)

- smaller memory footprint

9

## NNLM comparison with N-Gram LM - bad

###
- number of parameters scales with n-gram size

- doesn't take into account the frequencies of words

- the length of the dependencies captured is limited (by the n-gram size)

10

## RNNLM comparison with N-Gram LM - good

###
- can represent unbounded dependencies

- compresses the history into a fixed-size vector

- the number of parameters grows only with the information stored in the hidden layer

11

## RNNLM comparison with N-Gram LM - bad

###
- difficult to learn

- memory and computation complexity increase quadratically with the size of the hidden layer

- doesn't take into account the frequencies of words

12

## Statistical text classification (STC) - key questions

###
A two-step process to calculate P(c|d):

- process text for representation: how to represent d

- classify the document using the text representation: how to calculate P(c|d)

13

## STC - generative models

###
- joint model P(c, d)

- model the distribution of individual classes

- place probabilities over both hidden vars (classes) and observed data

14

## STC - discriminative models

###
- conditional model P(c|d)

- learn boundaries between classes

- take the data as given and place probabilities only over the hidden variables

15

## STC - Naive Bayes: pros and cons

###
Pros:

- simple

- fast

- uses BOW representation

- interpretable

Cons:

- feature independence condition too strong

- document structure/semantics ignored

- requires smoothing to deal with zero probabilities (see the sketch below)
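
A minimal multinomial Naive Bayes sketch over bag-of-words counts with add-one (Laplace) smoothing; the toy corpus, labels, and function names are illustrative assumptions.

```python
import numpy as np

# toy bag-of-words counts: rows = documents, columns = vocabulary words
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 3]])
y = np.array([0, 0, 1])                 # illustrative class labels

def train_naive_bayes(X, y, alpha=1.0):
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # add-one (Laplace) smoothing avoids zero probabilities for unseen words
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_likelihood

def log_posterior(x, log_prior, log_likelihood):
    # log P(c|d) up to a constant: log P(c) + sum_w count_w * log P(w|c)
    return log_prior + log_likelihood @ x

log_prior, log_likelihood = train_naive_bayes(X, y)
print(log_posterior(np.array([1, 0, 2]), log_prior, log_likelihood).argmax())
```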

16

## STC - logistic regression: pros and cons

###
Pros:

- interpretable

- relatively simple

- no assumption of independence between features

Cons:

- harder to learn

- manually designed features

- more difficult to generalize well because of the hand-crafted features

17

## RNNLM text representation

###
- RNNLM is agnostic to the recurrent function

- it reads input x_i to accumulate state h_i, and predict output y_i

- for text representation, h_i is a function of x_{0:i} and h_{0:i-1}, meaning that it contains information about all of the text up to time-step i

- thus h_n is the representation of the whole input document and can be used as d (the data, X = h_n) in logistic regression or any other classifier (see the sketch below)
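
A minimal numpy sketch of the state accumulation with a simple tanh recurrence; since the RNNLM is agnostic to the recurrent function, the cell, parameter names, and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 6                       # input and hidden sizes (illustrative)
W_x = rng.normal(size=(d_h, d_in))     # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))      # hidden-to-hidden weights
b = np.zeros(d_h)

def encode(xs):
    """Read x_0..x_n and return the final state h_n as the text representation."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)   # h_i depends on x_{0:i} and h_{0:i-1}
    return h

document = rng.normal(size=(5, d_in))        # 5 toy word vectors
h_n = encode(document)                       # fixed-size representation of the document
print(h_n.shape)                             # (6,)
```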

18

## RNNLM + Logistic regression steps

###
No RNN output layer y is needed!

- take RNN state as input: X = h_n

- compute the score for class c: f_c = sum_i(beta_ci * X_i)

- apply a nonlinearity: m_c = sigma(f_c)

- compute p(c|d) by normalising over all classes: p(c|d) = softmax(m)_c

- loss function: cross-entropy between the estimated class distribution p(c|d) and the true distribution (see the sketch below):

L_i = - sum_c( y_c * log P(c|d_i) )
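
A minimal numpy sketch of these steps on top of a final RNN state h_n; the beta matrix, the sigmoid nonlinearity, and the toy shapes follow the recipe above and are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, C = 6, 3                           # hidden size and number of classes (illustrative)
beta = rng.normal(size=(C, d_h))        # per-class weights beta_ci

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(h_n, y_true):
    X = h_n                             # take the RNN state as input, X = h_n
    f = beta @ X                        # f_c = sum_i(beta_ci * X_i)
    m = 1.0 / (1.0 + np.exp(-f))        # m_c = sigma(f_c)
    p = softmax(m)                      # p(c|d) = softmax(m)_c
    loss = -np.sum(y_true * np.log(p))  # L_i = -sum_c(y_c * log P(c|d_i))
    return p, loss

h_n = rng.normal(size=d_h)              # e.g. the encode() output from the previous card
y_true = np.array([0.0, 1.0, 0.0])      # one-hot true class
print(classify(h_n, y_true))
```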

19

## RNN mechanics

### It compresses the entire history into a fixed-length vector, which lets it capture 'long-range' correlations

20

## LSTM gates

###
- provide a way to optionally let information through

- are composed out of a sigmoid neural net layer and a pointwise multiplication operation

- the LSTM cell uses three such gates (see the sketch below):

- 'forget gate layer' (decides what to discard from the cell state)

- 'input gate layer' (decides which new information from the input to store)

- 'output gate layer' (decides what part of the cell state to output as the hidden state)
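
A minimal numpy sketch of one LSTM step showing the three gates acting through a sigmoid layer and pointwise multiplication; the weight shapes and the concatenated [h_prev, x] input follow the common formulation and are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 5
d = d_x + d_h                                    # gates read the concatenated [h_prev, x]
W_f, W_i, W_o, W_c = (rng.normal(size=(d_h, d)) for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(d_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)         # forget gate: what to keep from the old cell state
    i = sigmoid(W_i @ z + b_i)         # input gate: how much of the candidate to write
    o = sigmoid(W_o @ z + b_o)         # output gate: what to expose as the new hidden state
    c_tilde = np.tanh(W_c @ z + b_c)   # candidate cell update
    c = f * c_prev + i * c_tilde       # sigmoid gate + pointwise multiplication
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h))
print(h.shape, c.shape)                # (5,) (5,)
```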

21