Statistical Language Modelling Flashcards
(11 cards)
what is NLP
Natural Language Processing builds systems that use computational techniques to model and process natural language in an automated way.
what is word-level processing
Before doing any text processing, we need to prepare our input data by splitting it into sentences, then words, then tokens.
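A minimal sketch of that preparation step, assuming a naive rule-based splitter is good enough for illustration. The function names `split_sentences` and `tokenize` and both regular expressions are hypothetical choices, not part of the card; real text (abbreviations, quotes, URLs) needs a proper tokenizer.

```python
import re

def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lowercase, then keep runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

text = "The cat sat. The dog barked!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```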
What are three applications of predicting the probability of a sequence
- Spellchecking
- Grammatical error correction
- Autocomplete/suggestions
what are the 4 n-grams
Unigrams
Bigrams
Trigrams
Quadrigrams
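The four sizes above differ only in window width, which can be sketched with one helper. The name `ngrams` and the example sentence are assumptions made for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams"), (4, "quadrigrams")]:
    print(name, ngrams(tokens, n))
```

A sequence of 6 tokens yields 6 unigrams, 5 bigrams, 4 trigrams and 3 quadrigrams, so longer windows also give fewer observations from the same text.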
what is smoothing
techniques that reserve a small, non-zero probability for unseen combinations without distorting the overall statistics of the training set
what are the 3 types of smoothing
- Laplace smoothing
- add-k smoothing
- Kneser-Ney smoothing
what is laplace smoothing
adds one to every n-gram count before normalizing, so unseen n-grams get a small non-zero probability.
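A rough sketch of Laplace smoothing for bigrams, assuming the usual estimate (count(prev, word) + 1) / (count(prev) + V) with V the vocabulary size. The function name `laplace_bigram_prob` and the toy corpus are hypothetical.

```python
from collections import Counter

def laplace_bigram_prob(prev, word, bigram_counts, context_counts, vocab_size):
    # Add 1 to the bigram count and V to the context count so the
    # probabilities over all possible next words still sum to 1.
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
context_counts = Counter(prev for prev, _ in bigrams)
vocab = sorted(set(tokens))

p_seen = laplace_bigram_prob("the", "cat", bigram_counts, context_counts, len(vocab))
p_unseen = laplace_bigram_prob("the", "sat", bigram_counts, context_counts, len(vocab))
print(p_seen, p_unseen)
```

Here "the sat" never occurs in the corpus, yet it still receives a non-zero probability, at the cost of lowering the estimate for the observed "the cat".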
what is add-k smoothing
rather than adding 1 to every count, we add an arbitrary constant k (typically between 0 and 1); Laplace smoothing is the special case k = 1.
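To illustrate the effect of k, the sketch below assumes the estimate (count(prev, word) + k) / (count(prev) + k * V) and compares a few values. The name `add_k_prob`, the toy corpus and the chosen k values are assumptions for illustration only.

```python
from collections import Counter

def add_k_prob(prev, word, bigram_counts, context_counts, vocab_size, k):
    # k = 1 reproduces add-one smoothing; smaller k moves less
    # probability mass from seen bigrams to unseen ones.
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

tokens = "the cat sat on the mat".split()
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
context_counts = Counter(prev for prev, _ in bigrams)
vocab_size = len(set(tokens))

results = {}
for k in (1.0, 0.5, 0.05):
    seen = add_k_prob("the", "cat", bigram_counts, context_counts, vocab_size, k)
    unseen = add_k_prob("the", "sat", bigram_counts, context_counts, vocab_size, k)
    results[k] = (seen, unseen)
    print(f"k={k}: seen={seen:.4f} unseen={unseen:.4f}")
```

In practice k is usually tuned on held-out data rather than fixed in advance.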
which n-gram order (value of n) is better?
- For higher n we capture more context and so we can make better predictions
- but higher n also needs more training data, and the counts inevitably become sparse
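The sparsity side of this trade-off can be sketched by measuring how many test n-grams were never seen in training as n grows. The two tiny corpora and the helper names `ngrams` and `unseen_rate` are made up for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_rate(train_tokens, test_tokens, n):
    # Fraction of test n-grams that never occur in the training data.
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    return sum(1 for g in test if g not in seen) / len(test)

train = "the cat sat on the mat . the dog sat on the rug .".split()
test = "the dog sat on the cat .".split()

rates = {n: unseen_rate(train, test, n) for n in (1, 2, 3, 4)}
for n, rate in rates.items():
    print(f"n={n}: {rate:.2f} of test n-grams unseen in training")
```

Every test word appears in training, but the share of unseen test n-grams rises with n, which is why higher orders need more data or smoothing.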
How can we maximize probabilities
Given a training set and a test set, we want the model to maximize the probability of the test set. For bigrams this means we want to maximize:
$P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
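A small sketch of scoring a sentence under a bigram model as a product of conditional probabilities, computed as a sum of logs to avoid underflow. Several choices here are assumptions not stated on the card: the `<s>` start marker standing in for the word before the first one, add-one smoothing to avoid log(0), counting `<s>` in the vocabulary size for simplicity, and the names `train_bigram` and `log_prob`.

```python
import math
from collections import Counter

def train_bigram(sentences):
    bigram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()  # "<s>" is an assumed start marker
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts, len(vocab)

def log_prob(sentence, bigram_counts, context_counts, vocab_size):
    # Sum of log P(w_i | w_{i-1}), with add-one smoothing (an assumption).
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(p)
    return total

model = train_bigram(["the cat sat", "the dog sat"])
likely = log_prob("the cat sat", *model)
unlikely = log_prob("sat the cat", *model)
print(likely, unlikely)
```

A model that fits the test data better assigns it a higher (less negative) log-probability, so the sentence matching the training word order scores above the scrambled one.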