Week 4 Flashcards

1
Q

Text pre-processing steps

A

Document-level preparation
- Document conversion
- Language/domain identification

Tokenisation
- Case folding

Basic lexical pre-processing
- Lemmatization
- Stemming
- Spelling correction

2
Q

Lexical Processing

A

Processing words in a given context

Two main tasks:
- Normalisation (stemming, lemmatisation)
- POS tagging (verbs, nouns, etc.)
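
As a rough illustration, both tasks can be tried out with NLTK; this sketch assumes nltk is installed and its tagger and wordnet data have been downloaded, and is not part of the lecture material:

    # Sketch: the two lexical processing tasks via NLTK
    # Assumes: pip install nltk, then nltk.download('averaged_perceptron_tagger')
    # and nltk.download('wordnet') have been run
    import nltk
    from nltk.stem import WordNetLemmatizer

    tokens = ["the", "cats", "were", "walking"]
    print(nltk.pos_tag(tokens))  # POS tagging: [('the', 'DT'), ('cats', 'NNS'), ...]
    print(WordNetLemmatizer().lemmatize("cats"))  # normalisation: -> 'cat'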

3
Q

Normalisation

A

Map tokens to normalised forms

{walks, walked, walk, walking} -> walk

Two principal approaches:
- Lemmatisation
- Stemming
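
A minimal sketch of both approaches on the example above, assuming NLTK is installed (with its wordnet data for the lemmatizer):

    # Sketch: {walks, walked, walk, walking} -> walk under both approaches
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    for form in ["walks", "walked", "walk", "walking"]:
        print(stemmer.stem(form),                   # stemming: crude suffix stripping
              lemmatizer.lemmatize(form, pos="v"))  # lemmatisation: needs a POS hint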

4
Q

Lemmatisation

A

Reduction to “dictionary headword” form (lemma)

Language Dependent

How to do it:
- Dictionaries (look-up might be slow)
- Morphological analysis

5
Q

Morphological analysis

A

Subwords (morphemes) carry some 'meaning'

un + happy + ness = unhappiness

Take a word and work out what its stem and affixes are

Typically a rule-based approach

Regular inflectional morphology is "easy" (in English):
- Nouns are simple
- Verbs are only slightly more complex
- Irregular words are difficult (may still need a dictionary)

Depends on context

Any approach is quite computationally expensive
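
To make the rule-based idea concrete, here is a toy sketch; the affix lists are hypothetical and far from complete:

    # Sketch: naive rule-based morphological analysis
    PREFIXES = ["un", "re", "dis"]         # illustrative only
    SUFFIXES = ["ness", "ing", "ed", "s"]  # illustrative only

    def split_morphemes(word):
        parts, tail = [], []
        for p in PREFIXES:                 # strip at most one known prefix
            if word.startswith(p):
                parts.append(p)
                word = word[len(p):]
                break
        for s in SUFFIXES:                 # strip at most one known suffix
            if word.endswith(s):
                tail.insert(0, s)
                word = word[: -len(s)]
                break
        return parts + [word] + tail

    print(split_morphemes("unhappiness"))  # -> ['un', 'happi', 'ness']

Note the recovered stem is "happi", not "happy": even toy examples hit spelling rules (y -> i), which is one reason real morphological analysis is computationally expensive.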

6
Q

Morpheme

A

The smallest meaningful constituent part of a word
"dog" and "-s" are morphemes, but "dogs" is not, since it can still be broken down

7
Q

Derivational morphology is messy

A

Quasi-systematicity
Irregular meaning change
Changes of word class

8
Q

Stemming

A

Chop ends of words (typically) before indexing
- Remove suffixes, possibly prefixes

studies -> studi
studying -> study

Language dependent, often heuristic, crude

May yield forms that are not words
automate, automatic -> automat

Neutralises inflections and some derivation

Merge related words into conflation groups

Useful for some applications, but may also be aggressive and conflate some unrelated groups
e.g. experiment, experience -> experi
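
For instance, with NLTK's PorterStemmer (exact outputs may vary slightly between Porter implementations):

    # Sketch: conflation groups, including an over-aggressive merge
    from nltk.stem import PorterStemmer  # assumes nltk is installed

    stemmer = PorterStemmer()
    for w in ["automate", "automatic", "experiment", "experience"]:
        print(w, "->", stemmer.stem(w))
    # automate and automatic share the group "automat";
    # experiment and experience are conflated to "experi" despite being unrelated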

9
Q

conflation groups

A

The groups stemming merges related words into

10
Q

Inflection

A

The modification of a word to express different grammatical categories / roles:

tense
person
number

11
Q

Under-stemming

A

Failure to conflate related forms
divide -> divid
division -> divis

12
Q

Over-stemming

A

Conflates unrelated forms
neutron, neutral -> neutr

13
Q

Porter Stemmer

A

One of the most common algorithms for stemming in English

Rule-based: implements specific patterns

Implementations are readily available in many programming languages

Main idea: “model” what the endings look like and strip them

In English, patterns depend in particular on ‘syllables’ at the end of the word and how long the word is

Step 1: Get rid of plurals and -ed or -ing suffixes
eed -> ee if m > 0: agreed -> agree

Step 2: Turn terminal y to i when there is another vowel in the stem, if m > 0

Step 3: Map double suffixes to single ones, if m > 0

Step 4: Deal with suffixes: -ful, -ness, etc., if m > 0

Step 5: Take off -ant, -ence, etc., if m > 1

Step 6: Tidy up (remove -e and double letters)
If m > 1 and the last letter is e, remove it
probate -> probat

If m > 1 and the word ends in a double letter that is l, reduce it to a single one
controll -> control

Other rules can be applied
- Conflict resolution: if several rules apply, use the one with the longest matching suffix for the given word (see the sketch below)
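
A sketch of how such a rule table can be applied; the rules here are the real Step 1 plural rules, and the longest-suffix-first ordering implements the conflict resolution just described:

    # Sketch: Porter's plural rules with longest-matching-suffix resolution
    RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

    def step1a(word):
        for suffix, replacement in RULES:  # ordered longest-first; first match wins
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    print(step1a("caresses"))  # -> caress
    print(step1a("ponies"))    # -> poni
    print(step1a("cats"))      # -> cat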

14
Q

Spelling error probabilities

A

0.05% in carefully edited newswire

26% in web queries

On average, 1 word in every tweet is misspelled

80% of errors are single-error mistakes (one edit)

Almost all errors within edit distance 2

15
Q

Types of spelling error

A

Non-word errors
graffe -> giraffe

Real-word errors
piece -> peace (detecting these needs context)

16
Q

Non-word spelling errors

A

Non-word errors
graffe -> giraffe

17
Q

Real-word spelling errors

A

piece -> peace (detecting these needs context)

18
Q

Spelling Error Mistakes

A

The four single-edit operations, illustrated on "the":
Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)

19
Q

Causes of spelling errors

A

Typographical
- Keyboard-related errors (80% of novice errors, 51% overall)
- Immediately adjacent keys (50% of novice errors, 31% of all)

Cognitive
- Phonetic - seperate -> separate
- Homophones - there -> their, piece -> peace

Optical character recognition
- “D” -> “O”, “ri” -> “n”

20
Q

Spelling error detection

A

Non-word errors
Any word not in a dictionary is an error
- The larger the dictionary the better … up to a point
- (The web is full of mis-spellings, so the web isn’t necessarily a good option)
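
A minimal sketch of dictionary-based detection (the word list here is a tiny illustration):

    # Sketch: non-word error detection by dictionary look-up
    dictionary = {"the", "giraffe", "piece", "peace"}

    def is_nonword_error(token):
        return token.lower() not in dictionary

    print(is_nonword_error("graffe"))  # True: flagged as a non-word
    print(is_nonword_error("peace"))   # False: real-word errors slip through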

21
Q

Generating spelling error candidates

A

Take the four common errors
Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)
And generate candidates that are <= k edit distance from the error

Can compute these quickly using a Levenshtein finite state transducer

Or by using a precomputed list of possible corrections
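
A sketch of the generation step for k = 1, modelled on Peter Norvig's well-known spell-corrector essay (apply it twice for k = 2):

    # Sketch: all strings within edit distance 1 of a word
    import string

    def edits1(word):
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutions = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + substitutions + inserts)

    print("the" in edits1("ther"))  # True: one deletion undoes the insertion error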

22
Q

Information Retrieval paradigm

A

We’d like to get the best of something. Instead of finding the true best, we:
- Find a (very) good candidate set (e.g. spelling candidates within edit distance k <= 2)
- Find the K best amongst them and return them as the best

These may not actually be the best overall
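
A toy sketch of the paradigm; both stages here are invented stand-ins, not real models:

    # Sketch: find a good candidate set, then return the K best within it
    def candidates(x):
        # stand-in for e.g. "all strings within edit distance 2 of x"
        return {"the", "then", "thew", "her"}

    def score(w):
        # stand-in for a real ranking model; the numbers are invented
        toy = {"the": 0.9, "then": 0.4, "thew": 0.01, "her": 0.2}
        return toy[w]

    def best_k(x, K=2):
        return sorted(candidates(x), key=score, reverse=True)[:K]

    print(best_k("ther"))  # -> ['the', 'then'] (best within the candidate set)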

23
Q

Methods to estimate P(x|w) for noisy channel / error probability model

A

Simple local factors - consider the most important factors predicting an insertion, deletion, or transposition
- Unequal weightings attached to different editing operations
- Insertion and deletion probabilities are conditioned on context: the probability of inserting or deleting a character depends on the letter immediately to its left
- Positional information: the position in the string where the edit occurs
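
Putting the pieces together, a candidate w is ranked in the noisy channel model by P(x|w) * P(w); the probabilities below are invented purely for illustration:

    # Sketch: noisy-channel ranking of correction candidates
    P_w = {"the": 0.02, "thew": 0.000001}       # word prior (language model)
    P_x_given_w = {("ther", "the"): 0.0001,     # error model, e.g. estimated
                   ("ther", "thew"): 0.00001}   # from confusion-matrix counts

    def best_correction(x, candidates):
        return max(candidates, key=lambda w: P_x_given_w[(x, w)] * P_w[w])

    print(best_correction("ther", ["the", "thew"]))  # -> the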

24
Q

How does Google do spellchecking?

A

Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries

25
Q

How to find candidates for real-word spelling errors

A

For each word w, generate a candidate set:
- similar pronunciations
- similar spellings
- include w itself in the set

Choose the best candidate:
- Noisy channel view of spelling errors
- Context-sensitive, so consider whether the surrounding words "make sense" (see the sketch below)
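
A toy sketch of the context-sensitive choice, scoring each candidate against its neighbours with invented bigram counts:

    # Sketch: choosing between real-word candidates using context
    BIGRAM_COUNTS = {("of", "peace"): 50, ("of", "piece"): 2,
                     ("peace", "talks"): 30, ("piece", "talks"): 0}

    def context_score(prev_word, w, next_word):
        return (BIGRAM_COUNTS.get((prev_word, w), 0)
                + BIGRAM_COUNTS.get((w, next_word), 0))

    candidates = ["piece", "peace"]  # includes the typed word itself
    print(max(candidates, key=lambda w: context_score("of", w, "talks")))  # -> peace
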
26
Q

Porter Stemmer Notation - C

A

A list of consonants (c) of length one or more

27
Q

Porter Stemmer Notation - V

A

A list of vowels (v) of length one or more

28
Q

Porter Stemmer Notation - []

A

[C] means the arbitrary (i.e. optional) presence of a consonant sequence C

29
Q

Porter Stemmer Notation - (condition)^m

A

The condition repeated m times, e.g. (VC)^m

30
Q

Porter Stemmer pattern that words are decomposed into

A

[C](VC)^m[V]

m is then referred to as the word's measure
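
A rough sketch of computing m, ignoring Porter's special treatment of y (which can act as either vowel or consonant):

    # Sketch: Porter's measure m, where words match [C](VC)^m[V]
    import re

    def measure(word):
        # count vowel-run + consonant-run pairs; each is one VC in (VC)^m
        return len(re.findall(r"[aeiou]+[^aeiou]+", word.lower()))

    print(measure("tree"))     # -> 0
    print(measure("trouble"))  # -> 1
    print(measure("oaten"))    # -> 2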