Week 4 Flashcards

1
Q

Text pre-processing steps

A

Document-level preparation
- Document conversion
- Language/domain identification

Tokenisation
- Case folding

Basic lexical pre-processing
- Lemmatization
- Stemming
- Spelling correction
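
A minimal sketch of these steps in Python, assuming NLTK is installed and its tokeniser data has been downloaded:

```python
# A rough pipeline sketch: tokenise, case-fold, then stem.
# Assumes NLTK is installed (pip install nltk) and the tokeniser
# data has been fetched with nltk.download("punkt").
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = "The studies were studying tokenisation."

tokens = word_tokenize(text)              # tokenisation
tokens = [t.lower() for t in tokens]      # case folding
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # basic lexical pre-processing
```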

2
Q

Lexical Processing

A

Words in a given context

Two main tasks:
Normalisation (stemming, lemmatisation)
POS Tagging (verbs, nouns, etc…)

3
Q

Normalisation

A

Map tokens to normalised forms

{walks, walked, walk, walking} -> walk

Two principal approaches:
- Lemmatisation
- Stemming

4
Q

Lemmatisation

A

Reduction to “dictionary headword” form (lemma)

Language Dependent

How to do:
Dictionaries - look-up might be slow
Morphological analysis
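
A minimal sketch of dictionary-based lemmatisation using NLTK's WordNet lemmatizer (assumes the WordNet data has been downloaded):

```python
# A minimal sketch of dictionary-based lemmatisation. Assumes
# nltk.download("wordnet") has been run.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Look-up is dictionary-based and language dependent; the POS hint
# ("v" for verb) steers the look-up.
for word in ["walks", "walked", "walking"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # all map to "walk"
```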

5
Q

Morphological analysis

A

Subwords (morphemes) have some ‘meaning’

un + happy + ness = unhappiness

Take a word and see what its stem and affixes are

Typically rule-based approach

Regular inflectional morphology is “easy” (in English)
Nouns are simple
Verbs are only slightly more complex
Irregular words are difficult (may still need a dictionary)

Depends on context

Any approach is quite computationally expensive
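
A naive, purely illustrative affix-stripping sketch; the affix lists below are invented, and real analysers use much richer rules plus a dictionary:

```python
# A naive affix-stripping sketch. The affix lists are invented for
# illustration; real analysers use rich rule sets plus a dictionary.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word):
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p):
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    stem = word
    for s in SUFFIXES:
        if word.endswith(s):
            stem = word[:-len(s)]
            break
    morphemes.append(stem)
    if stem != word:
        morphemes.append("-" + word[len(stem):])
    return morphemes

# Note the spelling change (happy -> happi) that rule-based
# approaches must also handle.
print(segment("unhappiness"))  # ['un-', 'happi', '-ness']
```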

6
Q

Morpheme

A

The smallest constituent part of a word
Could be “dog” or “-s”, but not “dogs”; a morpheme cannot be broken down further

7
Q

Derivational morphology is messy

A

Quasi-systematicity
Irregular meaning change
Changes of word class

8
Q

Stemming

A

Chop the ends off words (typically before indexing)
- Remove suffixes, possibly prefixes

studies -> studi
studying -> study

Language dependent, often heuristic, crude

May yield forms that are not words
automate, automatic -> automat

Neutralises inflections and some derivation

Merge related words into conflation groups

Useful for some applications, but may also be aggressive and conflate some unrelated groups
e.g. experiment, experience -> experi
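
A minimal sketch with NLTK's Porter stemmer reproducing the conflation example above:

```python
# A minimal sketch with NLTK's Porter stemmer, showing that stems
# need not be real words and that related words conflate.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studying", "experiment", "experience"]:
    print(word, "->", stemmer.stem(word))
# "experiment" and "experience" both stem to "experi", forming one
# conflation group; "studi" is not itself an English word.
```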

9
Q

Conflation groups

A

The groups stemming merges related words into

10
Q

Inflection

A

The modification of a word to express different grammatical categories / roles:

tense
person
number

11
Q

Under-stemming

A

Failure to conflate related forms
divide -> divid
division -> divis

12
Q

Over-stemming

A

Conflates unrelated forms
neutron, neutral -> neutr

13
Q

Porter Stemmer

A

One of the most common algorithms for stemming in English

Rule-based = implement specific patterns

readily available in many languages

Main idea: “model” what the endings look like and strip them

In English, patterns depend in particular on ‘syllables’ at the end of the word and how long the word is

Step 1: Get rid of plurals and -ed or -ing suffixes
eed -> ee, agreed -> agree
If m > 0

Step 2: Turn terminal y to i when there is another vowel in the stem
If m > 0

Step 3: Map double suffixes to single ones
If m > 0

Step 4: Deal with suffixes, -ful, -ness, etc…
If m > 0

Step 5: Take off -ant -ence, etc…
If m > 1

Step 6: Tidy up (remove -e and double letters)
if m > 1 and last letter is e, remove.
probate -> probat

if m > 1 and the word ends in a double consonant whose last letter is l, reduce to a single letter
controll -> control

Other rules can be applied
- conflict resolution: if more than one rule applies, use the one with the longest matching suffix for the given word
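
A hypothetical sketch of a single rule pass with this conflict resolution; the rule table is only Step 1's plural rules, not the full Porter set:

```python
# A hypothetical sketch of one Porter-style rule pass: of all
# matching rules, apply the one with the longest suffix.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_rules(word, rules):
    matches = [(suf, rep) for suf, rep in rules if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda m: len(m[0]))  # longest suffix wins
    return word[:len(word) - len(suf)] + rep

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_rules(w, RULES))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```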

14
Q

Spelling error probabilities

A

0.05% in carefully edited newswire

26% in web queries

On average, 1 word in every tweet is misspelled

80% of errors are single-error mistakes

Almost all errors within edit distance 2

15
Q

Types of spelling error

A

Non-word errors
graffe -> giraffe

Real-word errors
piece -> peace (this needs context to detect)

16
Q

Non-word spelling errors

A

Non-word errors
graffe -> giraffe

17
Q

Real-word spelling errors

A

piece -> peace (this needs context to detect)

18
Q

Spelling Error Mistakes

A

Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)
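
A sketch of Damerau-Levenshtein edit distance, which counts exactly these four operations:

```python
# A sketch of Damerau-Levenshtein edit distance, counting the four
# operations above (insertion, deletion, substitution, transposition).
def edit_distance(a: str, b: str) -> int:
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(b) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("teh", "the"))  # 1 (transposition)
print(edit_distance("thw", "the"))  # 1 (substitution)
```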

19
Q

Causes of spelling errors

A

Typographical
- Keyboard related errors (80% novice, 51% overall)
- Immediately adjacent keys (50% of novice, 31% of all)

Cognitive
- Phonetic - seperate -> separate
- Homophones - there -> their, piece -> peace

Optical character recognition
- “D” -> “O”, “ri” -> “n”

20
Q

Spelling error detection

A

Non-word errors
Any word not in a dictionary is an error
- The larger the dictionary the better … up to a point
- (The web is full of misspellings, so the web isn’t necessarily a good option)
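
A minimal sketch of dictionary look-up detection; the word list is a toy stand-in for a real dictionary:

```python
# A minimal sketch of dictionary-based non-word error detection;
# the word list here is a toy stand-in for a real dictionary.
DICTIONARY = {"the", "giraffe", "piece", "peace"}

tokens = ["teh", "giraffe", "peace"]
non_word_errors = [t for t in tokens if t not in DICTIONARY]
print(non_word_errors)  # ['teh'] - real-word errors like "peace" slip through
```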

21
Q

Generating spelling error candidates

A

Take the four common errors
Insertion (ther)
Deletion (th)
Substitution (thw)
Transposition (teh)
And generate candidates that are at most k edits away from the error

These can be computed fast using a Levenshtein Finite State Transducer

Or by using a precomputed list of possible corrections
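
A sketch of edit-distance-1 candidate generation in the style of Peter Norvig's well-known corrector:

```python
# A sketch of candidate generation via the four edit operations,
# in the style of Peter Norvig's spelling corrector.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutions = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | substitutions | inserts

def edits2(word):
    # Candidates within edit distance 2: apply edits1 twice.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print(len(edits1("the")))  # all strings one edit away from "the"
```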

22
Q

Information Retrieval paradigm

A

We’d like the overall best candidate, but instead of searching for it directly we:
Find a (very) good candidate set (e.g. spelling candidates within edit distance 2)
Find the K best amongst them and return those as the best

These may not actually be the best

23
Q

Methods to estimate P(x|w) for noisy channel / error probability model

A

Simple local factors - consider the most important factors predicting an insertion, deletion, or transposition
- Unequal weightings attached to different editing operations
- Insertion and deletion probabilities are conditioned on context. The probability of inserting or deleting a character is conditioned on the letter appearing immediately to the left of that character
- Positional information: the position in the string where the edit occurs
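
A hypothetical sketch of noisy-channel scoring, choosing the w that maximises P(x|w)P(w); both probability tables are invented for illustration:

```python
# A hypothetical sketch of noisy-channel scoring: given observed x,
# choose the candidate w maximising P(x|w) * P(w). Both tables are
# invented for illustration.
import math

prior = {"the": 0.02, "thaw": 0.0001}   # P(w), e.g. from corpus counts
error_model = {("thw", "the"): 1e-4,    # P(x|w) for each (observed, intended)
               ("thw", "thaw"): 1e-6}

def best_correction(x, candidates):
    return max(candidates,
               key=lambda w: math.log(error_model.get((x, w), 1e-12))
                             + math.log(prior.get(w, 1e-12)))

print(best_correction("thw", ["the", "thaw"]))  # the
```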

24
Q

How does Google do spellchecking?

A

Google figures out possible misspellings and their likely correct spellings by using words it finds while searching the web and processing user queries

25
Q

How to find candidates for real world spelling errors

A

For each word w, generate a candidate set:
- similar pronunciations
- similar spellings
- include w in the set

Choose best candidate
- Noisy Channel view of spell errors
- Context-sensitive - so have to consider whether the surrounding words “make sense”
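
A hypothetical sketch of the context-sensitive choice using a toy bigram language model; the probabilities are invented for illustration:

```python
# A hypothetical sketch of context-sensitive (real-word) correction:
# score each candidate sentence with a bigram language model.
# The bigram probabilities are invented for illustration.
import math

bigram = {("<s>", "two"): 1e-2,
          ("two", "pieces"): 1e-3,
          ("two", "peaces"): 1e-8}

def sentence_logprob(words):
    words = ["<s>"] + words
    return sum(math.log(bigram.get((w1, w2), 1e-10))
               for w1, w2 in zip(words, words[1:]))

# The candidate set includes the observed word itself plus neighbours.
candidates = [["two", "peaces"], ["two", "pieces"]]
print(max(candidates, key=sentence_logprob))  # ['two', 'pieces']
```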

26
Q

Porter Stemmer Notation - C

A

A list of one or more consonants (c)

27
Q

Porter Stemmer Notation - V

A

A list of one or more vowels (v)

28
Q

Porter Stemmer Notation - []

A

[C] denotes the optional presence of a consonant list (it may or may not appear)

29
Q

Porter Stemmer Notation - (condition)m

A

The condition repeated m times

30
Q

Porter Stemmer Pattern words are decomposed into

A

[C](VC)^m[V]

m is then referred to as the word’s measure
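
A sketch that computes m by collapsing a word into its consonant/vowel pattern (treating y as a consonant for simplicity; the real algorithm treats y contextually):

```python
# A sketch computing the Porter measure m from the decomposition
# [C](VC)^m[V]. "y" is treated as a consonant here for simplicity.
def measure(word):
    vowels = set("aeiou")
    # Collapse the word into its C/V pattern, e.g. "tree" -> "CV".
    pattern = ""
    for ch in word.lower():
        cv = "V" if ch in vowels else "C"
        if not pattern or pattern[-1] != cv:
            pattern += cv
    return pattern.count("VC")  # number of (VC) repetitions = m

print(measure("tree"))     # 0: pattern CV,   [C](VC)^0[V]
print(measure("trouble"))  # 1: pattern CVCV
print(measure("oaten"))    # 2: pattern VCVC
```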