Tokenization - Week 2 Flashcards

1
Q

Tokenization

A

Break input into basic units

Words, numbers, punctuation, emoji

There are no firm rules, but choices should be consistent with the rest of the NLP system

It’s about knowing when to split, not when to combine
- Avoid over-segmentation

Tokenization is only the first step and should be simple, but it affects all later steps
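
A minimal sketch (my own illustration, not from the lecture) of this idea, using a regex that keeps word/number runs together and makes every other non-space character its own token:

    import re

    # A word/number run, or any single other non-space character (punctuation, symbols)
    TOKEN_RE = re.compile(r"\w+|[^\w\s]")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("The cat sat, didn't it?"))
    # ['The', 'cat', 'sat', ',', 'didn', "'", 't', 'it', '?']  <- clitics still need extra rules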

2
Q

Does whitespace always work for tokenization?

A

No, not all languages use spaces between tokens

Other complications include right-to-left / mixed-direction scripts such as Arabic

Japanese uses several writing systems
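
A quick illustration (assumed examples, not from the slides) of where plain whitespace splitting breaks down:

    text = "I bought a book."
    print(text.split())            # ['I', 'bought', 'a', 'book.']  <- punctuation sticks to the word
    print("東京に行きます".split())   # ['東京に行きます']  <- no spaces to split on in Japanese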

3
Q

Tokenization things to deal with

A

Hyphens

Should places like “San Francisco” be one token or two?

In the case of a hyphen: 1, 2 or 3 tokens?

Might tokenise on the hyphen itself

Abbreviated forms - can’t, what’re, I’m
These are tricky because of the genitive apostrophe: “the King’s speech” shouldn’t become “the King is speech”

Decisions depend on the context and on what you want to do
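
A small sketch of the hyphen question above (the word and the three policies are my own illustration):

    import re

    w = "UK-based"
    print([w])                      # 1 token: keep the hyphenated form
    print(re.split(r"-", w))        # 2 tokens: drop the hyphen -> ['UK', 'based']
    print(re.findall(r"\w+|-", w))  # 3 tokens: hyphen on its own -> ['UK', '-', 'based']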

4
Q

Typical tokenisation steps

A
  1. Initial Segmentation
  2. Handling abbreviations and apostrophes
  3. Handling hyphenation
  4. Dealing with (other) special expressions
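
A rough sketch of how these steps might be composed in code (all regexes and rules here are simplified assumptions, not the course's reference implementation):

    import re

    def tokenize(text):
        # 1. Initial segmentation: split off punctuation but keep hyphens,
        #    apostrophes and '.'/'@' inside words for the later steps
        tokens = re.findall(r"\w+(?:['\-.@]\w+)*|[^\w\s]", text)
        # 2. Abbreviations and apostrophes: split off common English clitics
        out = []
        for tok in tokens:
            m = re.search(r"(?i)(n't|'s|'re|'m|'ll|'ve|'d)$", tok)
            if m and m.start() > 0:
                out.extend([tok[:m.start()], m.group(1)])
            else:
                out.append(tok)
        # 3./4. Hyphenated words and special expressions (emails etc.) are
        #       simply left as single tokens in this toy version
        return out

    print(tokenize("She's a multi-disciplinary researcher, isn't she?"))
    # ['She', "'s", 'a', 'multi-disciplinary', 'researcher', ',', 'is', "n't", 'she', '?']
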
5
Q

All punctuation that is not part of an acronym should be a separate token?

A

True

6
Q

Types of apostrophe

A

Quotative markers - e.g. ‘All Quiet on the Western Front’

Genitive markers - e.g. Oliver’s Book

Enclitics - e.g. she’s -> she has or she is

7
Q

Enclitics

A

Abbreviated forms, typically of auxiliary verbs (be, will), that are pronounced with so little emphasis that they are shortened and form part of the preceding word
e.g. I’m, you’re, she’s, can’t

8
Q

Lexical hyphens

A

A true hyphen used in compound words which have made their way into the standard vocabulary (and should be kept)

meta-analysis
multi-disciplinary
self-assessment

Some hyphens are not lexical, e.g. “UK-based”
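
One possible (hypothetical) rule for this: keep the hyphen when the first part is a known lexical prefix, otherwise split:

    LEXICAL_PREFIXES = {"meta", "multi", "self"}   # assumed whitelist for illustration

    def split_hyphen(token):
        head, _, tail = token.partition("-")
        if tail and head.lower() not in LEXICAL_PREFIXES:
            return [head, tail]
        return [token]

    print(split_hyphen("multi-disciplinary"))  # ['multi-disciplinary']
    print(split_hyphen("UK-based"))            # ['UK', 'based']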

9
Q

Tokenization special expression examples

A

emails
dates
numbers (formats differ between languages)
measures
vehicle license numbers
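
Such expressions are often caught with dedicated patterns before general tokenization; a toy sketch (the patterns are illustrative assumptions, not production-grade):

    import re

    SPECIAL = [
        ("EMAIL",  r"[\w.+-]+@[\w-]+\.[\w.]+"),
        ("DATE",   r"\d{1,2}/\d{1,2}/\d{2,4}"),
        ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPECIAL))

    text = "Email alice@example.org by 12/01/2025 about the 1,000.50 refund."
    for m in MASTER.finditer(text):
        print(m.lastgroup, m.group())
    # EMAIL alice@example.org
    # DATE 12/01/2025
    # NUMBER 1,000.50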

10
Q

Truecasing

A

Lowercase words at the beginning of sentences; leave mid-sentence capitalised words as they are
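
A minimal sketch of that rule (assuming the input is already split into sentences and tokens):

    def truecase(sentence_tokens):
        # Lowercase only the sentence-initial token; leave everything else unchanged
        if sentence_tokens:
            sentence_tokens = [sentence_tokens[0].lower()] + sentence_tokens[1:]
        return sentence_tokens

    print(truecase(["The", "cat", "visited", "Paris"]))
    # ['the', 'cat', 'visited', 'Paris']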

11
Q

Case folding

A

Should we truecase, or make everything lowercase? Some proper nouns are only identifiable by their casing, e.g. “US” would become “us”

Search engines: users usually lowercase everything regardless of the correct case of words, so lowercasing everything seems a practical solution

Identification of entity names (e.g. organisation or people names): preserving capitals would make sense

Also consider special characters, like accents, emojis, etc.
Users might not type an accent when googling something, even if there should technically be one
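
A small case-folding sketch (lowercasing plus accent stripping via Unicode normalisation; my own illustration):

    import unicodedata

    def fold(text):
        # Lowercase, decompose accented characters (NFKD), then drop combining marks
        decomposed = unicodedata.normalize("NFKD", text.lower())
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(fold("Café in the US"))  # 'cafe in the us'  <- note that 'US' is lost as a proper noun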

12
Q

Language model

A

A function (often probabilistic) that assigns a probability to a piece of text, such that ‘natural’ pieces of text get a larger probability

P(“the the the the the”) - pretty small
P(“the cat in the hat”) - pretty big
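
A toy bigram language model showing this effect (the tiny corpus and smoothing-free estimates are assumptions for illustration):

    from collections import Counter

    corpus = "<s> the cat in the hat sat on the mat </s>".split()
    bigrams = Counter(zip(corpus, corpus[1:]))      # counts of adjacent word pairs
    unigrams = Counter(corpus[:-1])                 # counts of left-hand contexts

    def p(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        prob = 1.0
        for prev, cur in zip(words, words[1:]):
            prob *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
        return prob

    print(p("the cat in the hat sat on the mat"))  # ~0.037: a 'natural' sentence
    print(p("the the the the the"))                # 0.0: 'the the' never occurs in the corpus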

13
Q

Model

A

A “simplified”, abstract representation of something, often in a computational form
e.g. tossing a coin
e.g. a probabilistic model for weather forecasting

14
Q

BOW

A

A Bag Of Words representation reduces each document to a bag (unordered multiset) of words

Variants: bag of terms, bag of tokens, bag of stems

Problems:
- Meaning is lost without order, e.g. negations
- Not all words are equally important
- Meaning is lost without context, which introduces ambiguities
- Doesn’t work for all languages

It is, however, efficient

Possible refinements: skip stop words, or rank/weight words based on a metric
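
A minimal BOW sketch showing both the representation and the lost-order problem (example sentences are my own):

    from collections import Counter

    bow = Counter("the cat did not sit on the mat".split())
    print(bow)
    # Counter({'the': 2, 'cat': 1, 'did': 1, 'not': 1, 'sit': 1, 'on': 1, 'mat': 1})

    # Word order is gone, so moving the negation gives the exact same bag
    print(Counter("the cat did sit on the not mat".split()) == bow)  # True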

15
Q

Zipf’s Law

A

Frequency of any word in a given collection is inversely proportional to its rank in the frequency table
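
In other words, frequency(r) is roughly C / r for rank r, so rank * frequency should stay roughly constant; a quick check on hypothetical counts:

    freqs = [1000, 520, 340, 250, 198]            # assumed counts for ranks 1..5
    print([rank * f for rank, f in enumerate(freqs, start=1)])
    # [1000, 1040, 1020, 1000, 990]  <- roughly constant, as Zipf's law predicts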

16
Q

Luhn’s hypothesis

A

Words with a frequency below a low cut-off are rare and therefore do not contribute significantly to the content of the article.

Words exceeding an upper cut-off are considered too common.
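
A sketch of applying both cut-offs to word counts (the counts and thresholds are made up for illustration):

    from collections import Counter

    counts = Counter("the the the the data data model model model rare".split())

    low_cut, high_cut = 1, 3    # hypothetical frequency cut-offs
    significant = {w for w, c in counts.items() if low_cut < c <= high_cut}
    print(significant)          # 'data' and 'model' survive; 'the' is too common, 'rare' too rare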

17
Q

Vector representation

A

Represent a document as a vector by introducing a vocabulary (the set of terms left after pre-processing)

The document is represented by a |V|-dimensional vector, where each entry is a weighted representation of a specific term in the vocabulary

Millions of dimensions
Most values are 0 - very sparse
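
Because of the sparsity, such vectors are usually stored as term-to-weight mappings rather than full arrays; a tiny sketch (vocabulary and counts are made up):

    vocabulary = ["cat", "dog", "hat", "mat", "zebra"]   # |V| = 5 here; millions in practice
    doc_counts = {"cat": 2, "hat": 1}                    # sparse representation: non-zero entries only

    dense = [doc_counts.get(term, 0) for term in vocabulary]
    print(dense)  # [2, 0, 1, 0, 0]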

18
Q

Vector Representation Weights

A

Incidence (0 or 1)

Frequency
- Almost all documents have many determiners
- Rare terms more descriptive than frequent terms
- Documents have different lengths which affects the counts
- Raw term frequency is not what we want
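
A small comparison of incidence, raw frequency, and length-normalised frequency (the document is a made-up example):

    from collections import Counter

    doc = "the cat sat on the mat with the other cat".split()
    counts = Counter(doc)

    incidence = {w: 1 for w in counts}                        # 0/1 weights
    raw_tf    = dict(counts)                                  # raw term frequency
    rel_tf    = {w: c / len(doc) for w, c in counts.items()}  # normalised for document length

    print(raw_tf["the"], rel_tf["the"])  # 3 0.3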
