Text Preprocessing Flashcards

1
Q

Corpus

A

A computer-readable collection of text or speech.

2
Q

Lemma

A

A set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.

3
Q

Word-form

A

The full inflected or derived form of the word.

4
Q

Word type

A

Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.

5
Q

Word token

A

Tokens are the total number N of running words.
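The type/token distinction in the two cards above can be sketched in a few lines of Python; the sentence used here is an illustrative example, not from any particular corpus.

```python
# Count word tokens (N) and word types (|V|) in a tiny corpus,
# using naive whitespace tokenization.
text = "they lay back on the San Francisco grass and looked at the stars"

tokens = text.split()   # running words
types = set(tokens)     # distinct word forms ("the" counts once)

N = len(tokens)         # number of tokens
V = len(types)          # vocabulary size |V|
print(N, V)             # 13 tokens, 12 types
```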

6
Q

Herdan’s Law

A

The larger the corpus, the more word types we find.

|V| = kN^β

The value of β depends on the corpus size and the genre, but typically ranges from .67 to .75.

A.k.a Heaps’ Law
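The sublinear growth that Herdan's law predicts can be seen numerically; the values of k and β below are illustrative, not fitted to any real corpus.

```python
# Herdan's (Heaps') law: |V| = k * N^beta.
# k and beta are hypothetical; beta is chosen from the typical .67-.75 range.
k, beta = 10, 0.7

def expected_types(n_tokens: int) -> float:
    """Predicted vocabulary size for a corpus of n_tokens running words."""
    return k * n_tokens ** beta

# Ten times the tokens yields only about 10**0.7 ~ 5x the types.
print(expected_types(1_000_000) / expected_types(100_000))
```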

7
Q

datasheet or data statement

A

Specifies properties of a dataset, like:

  • Motivation
  • Situation
  • Language variety
  • Speaker demographics
  • Collection process
  • Annotation process
  • Distribution
8
Q

Datasheet properties

Motivation

A

Why was the corpus collected, by whom, and who funded it?

9
Q

Datasheet properties

Situation

A

When and in what situation was the text written / spoken?

E.g. was there a task? Was the language originally spoken conversation, edited text, social media communication, monologue vs dialogue?

10
Q

Datasheet properties

Language variety

A

What language (including dialect / region) was the corpus in?

11
Q

Datasheet properties

Speaker demographics

A

What was, e.g., the age or gender of the authors of the text?

12
Q

Datasheet properties

Collection process

A

How big is the data?

If it is a subsample how was it sampled?

Was the data collected with consent?

How was the data pre-processed, and what metadata is available?

13
Q

Datasheet properties

Annotation process

A

What are the annotations, what are the demographics of the annotators, how were they trained, how was the data annotated?

14
Q

Datasheet properties

Distribution

A

Are there copyright or other intellectual property restrictions?

15
Q

3 common tasks associated with Text Normalisation

A
  1. Tokenizing (segmenting) words
  2. Normalizing word formats
  3. Segmenting sentences
16
Q

Tokenization

A

The task of segmenting running text into words.
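A minimal tokenizer can be sketched with a regular expression; this is an illustrative sketch only, far simpler than the tokenizers used in practice. It keeps clitic contractions like we're together and splits off punctuation.

```python
import re

def tokenize(text: str) -> list[str]:
    # \w+(?:'\w+)?  matches a word, optionally followed by a clitic ('re, 's)
    # [^\w\s]       matches a single punctuation character
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("They said we're done."))
# → ['They', 'said', "we're", 'done', '.']
```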

17
Q

Clitic contractions

A

Contractions marked by apostrophes,

e.g. what're, we're

18
Q

Clitic

A

A part of a word that can’t stand on its own, and can only occur when it is attached to another word.

we're, j'ai, l'homme

19
Q

Subwords

A

Subwords can be arbitrary substrings, or they can be meaning-bearing units like the morphemes -est or -er.

20
Q

Morpheme

A

The smallest meaning-bearing unit of a language.

e.g. the word unlikeliest has the morphemes un-, likely, and -est.

21
Q

2 parts of most tokenization schemes

A

A token learner and a token segmenter.

22
Q

token learner

A

takes a raw training corpus (sometimes roughly separated into words, e.g. by whitespace) and induces a vocabulary, a set of tokens.

23
Q

token segmenter

A

takes a raw test sentence and segments it into the tokens in the vocabulary.

24
Q

Byte-pair encoding algorithm

A

A token learner.

It begins with a vocabulary that is just the set of all individual characters.

It then examines the training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’, ‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent ‘A’ ‘B’ in the corpus with the new ‘AB’.

It continues to count and merge, creating new longer and longer character strings, until k merges have been done creating k novel tokens.

k is thus a parameter of the algorithm.

The resulting vocabulary consists of the original set of characters plus k new symbols.
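The merge loop described above can be sketched as a short token learner; the toy corpus below is illustrative, and real BPE implementations track word frequencies from much larger data.

```python
from collections import Counter

def bpe_learn(corpus: list[str], k: int) -> list[tuple[str, str]]:
    """Learn k BPE merges, starting from single characters."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    words = Counter(tuple(w) + ("_",) for w in corpus)
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Toy corpus: 5x "low", 2x "lower", 6x "newest", 3x "widest".
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(bpe_learn(corpus, 3))
# first three merges: ('e','s'), then ('es','t'), then ('est','_')
```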

25
Q

word normalisation

A

The task of putting words / tokens in a standard format, choosing a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh.

26
Q

lemmatization

A

the task of determining that two words have the same root, despite their surface differences.

e.g. am, are and is have the shared lemma be.

dinner and dinners both have the lemma dinner.
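A toy lemma lookup makes the idea concrete; real lemmatizers do full morphological analysis, and the table below is a hypothetical hand-built fragment covering only the examples above.

```python
# Hypothetical lookup table; real lemmatizers use morphological parsing.
LEMMAS = {"am": "be", "are": "be", "is": "be", "dinners": "dinner"}

def lemmatize(word: str) -> str:
    # Fall back to the lowercased word when no lemma is listed.
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Am", "are", "is", "dinners", "dinner"]])
# → ['be', 'be', 'be', 'dinner', 'dinner']
```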

27
Q

Morphology

A

The study of the way words are built up from smaller meaning-bearing units called morphemes.

28
Q

2 broad classes of morphemes

A

“stems” - the central morpheme of the word, supplying the main meaning

affixes - adding “additional” meanings of various kinds.

29
Q

Stemming

A

A naive version of morphological analysis.

This mainly consists of chopping off word-final affixes.

30
Q

Porter Algorithm

A

A simple and efficient way to do stemming, stripping off affixes.

It does not have high accuracy but may be useful for some tasks.
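A crude suffix stripper shows the flavor of stemming; this sketch is far simpler than the Porter algorithm (which applies ordered rewrite rules), and its suffix list is illustrative only.

```python
# Illustrative suffix list; the real Porter algorithm uses staged
# rewrite rules, not a flat strip list.
SUFFIXES = sorted(["ing", "sses", "ies", "es", "s", "ed"], key=len, reverse=True)

def stem(word: str) -> str:
    for suf in SUFFIXES:
        # Require a stem of at least 3 characters to avoid over-stripping.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

print(stem("walking"), stem("cats"))  # → walk cat
```

Note the low accuracy the card mentions: chopping affixes without linguistic rules both over-strips ("ponies" → "pon") and under-strips irregular forms.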

31
Q

Minimum edit distance

A

The minimum edit distance between two strings is defined as the minimum number of editing operations (e.g. insertion, deletion, substitution) needed to transform one string into another.

We can also assign a weight / cost to each of these operations. Levenshtein distance is the simplest, with each of the operations having a cost of 1.
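The standard dynamic-programming solution for Levenshtein distance (each operation costing 1) can be sketched as follows.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: insertion, deletion, substitution each cost 1."""
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                           # delete all of source[:i]
    for j in range(m + 1):
        d[0][j] = j                           # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # → 5
```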
