Words Flashcards

1
Q

What are some problems with natural language?

A

There is lots of ambiguity from identical word forms

There is also dependency on punctuation or intonation

2
Q

What types of texts exist?

A

Formal News

Polemic News (argumentative)

Speech

Historic, Poetic, Musical

Social Media

3
Q

What is a sentence?

A

A unit of written language

4
Q

What is an utterance?

A

It is a unit of spoken language

5
Q

What is a word form?

A

It is the inflected form as it appears in the corpus

6
Q

What is a lemma?

A

It is an abstract form shared by word forms that have the same stem, part of speech (POS), and word sense

7
Q

What are function words?

A

They indicate the grammatical relationship between terms but carry little topical meaning

8
Q

What are types?

A

The number of distinct words in a corpus

9
Q

What are tokens?

A

The total number of running words in the corpus; every occurrence counts, including repeats
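The type/token distinction can be sketched in a couple of lines of Python (toy corpus, illustrative only):

```python
# Toy corpus: six running words, five distinct words.
words = "the cat sat on the mat".split()

tokens = len(words)      # tokens: every occurrence counts
types = len(set(words))  # types: distinct words only

print(tokens, types)  # 6 5 ("the" appears twice)
```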

10
Q

What are some lexical analysis steps we can take?

A

Stripping punctuation

Folding case

Removing function words

Lemmatising and stemming text

Taking an index for each of the words
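The steps above can be sketched as a small Python pipeline; the function-word list and the suffix-stripping rule are illustrative placeholders, not a real stopword list or stemmer:

```python
import re

# Toy function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "on", "is", "and"}

def lexical_analysis(text):
    text = text.lower()                                     # case folding
    text = re.sub(r"[^\w\s]", " ", text)                    # strip punctuation
    words = text.split()
    words = [w for w in words if w not in FUNCTION_WORDS]   # drop function words
    words = [re.sub(r"(ing|ed|s)$", "", w) for w in words]  # crude stemming
    return {w: i for i, w in enumerate(dict.fromkeys(words))}  # index per word

print(lexical_analysis("The cats sat on the mats, purring."))
# {'cat': 0, 'sat': 1, 'mat': 2, 'purr': 3}
```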

11
Q

How much of text is function words?

A

They account for up to 60 percent of text

12
Q

What does repetition signal?

A

It signals intention

13
Q

What do wordclouds provide?

A

They provide visual representation of statistical summary

14
Q

What is tokenization?

A

It is the process of turning a stream of characters into a sequence of words

15
Q

What is a token?

A

A token is a lexical construct that can be assigned grammatical and semantic roles

16
Q

What is a naive solution to tokenization?

A

Break on spaces and punctuation. This is too simple for the general case: punctuation is a useful piece of information for parsers and helps to indicate sentence boundaries, so it should not simply be discarded

17
Q

What are some tokenization issues?

A

Punctuation can be internal, for example in abbreviations, prices, times, etc.

We can have multiword expressions (New York, rock ‘n’ roll)

Clitic contractions (we’re, I’m, etc.)

Numeric expressions

18
Q

What tokenization method is better than splitting on spaces and punctuation?

A

Pattern tokenization

19
Q

What is pattern tokenization?

A

It is the use of regular expressions or other pattern matching styles to define tokenization rules
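A pattern tokenizer can be sketched with Python's `re` module; each alternative below is one rule, tried in order. The rules are illustrative, not a production rule set:

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \$?\d+(?:[.,:]\d+)*   # prices, times, numeric expressions
  | \w+(?:'\w+)?          # words, incl. clitic contractions (I'm)
  | [^\w\s]               # any other punctuation as its own token
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("I'm paying $9.99 at 10:30, okay?"))
# ["I'm", 'paying', '$9.99', 'at', '10:30', ',', 'okay', '?']
```

Note how "$9.99" and "10:30" survive as single tokens instead of being split on their internal punctuation.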

20
Q

What are some problems with pattern tokenization?

A

Rules will be corpus specific and probably overfitted to achieve accuracy for a very specific task

21
Q

Besides splitting on punctuation/spaces and pattern tokenization, what is another way to tokenize text?

A

We can learn common patterns from the corpus itself, or a similar training corpus

Words can be split into subword units (morphemes, significant punctuation)

22
Q

How can we split words?

A

Lemmatization is a way of identifying words that have different surface forms but share the same root

Words are composed of subword units called morphemes

Morphemes are word parts with recognised meanings; a word consists of a stem plus affixes (stems + affixes)

23
Q

What is stemming?

A

Stemming is a lemmatization process that removes suffixes by applying a sequence of rewrites
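A toy suffix-stripping stemmer in this spirit can be sketched as an ordered sequence of rewrite rules; the rules below are a tiny illustrative subset, not the real Porter rule set:

```python
import re

# Ordered rewrite rules: the first matching rule is applied.
RULES = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$",  "i"),   # ponies   -> poni
    (r"ing$",  ""),    # purring  -> purr
    (r"ed$",   ""),    # plastered -> plaster
    (r"s$",    ""),    # cats     -> cat
]

def stem(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(stem("caresses"), stem("ponies"), stem("hopping"), stem("cats"))
# caress poni hopp cat
```

The crude output for "hopping" ("hopp") shows why such rules need careful ordering and conditions in a real stemmer.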

24
Q

What is a popular stemmer?

A

The Porter Stemmer

25
Q

What are some issues with the Porter Stemmer?

A

It is very language specific and is very crude

26
Q

What is BPE?

A

Byte-Pair Encoding

27
Q

How does BPE work?

A

It starts with a vocabulary consisting of the set of individual characters and tries to learn new tokens. This is repeated k times, examining the corpus each time.

28
Q

What happens each time BPE repeats?

A

It selects the two symbols that are most frequently adjacent. (A, B)

It adds the new merged symbol to the vocabulary (AB)

It replaces every adjacent A B in the corpus with the new AB
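These merge steps can be sketched in Python; the corpus below is a toy word-frequency table with '_' as an end-of-word marker:

```python
from collections import Counter

def merge(word, pair):
    """Replace every adjacent occurrence of pair (A, B) in word with AB."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def bpe(corpus, k):
    """corpus maps each word (a tuple of symbols ending in '_') to its frequency."""
    vocab = {sym for word in corpus for sym in word}
    for _ in range(k):
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]  # most frequently adjacent pair (A, B)
        vocab.add(a + b)                   # add merged symbol AB to the vocabulary
        corpus = {merge(w, (a, b)): f for w, f in corpus.items()}
    return vocab

corpus = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
          ("n", "e", "w", "e", "r", "_"): 6, ("w", "i", "d", "e", "r", "_"): 3}
print(sorted(bpe(corpus, 3)))
```

With this toy corpus the first two merges produce 'er' and then 'er_', since 'e r' is the most frequent adjacent pair.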

29
Q

What happens when BPE is run with thousands of merges on a very large corpus?

A

It represents most words as full symbols, and only the rare and unknown words will have to be represented by their parts

30
Q

Explain what the image shows

A

It shows BPE in progress. We start with the vocabulary of individual characters plus an end-of-word marker. BPE then looks for the most common pair of adjacent tokens, which is 'e' followed by 'r'. We create the merged token 'er', rewrite every occurrence of that pair as the merged token, and then repeat the process. This continues until we have enough symbols

31
Q

What are some alternatives to BPE?

A

WordPiece - is based on an n-gram language model, using multiple adjacent units as single tokens

SentencePiece - extends these ideas into a simple and language-independent text tokenizer

32
Q

How do we compute if words are similar?

A

Levenshtein Distance

Minimum Edit Distance

33
Q

How does Levenshtein distance work?

A

It is the length of the shortest sequence of single-character edits (insertions, deletions, substitutions) that transforms one string into another. In the image, we make 3 substitutions, a delete, and an insert, so we have a total of 5 edits

34
Q

How does Minimum Edit Distance work?

A

It works by writing down the set of edits needed to go from one word to another. Its applications include plagiarism analysis, alignments in parallel corpora, and similarities between different generations of software updates

35
Q

How can the Levenshtein metric be calculated?

A

It can be calculated recursively
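A recursive sketch in Python (memoised, since the naive recursion recomputes subproblems; each insert, delete, and substitute costs 1):

```python
from functools import lru_cache

def levenshtein(a, b):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j  # insert the remaining j characters of b
        if j == 0:
            return i  # delete the remaining i characters of a
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,          # deletion
                   d(i, j - 1) + 1,          # insertion
                   d(i - 1, j - 1) + cost)   # substitution (or free match)
    return d(len(a), len(b))

print(levenshtein("intention", "execution"))  # 5
```

The "intention"/"execution" pair gives 5, matching the 3 substitutions, 1 delete, and 1 insert counted above.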