Words Flashcards
(35 cards)
What are some problems with natural language?
There is a lot of ambiguity, since identical word forms can carry different meanings
Meaning also depends on punctuation (in writing) or intonation (in speech)
What types of texts exist?
Formal News
Polemic News (argumentative)
Speech
Historic, Poetic, Musical
Social Media
What is a sentence?
A unit of written language
What is an utterance?
It is a unit of spoken language
What is a word form?
It is the inflected form as it appears in the corpus
What is a lemma?
It is an abstract form shared by word forms that have the same stem, part of speech (POS), and word sense
What are function words?
Words that indicate the grammatical relationships between terms but carry little topical meaning
What are types?
They are the distinct word forms in a corpus; the number of types is the size of the vocabulary
What are tokens?
They are all the running words in a corpus, counted with repetition; the number of tokens is the total length of the text in words
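A minimal Python sketch of the type/token distinction (the example sentence is an illustrative assumption):

text = "the cat sat on the mat"
tokens = text.split()           # all running words: 6 tokens
types = set(tokens)             # distinct word forms: 5 types ("the" repeats)
print(len(tokens), len(types))  # -> 6 5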
What are some lexical analysis steps we can take?
Stripping punctuation, case folding, removing function words, lemmatising and stemming the text, and building an index of the words
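A rough Python sketch of such a pipeline; the function-word list and the one-rule stemmer are simplified stand-ins, not real resources:

import string

FUNCTION_WORDS = {"the", "a", "an", "of", "on", "is", "and"}  # tiny illustrative list

def lexical_pipeline(text):
    text = text.lower()                                                 # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))    # strip punctuation
    tokens = [t for t in text.split() if t not in FUNCTION_WORDS]       # drop function words
    stems = [t[:-1] if t.endswith("s") else t for t in tokens]          # crude stemming stand-in
    index = {w: i for i, w in enumerate(sorted(set(stems)))}            # index each distinct word
    return stems, index

print(lexical_pipeline("The cats sat on the mats."))
# (['cat', 'sat', 'mat'], {'cat': 0, 'mat': 1, 'sat': 2})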
How much of text is function words?
They account for up to 60 percent of text
What does repetition signal?
It signals intention
What do wordclouds provide?
They provide a visual representation of a statistical summary of the text, e.g. word frequencies
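A minimal sketch using the third-party wordcloud package (the package, the input file name and the image sizes are assumptions):

from wordcloud import WordCloud   # pip install wordcloud

text = open("corpus.txt").read()                          # hypothetical input corpus
cloud = WordCloud(width=800, height=400).generate(text)   # word size reflects word frequency
cloud.to_file("wordcloud.png")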
What is tokenization?
It is the process of turning a stream of characters into a sequence of words
What is a token?
A token is a lexical construct that can be assigned grammatical and semantic roles
What is a naive solution to tokenization?
Break on spaces and punctuation. This is too simple for the general case: punctuation is a useful piece of information for parsers and helps to indicate sentence boundaries, so it should not simply be thrown away
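A minimal sketch of the naive approach with Python's re module, showing the damage it does (the example sentence is an assumption):

import re

text = "Dr. Smith paid $3.50 in New York; we're late!"
tokens = re.findall(r"\w+", text)   # keep runs of letters/digits, drop everything else
print(tokens)
# ['Dr', 'Smith', 'paid', '3', '50', 'in', 'New', 'York', 'we', 're', 'late']
# 'Dr.' loses its period, '$3.50' is split apart, 'New York' is separated, "we're" breaks badly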
What are some tokenization issues?
Punctuation can be internal, for example abbreviations, prices, times etc
We can have multiword expressions (New York, rock ‘n’ roll)
Clitic contractions (we’re, I’m, etc.)
Numeric expressions
What tokenization method is better than splitting on spaces and punctuation?
Pattern tokenization
What is pattern tokenization?
It is the use of regular expressions or other pattern matching styles to define tokenization rules
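A sketch of pattern tokenization with a regular expression; the rule set below is one illustrative possibility, not a standard:

import re

pattern = r"""(?x)           # verbose mode
      (?:[A-Z]\.)+           # abbreviations such as U.S.A.
    | \$?\d+(?:[.,]\d+)*%?   # prices, numbers, percentages
    | \w+(?:'\w+)?           # words, optionally with a clitic contraction
    | \.\.\.                 # ellipsis
    | [.,;!?]                # punctuation kept as its own token
"""

text = "The U.S.A. price rose to $3.50; we're surprised!"
print(re.findall(pattern, text))
# ['The', 'U.S.A.', 'price', 'rose', 'to', '$3.50', ';', "we're", 'surprised', '!']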
What are some problems with pattern tokenization?
The rules will be corpus-specific and probably overfitted, achieving accuracy only for a very specific task
Besides splitting on punctuation/spaces and pattern tokenization, what is another way to tokenize text?
We can learn common patterns from the corpus itself, or a similar training corpus
Words can be split into subword units (morphemes, significant punctuation)
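A toy sketch of learning merges from a corpus in the style of byte-pair encoding (the corpus and merge count are illustrative assumptions; real subword tokenizers are considerably more involved):

from collections import Counter

def learn_merges(corpus_words, num_merges=5):
    vocab = Counter(tuple(w) for w in corpus_words)      # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():                 # count adjacent symbol pairs
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]                # most frequent pair becomes a new unit
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():                 # rewrite every word with the merge applied
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_merges(["lower", "lowest", "newer", "newest"]))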
How can we split words?
Lemmatization is a way of identifying words that have different surface forms but share the same root
Words are composed of subword units called morphemes
Morphemes are word parts with recognised meanings: a stem can have affixes attached to it (stems + affixes)
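A short sketch using NLTK's WordNetLemmatizer (this assumes NLTK is installed and the WordNet data has been downloaded):

from nltk.stem import WordNetLemmatizer   # requires: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # 'mouse' (noun is the default POS)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'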
What is stemming?
Stemming is a lemmatization process that removes suffixes by applying a sequence of rewrite rules
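A toy illustration of suffix rewrite rules (a deliberately simplified assumption, nowhere near a complete stemmer):

SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:   # apply the first rewrite whose suffix matches
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["caresses", "ponies", "hopping", "cats"]])
# ['caress', 'poni', 'hopp', 'cat']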
What is a popular stemmer?
The Porter Stemmer
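A minimal usage sketch with NLTK's implementation (assumes NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["caresses", "ponies", "relational", "running"]])
# ['caress', 'poni', 'relat', 'run']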