Week 7: Text Mining Flashcards
Language Models
Key components include: characters, strings, language (set of strings), corpus (collection of one or more texts)
n-gram
A contiguous sequence of n items, e.g. characters
n-gram Character Model
A probability distribution of n-character sequences using a Markov chain of order n-1.
P(c_1,…,c_n) = \prod_{i=1}^n P(c_i \mid c_{i-2}, c_{i-1})
Language Identification Problem
Identify the language that a text (corpus) is written in.
One approach is to build a 3-gram model and find the most probable language L^{} such that
L^{} = \arg \underset{L}{\max}\left{ P(L \mid c_1,…,c_n)\right}
Spelling Correction
Identify and fix errors in word spelling. Trigrams are often used for this.
Genre Classification
Identify category of text (e.g., news, novel, blogs, etc.). Trigrams are often used.
Smoothing
The process of assigning a non-zero probability to sequences that don’t appear in the training corpus.
Linear Interpolation Smoothing
Combining unigrams, bigrams, and trigrams using linear interpolation:
\hat{P}(c_i \mid c_{i-2}, c_{i-1}) = \lambda_1 P(c_i) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_3 P(c_i \mid c_{i-2}, c_{i-1})
Word
Sequence of characters
Vocabulary
Set of valid words in a language.
Out-of-vocabulary Words
Words with characters of the language that aren’t valid.
N-gram Word Models
They model sequences of words rather than characters.
As n increases, n-gram models tend to predict better, but the training overhead becomes more significant.
Bayesian Approach
Using spam as an example, a new msg can be classified using Bayes rule:
\underset{c \in \left{ spam, \text{¬spam} \right} }{\arg \max} P(c \mid msg) = \underset{c \in \left{ spam, \text{¬spam} \right} }{\arg \max} P(msg \mid c) P(c)
Using priors,
P(spam) = \frac{\left\lvert spam \right\rvert}{\left\lvert spam \right\rvert + \left\lvert \text{¬spam} \right\rvert}
Term-document Matrix
A 2-D matrix with frequencies of words in a collection of documents. Rows correspond to documents, and columns correspond to terms.
Topic Analysis
Identify topics in text.
Sentiment Analysis
Identify the polarity in text (i.e. positive, negative, or neutral). Can also classify based on emotion (i.e. anger, disgust joy).
Information Extraction
The process of acquiring knowledge by skimming tests and looking for special types of objects as well as relationships between them.
The pipeline is as follows:
1. Tokenisation
2. Complex words
3. Basic groups
4. Complex phrases
5. Structure merging
Tokenisation
Segment the text into tokens.
Complex Words
Recognise complex words such as collocations (i.e. “set up”, “joint venture”) or names (“Great Britain”, “Harry Potter”) appearing as combinations of lexical entries.
For example, company names can be recognised with the regex: Capitalised Word + (“Co”|”Inc”|”Ltd”)
Basic Groups
Chunk the text into units as following:
NG = Noun phrase or group
VG = Verb phrase or group
PR = Preposition
Complex Phrases
Define rules for complex relationships. For example, the following rule describes the creation of a joint venture:
Company + SetUpJointVenture((Company)*)
Structure Merging
Merges terms referring to the same entity.
Regular Expression (Regex)
Defines a search pattern, regex R is associated with a set of L(R) of strings.
Definitions:
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
. Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.
[ ] A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z].
The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].
[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. Likewise, literal characters and ranges can be mixed.
$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
( ) Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires ( ).
\n Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard.[35] Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode.
* Matches the preceding element zero or more times. For example, abc matches “ac”, “abc”, “abbbc”, etc. [xyz] matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.
{m,n} Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”. This is not found in a few older instances of regexes. BRE mode requires {m,n}.
Examples:
.at
matches any three-character string ending with “at”, including “hat”, “cat”, “bat”, “4at”, “#at” and “ at” (starting with a space).
[hc]at
matches “hat” and “cat”.
[^b]at matches all strings matched by .at except “bat”.
[^hc]at
matches all strings matched by .at other than “hat” and “cat”.
^[hc]at
matches “hat” and “cat”, but only at the beginning of the string or line.
[hc]at$
matches “hat” and “cat”, but only at the end of the string or line.
[.] matches any single character surrounded by “[” and “]” since the brackets are escaped, for example: “[a]”, “[b]”, “[7]”, “[@]”, “[]]”, and “[ ]” (bracket space bracket).
s.* matches s followed by zero or more characters, for example: “s”, “saw”, “seed”, “s3w96.7”, and “s6#h%(»>m n mQ”.
Dimensionality
This is a major challenge in probabilistic modelling. As the number of dimensions grows, building models get tougher.