CZ4034 Flashcards
(125 cards)
What is term-document incidence matrix?
Limitation? How to improve?
One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)
Matrix of the terms will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s
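The "record only the 1s" idea can be sketched in a few lines of Python (the term and incidence row are illustrative, taken from the card's example):

```python
# Sparse representation of a term-document incidence row:
# keep only the docIDs where the bit is 1, drop all the 0s.
incidence = {"anthony": [1, 1, 0, 1, 0, 0]}  # "anthony" appears in docs 1, 2, 4

def to_postings(row):
    """Convert a 0/1 incidence row to a sorted list of (1-indexed) docIDs."""
    return [i + 1 for i, bit in enumerate(row) if bit]

postings = {term: to_postings(row) for term, row in incidence.items()}
# postings["anthony"] -> [1, 2, 4]
```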
Structured vs. Unstructured Data
Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)
What is inverted index?
For each term t, store a list of all documents (docID) containing t
Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” appears in a new doc 14, 14 must be inserted between 6 and 16 (LIST MUST STAY SORTED)
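A minimal sketch of the sorted insertion (the index contents come from the card's "Caesar" example; function names are illustrative):

```python
import bisect

def add_doc(index, term, doc_id):
    """Insert doc_id into term's postings list, keeping the list sorted."""
    postings = index.setdefault(term, [])
    if not postings or postings[-1] < doc_id:
        postings.append(doc_id)          # common case: docIDs arrive in order
    elif doc_id not in postings:
        bisect.insort(postings, doc_id)  # out-of-order docID, e.g. doc 14

index = {"caesar": [2, 4, 5, 6, 16, 57]}
add_doc(index, "caesar", 14)
# index["caesar"] -> [2, 4, 5, 6, 14, 16, 57]
```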
Query Optimization: What is the best order for query processing?
Process terms in order of increasing document frequency
for AND, the smallest postings set should be processed first, because any docID not in the smallest set can be ignored in every later step
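Both ideas, the two-pointer merge of sorted postings and the rarest-first processing order, can be sketched as follows (a hedged illustration, not the course's reference code):

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_query(postings_lists):
    """AND all terms, processing in order of increasing postings size."""
    ordered = sorted(postings_lists, key=len)  # rarest term first
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result
```

Starting with the smallest list keeps every intermediate result at most that small, so later merges do less work.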
What is Biword Index?
Solution 1 to Phrase Queries:
Index every consecutive pair of terms in the text as a phrase
Ex. “Friends, Romans, Countrymen” generates the biwords:
“friends romans” and “romans countrymen”
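Generating the biwords is just pairing each token with its successor (a minimal sketch reproducing the card's example):

```python
def biwords(tokens):
    """Every consecutive pair of tokens, joined as one phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

biwords(["friends", "romans", "countrymen"])
# -> ["friends romans", "romans countrymen"]
```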
What is Positional Index?
Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear
<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>
angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};
How to process a phrase query using positional index? (2)
ex. “to be or not to be”
- Extract the inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions matching “to be or not to be”
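The two steps can be sketched with a positional index stored as `{term: {docID: [positions]}}` (the index layout and tiny example data are assumptions for illustration):

```python
def phrase_docs(index, phrase):
    """Return docIDs where the phrase's terms occur at consecutive positions.
    index format: {term: {docID: [sorted positions]}}."""
    terms = phrase.split()
    # Step 1: docs that contain every distinct term
    common = set(index[terms[0]])
    for t in terms[1:]:
        common &= set(index[t])
    # Step 2: within each doc, keep start positions where all terms line up
    hits = []
    for d in sorted(common):
        starts = set(index[terms[0]][d])
        for offset, t in enumerate(terms[1:], start=1):
            # term t must appear exactly `offset` positions after the start
            starts &= {p - offset for p in index[t][d]}
        if starts:
            hits.append(d)
    return hits

tiny_index = {
    "to": {1: [1, 5, 30], 2: [7]},
    "be": {1: [2, 31], 2: [1]},
}
phrase_docs(tiny_index, "to be")  # -> [1]
```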
Problem with positional index? Despite this, why is it still standardly used?
- Expands postings storage substantially
- Standardly used because of power and usefulness of phrase and proximity queries
Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?
Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)
Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?
M = kT^b
M is size of vocabulary and T is # of tokens in the collection
The plot suggests that for the first documents, vocabulary size increases quickly. However, the more docs are added, the fewer new distinct terms appear, so vocabulary growth slows down (b < 1 means sub-linear growth)
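A quick sketch of the formula; the constants k = 44, b = 0.49 are assumptions here (values often quoted for the Reuters RCV1 collection), not part of the law itself:

```python
def vocab_size(T, k=44, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T**b for T tokens."""
    return k * T ** b

# Because b < 1, doubling the token count multiplies the vocabulary
# by only 2**0.49 (about 1.4x) -- the slowing growth seen in the plot.
```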
Zipf’s Law and consequences?
the ith most frequent term has frequency proportional to 1/i
- given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
- thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
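The proportionality above amounts to simple arithmetic (a one-line sketch):

```python
def zipf_freq(f1, rank):
    """Zipf's law: if the top term occurs f1 times,
    the rank-i term occurs about f1 / i times."""
    return f1 / rank

# top word 1000x -> 2nd word ~500x, 3rd word ~333x
```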
What are fixed-width terms in dictionary? Why is it wasteful?
Setting an array of fixed-width for a term/word entry
Wasteful bcs most terms don’t need that much space (like one-letter words), while some super-long words may need even more space than the fixed width allows
Explain what is “Dictionary as string with pointers to every term”. Problem?
Storing the dictionary as one long string of characters (instead of a table). The pointer to the next word marks the end of the current word.
But it becomes harder to locate individual words within the dictionary string
What is blocking in dictionary?
Storing dictionary as long string of characters
where a pointer is stored only to every kth term string (term lengths are stored inline to separate the different dictionary entries)
Ex. 7systile9syzygetic8syzygial…
What is blocking+front coding in dictionary?
In addition to blocking, stores consecutive entries that share a common prefix
Ex. 8automata8automate9automatic10automation becomes 8automat*a1<>e2<>ic3<>ion
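A small sketch that encodes one block exactly as in the card's example (the `*` and `<>` markers follow the card's notation; the function name is illustrative):

```python
import os

def front_code(terms):
    """Front-code one block: write the shared prefix once (ended by '*'),
    then only the length and suffix of each following term."""
    prefix = os.path.commonprefix(terms)
    out = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out += f"{len(suffix)}<>{suffix}"
    return out

front_code(["automata", "automate", "automatic", "automation"])
# -> "8automat*a1<>e2<>ic3<>ion"
```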
What is Postings Compression?
Storing gaps between consecutive docIDs instead of the docIDs themselves, bcs gaps need fewer bits
Ex. docIDs 283042, 283043 can be stored as 283042 followed by gap 1
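Encoding and decoding the gaps is a short exercise (a sketch; real systems then compress the small gap values with variable-length codes):

```python
def to_gaps(doc_ids):
    """First docID stays as-is; the rest become differences to the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover the original docIDs by a running sum over the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

to_gaps([283042, 283043, 283044, 283060])
# -> [283042, 1, 1, 16]
```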
What is Tokenization?
Give example of language issues
Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)
French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right
What are Stopwords?
Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be
Give examples of cases where stopwords should not be removed (3)
- Phrase queries (King of Denmark)
- Song titles.. etc (Let it be, To be or not to be)
- Relational queries (flights to London)
What is Normalization? Examples
A mapping that maps all the possible spelling permutations to the single standard (canonical) form that you are going to save in the dictionary
“Normalize” words in indexes text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30
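The first two equivalence classes can be sketched with the standard library (an illustration that also folds case, which strictly belongs to the next card; date normalization like 7月30日 vs. 7/30 would need extra rules):

```python
import unicodedata

def normalize(term):
    """Map spelling variants to one canonical form:
    strip periods, remove accents, lowercase."""
    term = term.replace(".", "")                     # U.S.A. -> USA
    term = unicodedata.normalize("NFKD", term)       # split accented chars
    term = "".join(c for c in term if not unicodedata.combining(c))
    return term.lower()

normalize("U.S.A.")  # -> "usa"
normalize("résumé")  # -> "resume"
```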
What is Casefolding? Issues?
Reduce all letters to lowercase
It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish
What is Lemmatization?
A type of normalization that does a ‘proper’ reduction to dictionary headword / base form
Ex. am, are, is -> be
car, cars, car’s, cars’ -> car
What is Stemming? Problem?
Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat
It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour
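A deliberately crude suffix chopper shows both the idea and the problem; this is an illustration only (real systems use rule sets like the Porter stemmer), with the suffix list chosen to reproduce the card's examples:

```python
# Suffixes to chop, longest-ish first; an assumption for this sketch.
SUFFIXES = ("ion", "ish", "ic", "ed", "es", "y")

def crude_stem(word):
    """Chop the first matching suffix, leaving at least 3 characters."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

# automates / automatic / automation -> "automat"  (intended merging)
# flourish -> "flour", slimy -> "slim"             (meaning flipped!)
```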
Lemmatization vs. Stemming when to use which
In general, if accuracy is still high enough, use stemming (bcs very powerful, reduces the dictionary much more)
If stemming is not good for the analysis (in cases where the POS tag is important), use lemmatization