CZ4034 Flashcards

(125 cards)

1
Q

What is term-document incidence matrix?
Limitation? How to improve?

A

One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)

The term-document matrix will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s

2
Q

Structured vs. Unstructured Data

A

Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)

3
Q

What is inverted index?

A

For each term t, store a list of all documents (docID) containing t

Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” in new doc 14, must add 14 between 6 and 16 (LIST MUST BE SORTED)
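A minimal Python sketch (not from the course materials) of building an inverted index with sorted postings; the sample documents and function name are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # store postings as sorted lists so new docIDs can be merged in order
    return {term: sorted(postings) for term, postings in index.items()}

docs = {2: "caesar was ambitious", 4: "brutus killed caesar", 14: "caesar spoke"}
print(build_inverted_index(docs)["caesar"])  # [2, 4, 14]
```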

4
Q

Query Optimization: What is the best order for query processing?

A

Process in order of increasing frequency
for AND, start with the smallest postings list (i.e. the least frequent term), because anything not in that smallest set can never appear in the final intersection
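A small sketch of the idea, assuming sorted postings lists; the merge routine and sample lists are illustrative, not the course's reference code.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists (AND of two terms)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(postings_lists):
    """Process terms in order of increasing postings length (document frequency)."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result

print(and_query([[2, 4, 8, 16, 32], [1, 2, 3, 5, 8, 13], [2, 8]]))  # [2, 8]
```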

5
Q

What is Biword Index?
Ex. “Friends, Romans, Countrymen”

A

Solution 1 to Phrase Queries:
Index every consecutive pair of terms in text as a phrase
The example will generate biwords:
“friends romans” and “romans countrymen”

6
Q

What is Positional Index?

A

Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear

<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>

angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};

7
Q

How to process a phrase query using positional index? (2)
ex. “to be or not to be”

A
  1. Extract the inverted index entries for each distinct term: to, be, or, not
  2. Merge their doc:position list to enumerate all positions with “to be or not to be”
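A hedged sketch of these two steps in Python, assuming the positional index is stored as term -> {docID: [positions]}; the index data and function name are made up.

```python
def phrase_query(phrase, pos_index):
    """Return docIDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    # 1. extract postings and keep only docs containing every term
    docs = set(pos_index[terms[0]])
    for t in terms[1:]:
        docs &= set(pos_index[t])
    # 2. merge position lists: the i-th term must sit at start + i
    hits = []
    for d in docs:
        for start in pos_index[terms[0]][d]:
            if all(start + i in pos_index[t][d] for i, t in enumerate(terms)):
                hits.append(d)
                break
    return sorted(hits)

pos_index = {
    "to":  {1: [0, 4], 3: [9]},
    "be":  {1: [1, 5], 3: [2]},
    "or":  {1: [2]},
    "not": {1: [3]},
}
print(phrase_query("to be or not to be", pos_index))  # [1]
```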
8
Q

Problem with positional index? Despite this, why is it still standardly used?

A
  • Expands postings storage substantially
  • Standardly used because of power and usefulness of phrase and proximity queries
9
Q

Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?

A

Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)

10
Q

Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?

A

M = kT^b
M is size of vocabulary and T is # of tokens in the collection

The plot suggests that the vocabulary grows quickly over the first documents added to the collection. But the more docs that are added, the fewer new distinct terms each one contributes, so the vocabulary keeps growing ever more slowly (sub-linearly in T)
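A quick worked example. The parameter values k = 44 and b = 0.49 are commonly cited for English corpora; they are an assumption here, not taken from the cards.

```python
# Heaps' law: M = k * T^b  (M = vocabulary size, T = number of tokens)
k, b = 44, 0.49          # typical values reported for English text (assumed)
for T in (10_000, 1_000_000, 100_000_000):
    M = k * T ** b
    print(f"T = {T:>11,}  ->  estimated vocabulary M = {M:,.0f}")
# M keeps growing with T, but ever more slowly because b < 1
```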

11
Q

Zipf’s Law and consequences?

A

the ith most frequent term has frequency proportional to 1/i

  • given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
  • thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
12
Q

What are fixed-width terms in dictionary? Why is it wasteful?

A

Allocating a fixed-width array entry for every term/word in the dictionary

Wasteful because most terms do not need that much space (e.g. one-letter words), while some very long words may need even more

13
Q

Explain what is “Dictionary as string with pointers to every term”. Problem?

A

Storing dictionary as long string of characters (instead of table). Pointer to next word shows end of current word.
But difficult to find those words in the dictionary

14
Q

What is blocking in dictionary?

A

Storing the dictionary as a long string of characters
where a pointer is stored only to every kth term (term lengths are stored to separate the different entries within a block)
Ex. 7systile9syzygetic8syzygial…

15
Q

What is blocking+front coding in dictionary?

A

In addition to blocking, consecutive entries that share a common prefix are stored with the prefix written only once (front coding)
Ex. 8automata8automate9automatic10automation becomes 8automat*a1<>e2<>ic3<>ion

16
Q

What is Postings Compression?

A

Storing the gaps between consecutive docIDs instead of the docIDs themselves, because gaps are small numbers that need fewer bits

Ex. docIDs 283042, 283043 can be stored as 283042 followed by a gap of 1
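A minimal sketch of gap encoding and decoding; the docIDs and helper names are made up.

```python
def encode_gaps(doc_ids):
    """Store the first docID, then the differences between consecutive docIDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def decode_gaps(gaps):
    """Rebuild the original docIDs by accumulating the gaps."""
    doc_ids = [gaps[0]]
    for g in gaps[1:]:
        doc_ids.append(doc_ids[-1] + g)
    return doc_ids

postings = [283042, 283043, 283044, 283154]
gaps = encode_gaps(postings)          # [283042, 1, 1, 110]
assert decode_gaps(gaps) == postings  # the small gaps can then be stored in fewer bits
```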

17
Q

What is Tokenization?
Give example of language issues

A

Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)

French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right

18
Q

What are Stopwords?

A

Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be

19
Q

Give examples of cases where stopwords should not be removed (3)

A
  1. Phrase queries (King of Denmark)
  2. Song titles.. etc (Let it be, To be or not to be)
  3. Relational queries (flights to London)
20
Q

What is Normalization? Examples

A

A mapping from all the possible spelling variants of a word to a single canonical (standard) form that is stored in the dictionary

“Normalize” words in indexes text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30

21
Q

What is Casefolding? Issues?

A

Reduce all letters to lowercase

It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish

22
Q

What is Lemmatization?

A

A type of normalization that does a ‘proper’ reduction to dictionary headword / base form

Ex. am, are, is -> be
car, cars, car’s, cars’ -> car

23
Q

What is Stemming? Problem?

A

Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat

It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour

24
Q

Lemmatization vs. Stemming when to use which

A

In general, if accuracy is still high enough, use stemming (because it is very aggressive and reduces the dictionary much more)

If stemming is not good for the analysis (in cases where the POS tag is important), use lemmatization

25
Lemmatization Pro and Con
+ Improves concept extraction through text normalization (noun/verb inflection elimination). Ex. eats_burger, ate_burger, eat_burgers, eat_the_burger
- May lead to misunderstanding. Ex. park -> parking, develop -> developing/developed, kick_ball -> kick_balls
26
What is the most popular algorithm for stemming English?
Porter's algorithm: a set of rules that replace the ending of a word with something else *(example in tutorial)*. It doesn't work all the time, but it is at least as good as other stemming options
27
What are Wild-card queries? How to retrieve (data structure)?
Queries where you don't get a specific word/phrase from the user; instead you get a query containing *
Ex. mon* -> find all docs with a word beginning with "mon"
Easy retrieval with a binary tree (or B-tree)
28
What is Permuterm index?
An easy way to handle wildcard queries by adding rotations
Ex. hello -> hello$, ello$h, llo$he, lo$hel, o$hell
(will help find * at the front, end, or middle of the text)
29
Problems of permuterm query processing (2)
1. Quadruples lexicon size (more memory)
2. Takes a long time (expensive query execution)
30
2 Principal Uses of Spell Correction
- correcting documents being indexed (mapping between correct word in dictionary and misspelled word in document is done correctly)
- correcting user queries to retrieve "right" answers (whatever you do on dictionary, you must do on query)
31
2 Main Flavors of Spell Correction
1. Isolated word: calculate the distance between the misspelled word and correct words to decide what should replace it
2. Context-sensitive: look at surrounding words to decide if a word is correct or not (statistics, machine learning)
32
3 disadvantages of spell correction and suggest an alternative
1. Retrieving documents indexed by correct spelling is computationally expensive
2. May lead to wrong normalization
3. May produce too many alternatives
Instead, can suggest alternative queries with correct spelling: "Did you mean: ...." Only search if the user clicks on the alternative query
33
Edit Distance (Levenshtein)
The minimum # of operations (insert, delete, replace, transposition) to convert one string to another
Ex. from dof to dog is 1
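A standard dynamic-programming sketch of edit distance with insert/delete/replace; the transposition operation (as in Damerau-Levenshtein) is left out for brevity. Not the course's reference code.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and replacements to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all remaining chars of s
    for j in range(n + 1):
        d[0][j] = j                               # insert all chars of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # replacement (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
```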
34
Weighted edit distance
Adding weights to common mistakes (OCR errors, keyboard errors, etc.): replacing certain letters may carry a higher cost than others
35
Problems with Edit Distance. Suggestion
Expensive and slow to compute the edit distance to every dictionary term
Use n-gram overlap instead (ex. trigrams):
november - (nov, ove, vem, emb, mbe, ber)
december - (dec, ece, cem, emb, mbe, ber)
3/6 of the trigrams overlap -> normalize using the Jaccard coefficient
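A short sketch reproducing the november/december example with character trigrams and the Jaccard coefficient; the helper names are made up.

```python
def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov))                  # ['ber', 'emb', 'mbe', 'nov', 'ove', 'vem']
print(sorted(nov & dec))            # ['ber', 'emb', 'mbe']  -> 3 shared trigrams
print(round(jaccard(nov, dec), 2))  # 0.33 (3 shared out of 9 distinct trigrams)
```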
36
Jaccard distance vs. coefficient
Jaccard distance (D) = how different 2 sets are
Jaccard coefficient (J) = how similar 2 sets are
D and J always add up to 1 (or 100%)
37
What is Soundex? Is it useful in IR?
Translate any word/token into a 4-character reduced form
Not very useful for IR (many words that are different end up having the same code)
38
Pro and Cons of Boolean Search
+ Good for expert users with precise understanding of their needs and the collection
- Not good for majority of users as they may be incapable of writing Boolean queries (may be too much work)
- Either too few or too many results (and not ranked on relevancy)
39
What is the meaning of Score in ranked retrieval?
A number between [0,1] that measures how well document and query 'match'
40
Issues with Jaccard for scoring
- It doesn’t consider term frequency
- Rare terms in a collection are more informative than frequent terms, but Jaccard doesn’t consider document frequency
41
What is Term Frequency tf_t,d?
How many times a term t occurs in a document d
42
What is log frequency weight of term t in d?
w_t,d = 1 + log tf_t,d if tf_t,d > 0, and 0 otherwise
Used to dampen the growth of the number of occurrences
43
What is Document Frequency df_t?
The number of documents that contain term t (measure of popularity)
Frequent terms (ex. stopwords) are less informative than rare terms
We want a high weight for rare terms as they are likely to be relevant to the query
44
What is idf (inverse document frequency)? Does it affect the ranking of docs for queries with 1 term?
idf_t = log (N/df_t); the log is used to dampen the effect of idf
idf has no effect on the ranking for one-term queries; it affects the ranking of documents only for queries with at least 2 terms
45
What is collection frequency of t?
The number of occurrences of t in the collection, counting multiple occurrences
46
What is tf-idf weighting and its score?
w_t,d = (1 + log tf_t,d) x log(N/df_t)
Score(q,d) = sum of tf-idf_t,d over terms t present in both the query q and the document d
The score will need to be normalized so that it lies in [0,1]
47
What is the Vector Space Model? What are the axes? How to get proximity?
A model that represents documents and queries as vectors in high-dimensional space Terms are axes of the space Rank documents according to their (angle) proximity to the query in this space
48
What is the cosine similarity for query q and document d?
The dot product of the length-normalized vectors q and d
numerator = dot product of q and d
denominator = sqrt(sum of squares of q) x sqrt(sum of squares of d)
49
Vector Space Ranking steps (5)
1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user
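A simplified Python sketch of these five steps over a toy corpus. It assumes base-10 logs, (1 + log tf) x idf weighting for both query and documents, and cosine via length-normalized dot products; the corpus, query and function names are made up.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """Length-normalized tf-idf vector: (1 + log tf) * log(N / df)."""
    tf = Counter(tokens)
    vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def rank(query, docs, K=10):
    N = len(docs)
    df = Counter(t for text in docs.values() for t in set(text.split()))
    doc_vecs = {d: tfidf_vector(text.split(), df, N) for d, text in docs.items()}
    q_vec = tfidf_vector(query.split(), df, N)
    scores = {d: sum(w * v.get(t, 0.0) for t, w in q_vec.items())  # cosine similarity
              for d, v in doc_vecs.items()}
    return sorted(scores.items(), key=lambda x: -x[1])[:K]         # top K by score

docs = {1: "caesar was ambitious", 2: "brutus killed caesar", 3: "the noble brutus"}
print(rank("brutus caesar", docs, K=2))
```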
50
Limitations of Vector Space Model (2)
1. The vector space model tells us what is similar to what BUT doesn’t tell us what is what
2. Only implements one kind of reasoning (analogical reasoning) but there are many other kinds of reasoning (e.g., inductive and deductive reasoning)
51
Common problems of vector space and how to solve
Sparsity; matrix too big or too tall
Solution: Dimensionality Reduction (feature selection, TSVD, random projections)
52
What is truncated singular value decomposition TSVD?
1. Factorize the matrix
2. Discard some singular values from the singular value matrix (start from the bottom, as it is ranked by importance)
3. Rebuild the matrix
Note that we are projecting the original space into a smaller one where relative distances/angles are approximately preserved
53
What are the 4 ways for optimized cosine ranking?
To improve top K document retrieval:
1. Index Elimination
2. Champion Lists
3. Static Quality Scores
4. Cluster Pruning
54
What is Index Elimination? Explain the 2 types
Before starting the dot product calculation, decide which documents to do the dot product with (docs containing at least one query term), or advanced:
a.) only consider **high-idf query terms** (don't consider stop words as they contribute little to the scores)
b.) only consider docs containing **many query terms** = only compute scores for docs containing several of the query terms (ex. at least 3 out of 4) -> these docs will have a high cosine score
55
What are champion lists? Advantage
**Precompute**, for each term t in the dictionary, the r docs of highest weight (most relevant) in t's postings = only consider docs with **high weights for each term**
This is done prior to the query (at index build time) and only done once, which saves a lot of time at query time
56
Problem of champion list
Since r must be pre-determined at index build time, it can happen that r < K at query time. E.g. if K is set to 10 (present the top 10 relevant documents to the user) but the champion lists only yield 9 candidate documents
57
What are examples of static quality scores (2)?
1. Relevance (modeled by cosine scores)
2. Authority (a query-independent property: a paper with many citations, a Twitter account with many followers)
We can model authority by assigning a quality score g(d) in [0,1]
58
What is net score?
net-score(q,d) = g(d) + cosine(q,d)
where g(d) is authority and cosine(q,d) is relevance
Other linear combinations are possible
59
What are high and low lists?
For each term, maintain 2 postings lists called **high** (champion list) and **low**
When traversing postings on a query, only traverse the high lists first
- If we get more than K docs, select the top K and stop
- Else proceed to get docs from the low lists (docs with lower weight, used in case r < K)
60
What is cluster pruning?
Preprocessing: pick a set of leaders (randomly or using heuristics); every other doc becomes a follower of its nearest leader
1. Given query Q, find its nearest leader L
2. Seek the K nearest docs from among L's followers
Only need to consider the **leader and its followers**
61
What are field/parametric indexes?
Postings for each field value, to allow the user to search for specific parameters of the metadata (similar to searching folders)
The metadata may contain several fields (ex. author, title, language, year of publication, etc.)
62
What are zone indexes?
A zone is a region of the doc that can contain an arbitrary amount of text (ex. title, abstract, body, references)
Build inverted indexes on zones as well to *allow the user to query for a specific zone*
Ex. term.abstract / term.title / term.author
63
What are tiered indexes?
Break postings up into a hierarchy of lists (most important (tier 1), ..., least important)
If not enough documents are retrieved from tier 1, go down each tier
Ex. tier 1 could be docs where 3/4 of the query terms appear, tier 2 could be 2/4, tier 3 could be 1/4, etc.
64
What are query parsers?
1. Run the query as a phrase query ('rising interest rates')
2. If fewer than K docs match, run the constituent 2-term phrase queries ('rising interest', 'interest rates')
3. If still fewer than K docs, run the query as a free-text (vector space) query and rank the matching docs by vector space scoring
65
What is precision?
Fraction of retrieved docs that are relevant
P(relevant|retrieved) = tp/(tp+fp)
66
What is recall?
Fraction of relevant docs that are retrieved
P(retrieved|relevant) = tp/(tp+fn)
67
What is F measure (weighted harmonic mean)? (unranked retrieval evaluation)
Combined measure that assesses the precision/recall tradeoff (sweet spot)
F1 = 2PR / (P+R)
68
What is Precision@K (P@K)? (ranked retrieval evaluation)
Set a rank threshold K
Compute % relevant in top K
Ignores documents ranked lower than K
Ex. relevance [yes, no, yes, no, yes]: P@1 = 1, P@2 = 1/2, P@3 = 2/3, P@4 = 2/4, P@5 = 3/5
69
What is MAP (Mean Average Precision)? Assumption? (ranked retrieval evaluation)
Average precision = average of P@K
MAP = average precision across multiple queries/rankings
MAP is macro-averaging: each query counts equally
MAP assumes the user is interested in finding many relevant documents for each query
70
What is Discounted Cumulative Gain? (ranked retrieval evaluation)
Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
DCG_p = r_1 + r_2/log 2 + r_3/log 3 + ... + r_p/log p = r_1 + sum from i=2 to p of (r_i / log i)
71
What is Normalized Discounted Cumulative Gain (nDCG) at rank n? (ranked retrieval evaluation)
nDCG = DCG/IDCG at rank n
The ideal ranking (IDCG) is obtained by sorting the documents from the highest relevance level down, then calculating the DCG
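A small sketch of DCG and nDCG. It assumes base-2 logarithms (the cards leave the base unspecified) and the graded relevance scores in `ranking` are made up.

```python
import math

def dcg(rels):
    """r_1 + sum over i >= 2 of r_i / log2(i)."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    ideal = sorted(rels, reverse=True)   # ideal ranking: highest relevance first
    return dcg(rels) / dcg(ideal)

ranking = [3, 2, 3, 0, 1, 2]             # graded relevance of the returned docs, top first
print(round(dcg(ranking), 3), round(ndcg(ranking), 3))
```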
72
2 problems with the rank-based measures?
1. Need annotators (humans): humans are expensive, inconsistent, decaying in value as the documents/query mix evolves, and not always representative of "real users"
2. Recall is difficult to measure on the web
73
What is Inter-judge (dis)agreement?
Have more than one person labeling the data and calculate what is the agreement between the two using Kappa Score
74
What is the formula for Kappa measure?
Kappa = [ P(A) - P(E) ] / [ 1 - P(E) ]
where P(A) is the proportion of time the judges agree and P(E) is what agreement would be by chance
P(E) = P(nonrelevant)^2 + P(relevant)^2
Kappa = 1 for total agreement, 0 for chance agreement
- If above 0.8, considered good agreement
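A worked example of the formula with hypothetical judgment counts (two judges, 400 documents); all the numbers below are made up for illustration.

```python
# Hypothetical judgments by two judges over 400 documents:
# both say relevant: 300, both say non-relevant: 80,
# judge 1 relevant / judge 2 non-relevant: 10, the reverse: 10
agree, total = 300 + 80, 400
p_a = agree / total                                   # P(A) = 0.95
p_rel = (300 + 10 + 300 + 10) / (2 * total)           # pooled marginal P(relevant)
p_nonrel = 1 - p_rel
p_e = p_rel ** 2 + p_nonrel ** 2                      # P(E), agreement by chance
kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 3), round(p_e, 3), round(kappa, 3))  # kappa = 0.857 > 0.8 -> good agreement
```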
75
What is relevance feedback?
User feedback on relevance of docs in the initial set of results
– User issues a (short, simple) query
– The user marks some results as relevant or non-relevant
– The system computes a better representation of the information need based on feedback
– Relevance feedback can go through one or more iterations
76
What is the Rocchio algorithm? Formula
It uses the vector space model to pick a relevance feedback query
The new (optimal) query moves toward relevant documents (positive feedback) and away from irrelevant documents (negative feedback)
qm = alpha x q0 + beta x A - gamma x B
A = averaged positive feedback vector, B = averaged negative feedback vector
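A minimal sketch of the Rocchio update using numpy; the alpha/beta/gamma values, the toy vectors, and the clipping of negative weights to zero are assumptions for illustration, not prescribed by the cards.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """qm = alpha*q0 + beta*(centroid of relevant) - gamma*(centroid of non-relevant)."""
    A = np.mean(relevant, axis=0) if len(relevant) else np.zeros_like(q0)
    B = np.mean(nonrelevant, axis=0) if len(nonrelevant) else np.zeros_like(q0)
    qm = alpha * q0 + beta * A - gamma * B
    return np.maximum(qm, 0.0)   # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 1.0, 0.0])                       # initial query vector
rel = np.array([[1.0, 1.0, 0.5, 0.0], [0.8, 0.9, 0.0, 0.0]])  # marked relevant docs
nonrel = np.array([[0.0, 0.0, 0.0, 1.0]])                 # marked non-relevant doc
print(rocchio(q0, rel, nonrel))
```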
77
What is relevance feedback useful for?
Relevance feedback is most useful for increasing recall in situations where recall is important
Positive feedback is more valuable than negative (set γ < β)
78
Assumptions of relevance feedback Rocchio algorithm (2)
A1: User has sufficient knowledge for the initial query
A2: Relevance prototypes are “well-behaved”
- Term distribution in relevant documents will be similar
- Term distribution in non-relevant documents will be different from those in relevant documents
79
4 problems of Rocchio algorithm (Relevance Feedback)
1. The 2 big assumptions A1 and A2
2. Requires a lot of time from the user to select relevant documents; users are often reluctant to provide explicit feedback
3. Natural language is ambiguous ("break" has multiple meanings)
4. Inefficient: high cost, partial solution; users would prefer to revise and resubmit the query rather than give relevance feedback every time
80
What is Thesaurus-based query expansion? (global method = using external source)
For each term t in a query, expand the query with synonyms and related words of t from the thesaurus
- Generally increases recall but may decrease precision
- High cost of manually producing a thesaurus
81
Supervised vs. Unsupervised learning
Supervised learning uses labeled training data, and unsupervised learning does not
82
Naïve Bayes classifier 1. assumption 2. P(cj) 3. P(xi|cj) with smoothing
1. Conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities: P(X1,...,Xn|C) = P(X1|C) x P(X2|C) x ... x P(Xn|C)
2. P(cj) = N(C=cj) / N = # of docs belonging to class cj / # of docs in training data
3. P(xi|cj) = [N(Xi=xi, C=cj) + 1] / [N(C=cj) + |V|] = [# of occurrences of word xi in docs belonging to class cj + 1] / [# of words in docs belonging to class cj + size of vocab]
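A minimal multinomial Naive Bayes sketch following the prior and add-one-smoothed likelihood formulas above; the function names and toy training data are for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label). Returns priors, likelihood fn, vocab."""
    class_docs = Counter(c for _, c in docs)
    priors = {c: n / len(docs) for c, n in class_docs.items()}        # P(cj)
    word_counts = defaultdict(Counter)
    for words, c in docs:
        word_counts[c].update(words)
    vocab = {w for words, _ in docs for w in words}
    def likelihood(w, c):                                             # P(xi|cj) with add-one
        return (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
    return priors, likelihood, vocab

def classify(words, priors, likelihood, vocab):
    scores = {c: math.log(p) + sum(math.log(likelihood(w, c)) for w in words if w in vocab)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = [("chinese beijing chinese".split(), "china"),
        ("chinese chinese shanghai".split(), "china"),
        ("chinese macao".split(), "china"),
        ("tokyo japan chinese".split(), "japan")]
model = train_nb(docs)
print(classify("chinese chinese chinese tokyo japan".split(), *model))  # -> 'china'
```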
83
Naive Bayes Pros and Cons (3 each)
Advantages:
– Fast learning
– Simple, easy to implement
– No curse of dimensionality
Disadvantages:
– Can perform poorly if assumptions do not hold
– Maximum likelihood can overfit data
– It is not semantics-aware; word order not taken into account
84
What is feature selection? How? 2 advantages?
- Select a subset of ‘relevant’ terms in the vocabulary (or features)
- How: hypothesis testing statistics using chi-square
- Reduces training time and can improve generalization (eliminates noise features, avoids overfitting)
85
1. What is x2 (chi-square) test? 2. What does it mean when you reject the hypothesis?
1. Used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories
2. Rejecting the hypothesis means 0.999 confidence (... > 10.83) that the class and the term are dependent, hence the term should be helpful as a feature
86
Formula for 2x2 chi-squared
chi^2 = [ N x (AD - CB)^2 ] / [ (A+C) x (B+D) x (A+B) x (C+D) ]
where A = #(t,c), B = #(t,!c), C = #(!t,c), D = #(!t,!c), N = A+B+C+D
t = term and c = class
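A small sketch computing the 2x2 chi-square statistic from the four counts; the example counts are made up.

```python
def chi_square(A, B, C, D):
    """A=#(t,c), B=#(t,not c), C=#(not t,c), D=#(not t,not c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# hypothetical counts for some term t and class c
value = chi_square(A=40, B=60, C=10, D=890)
print(round(value, 2), value > 10.83)   # above 10.83 -> dependent at 0.999 confidence
```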
87
Vector Space Classification vs. kNN
Vector space classification = classify by choosing the nearest centroid
kNN = compute the distance/similarity between the test point and all training points; assign the class of the closest/most similar point
88
4 advantages 3 disadvantages of kNN
+ No feature selection necessary
+ Scales well with large number of classes
+ No training necessary
+ Most cases more accurate than NB
- Small changes to one class can have ripple effect
- Scores can be hard to convert to probabilities
- May be expensive at test time
89
kNN vs. NB
kNN: high variance, low bias (memorizes, will always say 'no' to new object) (non-linear problem)
NB: low variance, high bias (lazy, will say 'yes' if object has certain feature) (due to assumption, linear separation)
90
What is support vector machine? What is the goal? Limitation?
Mapping the original feature space to some higher dimension where the training set is separable by a hyperplane
Goal: maximize the margin around the separating hyperplane (between the 2 classes)
Limitation: it is a binary classification algorithm
91
What is concept drift? One measure of text classification system? Is feature selection good in protecting against concept drift?
Categories change over time (a feature may have been good back then, but bad today)
One measure of a text classification system is how well it protects against concept drift
Feature selection is NOT good at protecting against concept drift
92
3 Issues with RNN (Recurrent Neural Networks)
- Struggle to handle large sequences of text (ex. when they reach the end of a paragraph, they forget the beginning)
- Hard to train (exploding gradient problem)
- Hard to parallelize as they process words sequentially (cannot speed up training with more GPUs)
93
2 Advantages of Transformers over RNN
- Transformers can be very efficiently parallelized (can train big models)
- They have attention, self-attention, positional encoding (easily identify concepts / what is important)
94
What is clustering?
Unsupervised learning process of grouping set of objects into classes of similar objects
95
Hard vs. Soft clustering
Hard clustering: each doc belongs to exactly one cluster
Soft clustering: a doc can belong to more than one cluster
96
K-means algorithm
1. Select K random docs as seeds (initial centroids)
2. Assign documents to each centroid to form clusters
3. Recalculate the centroid of each cluster
4. Repeat steps 2-3 until convergence or a stopping criterion is reached (centroids don't change / fixed # of iterations / etc.)
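A minimal k-means sketch over 2-D points using numpy, following the four steps above; the data points, convergence test, and function name are assumptions for illustration (empty-cluster handling is omitted).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]        # 1. random points as seeds
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                           # 2. assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. recompute
        if np.allclose(new, centroids):                         # 4. stop when centroids settle
            break
        centroids = new
    return labels, centroids

X = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8], [0.5, 0.5]])
print(kmeans(X, k=2))
```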
97
Issue with k-means (2)
1. Need to select initial centroids using a good heuristic
2. K, the number of clusters, is unknown; can end up with too few or too many clusters
98
Given a clustering, what is the Benefit and Total Benefit and Clustering Value
1. Benefit of a doc = cosine similarity to its centroid
2. Total Benefit = sum of the individual doc benefits
3. Clustering Value = Total Benefit - Total Cost, where Total Cost = cost per cluster * K
99
2 Strengths of Hierarchical Clustering
1. Do not have to assume a number of clusters (any desired k can be obtained by 'cutting' the dendrogram at that level)
2. May correspond to meaningful taxonomies
100
Agglomerative vs. Divisive clustering
Agglomerative starts with points as individual clusters; at each step, merge the closest pair until only one cluster is left
Divisive starts with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point
101
Hierarchical Agglomerative Clustering (HAC) Algorithm
1. Compute the proximity matrix, where each data point is a cluster
2. Merge the two closest clusters and update the proximity matrix
3. Repeat step 2 until only a single cluster remains
102
4 ways of defining closest pair of clusters
1. Single-link (most cosine-similar)
2. Complete-link (least cosine-similar)
3. Centroid
4. Average-link (average cosine between pairs of elements)
103
What do single-link and complete-link clustering produce?
Single-link clustering often produces long, straggly clusters
Complete-link clustering produces tighter clusters but is sensitive to outliers
104
What is good clustering? (2)
High quality clusters with
1. High intra-class similarity
2. Low inter-class similarity
The measured quality of a clustering depends on both the document representation and the similarity measure used
105
What is purity? Example: A cluster has 1 point in class A, 4 points in class B and 1 point in class C, what is the purity?
Ratio between the size of the dominant class in the cluster and the size of the cluster
Purity = max(1, 4, 1)/6 = 4/6
106
What is Rand index? Formula?
To compare the similarity of results between two different clustering methods
RI = (A+D) / (A+B+C+D)
where A is true positive (# of correctly identified pairs), B is false positive (# of bad cluster pairs), C is false negative (# of pairs that were supposed to be clustered together), D is true negative
107
What is discriminative labeling and non-discriminative labeling?
Discriminative labeling: find terms or phrases that distinguish a cluster from the other clusters (can use feature selection criteria like chi-square)
Non-discriminative labeling: select terms or phrases based solely on information from the cluster itself (ex. terms with high weights in the centroid, or alternatively use titles / use AI)
Non-discriminative methods sometimes select frequent terms that do not distinguish clusters
108
What is Goto?
A paid search ranking engine from 1996
Search ranking depended on how much you paid; there was an auction for keywords
Ads are ranked based on bids (the highest bidder gets first rank)
Maximizes revenue but has no relevance ranking
109
Trouble with paid search ads (Goto)
Advertisers are only charged when someone clicks on their ad. Hence if a user searches and sees an irrelevant ad, they won't click it, but it is still free publicity for the advertiser
110
What is Search Engine Optimization (SEO)?
"Tuning" your webpage to rank highly in the algorithmic search results for select keywords
111
What is keyword stuffing? What is the variant?
Dense repetitions of chosen terms (sometimes in the same color as the background so that they won't be visible to humans)
Variant: Meta-Tags (like Instagram hashtags), but people use them irresponsibly for free publicity
112
What is cloaking and doorway pages?
Cloaking: serve fake content (a different page) to the search engine spider/bot and serve the true one to a human visitor
Doorway page: a page optimized for a single keyword that redirects the visitor to the real target page
113
How can user's search query be categorized to satisfy their needs? (4)
1. Informational (want to learn about sth)
2. Navigational (want to go to that page)
3. Transactional (want to do something, ex. shop or download)
4. Gray areas
114
What is fingerprint? What can it be used to detect?
Fingerprint = a succinct (say 64-bit) digest of the characters in a doc; maps arbitrarily large data to a much shorter bit string
When the fingerprints of two web pages are equal, test whether the pages are identical (duplicates)
115
What is near-duplicate detection?
Detect near-duplicates (approximate match) by computing similarity using shingles (n-grams)
Similarity (= size of intersection / size of union) is estimated based on a short sketch
116
What any crawler must do (2)
1. Be polite
- explicit politeness: only crawl the portions that webmasters specify in robots.txt
- implicit politeness: even with no specification, avoid hitting any site too often
2. Be robust: be immune to spider traps and other malicious behavior from web servers
117
What is Google bomb
A search with "bad" results due to maliciously manipulated anchor text
Ex. a search for "evil empire" resulted in the Microsoft homepage as the top result
118
What is Page Rank?
Long-term visit rate (pages visited more often are more important)
PR(u) = SUM of PR(v)/Lv over all v in Bu
where Bu is the set of pages pointing to u and Lv is the number of outgoing links from page v (not counting duplicate links)
119
What are the steps of transition probability matrix? Purpose of teleportation rate?
1. If there is a hyperlink from page i to page j, then Aij = 1, otherwise Aij = 0
2. Divide each 1 in A by the number of 1’s in its row
3. Multiply the resulting matrix by 1 − α (α: teleporting rate)
4. Add α/N to every entry
Teleportation prevents the walk from getting stuck in a dead end, so the random walk is well-defined
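A sketch of these steps plus power iteration to find the long-term visit rate. The tiny link graph, α = 0.15, and the convention of treating a dead-end row as linking to every page are assumptions for illustration.

```python
import numpy as np

def pagerank(adj, alpha=0.15, iters=100):
    """adj[i][j] = 1 if page i links to page j; alpha is the teleporting rate."""
    A = np.array(adj, dtype=float)
    N = len(A)
    A[A.sum(axis=1) == 0] = 1.0                  # assumed fix: dead ends link everywhere
    P = A / A.sum(axis=1, keepdims=True)         # step 2: row-normalize the 1s
    P = (1 - alpha) * P + alpha / N              # steps 3-4: mix in teleportation
    x = np.full(N, 1.0 / N)                      # start from the uniform distribution
    for _ in range(iters):
        x = x @ P                                # power iteration -> long-term visit rate
    return x

adj = [[0, 1, 1],
       [1, 0, 1],
       [0, 0, 0]]                                # page 2 is a dead end
print(pagerank(adj).round(3))
```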
120
What are the 2 conditions of Ergodic Markov Chain? What is the Theorem?
* Irreducibility. Roughly: there is a path from any page to any other page
* Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially
* Theorem: for any ergodic Markov chain, there is a unique long-term visit rate for each state (steady-state probability distribution)
121
PageRank issues (2)
- Real surfers are not random surfers (back button, bookmarks, search are nonrandom)
- Simple PageRank produces bad results for many pages
122
What are 2 types of relevance? What makes them good?
- Relevance type 1: Hubs = a hub page has a good list of links to pages answering the information need (outlinks)
- Relevance type 2: Authorities = an authority page is a direct answer to the information need (inlinks)
- A good hub page for a topic **links to** many authority pages for that topic
- A good authority page for a topic **is linked to** by many hub pages for that topic
123
Steps for Hyperlink-induced topic search (HITS)
Compute for each page d in the base set a hub score h(d) and an authority score a(d)
1. Initialization: for all d: h(d) = 1, a(d) = 1
2. Iteratively update all scores: a(d) = sum of h over pages linking to d (inlinks), h(d) = sum of a over pages d links to (outlinks)
3. After convergence:
* Output pages with the highest h scores as top hubs
* Output pages with the highest a scores as top authorities
* So we output two ranked lists
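A sketch of the HITS iteration with per-step normalization (the normalization and the small link graph are assumptions; they keep the scores bounded but do not change the ranking).

```python
import numpy as np

def hits(adj, iters=50):
    """adj[i][j] = 1 if page i links to page j. Returns (hub, authority) scores."""
    A = np.array(adj, dtype=float)
    h = np.ones(len(A))                  # 1. initialization: h(d) = a(d) = 1
    a = np.ones(len(A))
    for _ in range(iters):
        a = A.T @ h                      # 2. authority = sum of hub scores over in-links
        h = A @ a                        #    hub = sum of authority scores over out-links
        a /= a.sum()                     # normalize so the scores stay bounded
        h /= h.sum()
    return h, a                          # 3. rank by h (top hubs) and by a (top authorities)

adj = [[0, 1, 1, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 0]]
print(hits(adj))
```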
124
PageRank vs. HITS
1. PageRank can be precomputed; HITS has to be computed at query time (sometimes too expensive)
2. They make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to
On the Web, a good hub almost always is also a good authority; the actual difference between PageRank and HITS ranking is not as large as one might expect
125
What is Information Retrieval?
Retrieving (getting back) information that is **relevant** to a specific query