CZ4034 Flashcards
(125 cards)
What is term-document incidence matrix?
Limitation? How to improve?
One-hot encoding or basically a 0/1 vector for each term/word
ex. “Anthony” - 110100 (appears in doc 1, 2, 4)
Matrix of the terms will be extremely sparse (far more 0s than 1s)
Better representation: only record the 1s
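The "record only the 1s" idea can be sketched in a few lines of Python (the term and incidence row are illustrative, taken from the card's example):

```python
# Sparse representation of a term-document incidence row:
# keep only the docIDs where the bit is 1, drop all the 0s.
incidence = {"anthony": [1, 1, 0, 1, 0, 0]}  # "anthony" appears in docs 1, 2, 4

def to_postings(row):
    """Convert a 0/1 incidence row to a sorted list of (1-indexed) docIDs."""
    return [i + 1 for i, bit in enumerate(row) if bit]

postings = {term: to_postings(row) for term, row in incidence.items()}
# postings["anthony"] -> [1, 2, 4]
```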
Structured vs. Unstructured Data
Structured data tends to refer to information in “tables”
Unstructured data is information that doesn’t lend itself to the kind of table formatting required by a relational database. Can be textual or non-textual (such as audio, video, and images)
What is inverted index?
For each term t, store a list of all documents (docID) containing t
Ex. “Caesar” - [2 4 5 6 16 57] (postings)
if “Caesar” appears in a new doc 14, 14 must be inserted between 6 and 16 (LIST MUST STAY SORTED)
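A minimal sketch of the sorted insertion (the index contents come from the card's "Caesar" example; function names are illustrative):

```python
import bisect

def add_doc(index, term, doc_id):
    """Insert doc_id into term's postings list, keeping the list sorted."""
    postings = index.setdefault(term, [])
    if not postings or postings[-1] < doc_id:
        postings.append(doc_id)          # common case: docIDs arrive in order
    elif doc_id not in postings:
        bisect.insort(postings, doc_id)  # out-of-order docID, e.g. doc 14

index = {"caesar": [2, 4, 5, 6, 16, 57]}
add_doc(index, "caesar", 14)
# index["caesar"] -> [2, 4, 5, 6, 14, 16, 57]
```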
Query Optimization: What is the best order for query processing?
Process terms in order of increasing document frequency
for AND, the smallest postings set should be processed first, because any docID not in the smallest set can be ignored in every later step
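Both ideas, the two-pointer merge of sorted postings and the rarest-first processing order, can be sketched as follows (a hedged illustration, not the course's reference code):

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_query(postings_lists):
    """AND all terms, processing in order of increasing postings size."""
    ordered = sorted(postings_lists, key=len)  # rarest term first
    result = ordered[0]
    for p in ordered[1:]:
        result = intersect(result, p)
    return result
```

Starting with the smallest list keeps every intermediate result at most that small, so later merges do less work.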
What is Biword Index?
Solution 1 to Phrase Queries:
Index every consecutive pair of terms in the text as a phrase
Ex. “Friends, Romans, Countrymen” generates the biwords:
“friends romans” and “romans countrymen”
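Generating the biwords is just pairing each token with its successor (a minimal sketch reproducing the card's example):

```python
def biwords(tokens):
    """Every consecutive pair of tokens, joined as one phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

biwords(["friends", "romans", "countrymen"])
# -> ["friends romans", "romans countrymen"]
```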
What is Positional Index?
Solution 2 to Phrase Queries:
In the postings, store (for each term) the position(s) in which tokens of it appear
<term: doc1: pos1, pos2 …; doc2: pos1, pos2 …;>
angels: 2: {35, 174}; 4: {12, 22}; 7: {17};
fools: 1: {1, 17, 74}; 4: {8, 78};
How to process a phrase query using positional index? (2)
ex. “to be or not to be”
- Extract the inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions matching “to be or not to be”
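The two steps can be sketched with a positional index stored as `{term: {docID: [positions]}}` (the index layout and tiny example data are assumptions for illustration):

```python
def phrase_docs(index, phrase):
    """Return docIDs where the phrase's terms occur at consecutive positions.
    index format: {term: {docID: [sorted positions]}}."""
    terms = phrase.split()
    # Step 1: docs that contain every distinct term
    common = set(index[terms[0]])
    for t in terms[1:]:
        common &= set(index[t])
    # Step 2: within each doc, keep start positions where all terms line up
    hits = []
    for d in sorted(common):
        starts = set(index[terms[0]][d])
        for offset, t in enumerate(terms[1:], start=1):
            # term t must appear exactly `offset` positions after the start
            starts &= {p - offset for p in index[t][d]}
        if starts:
            hits.append(d)
    return hits

tiny_index = {
    "to": {1: [1, 5, 30], 2: [7]},
    "be": {1: [2, 31], 2: [1]},
}
phrase_docs(tiny_index, "to be")  # -> [1]
```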
Problem with positional index? Despite this, why is it still standardly used?
- Expands postings storage substantially
- Standardly used because of power and usefulness of phrase and proximity queries
Lossless vs. Lossy compression?
What compression are the following steps: case folding, stopwords removal, stemming?
Lossless = all info is preserved
Lossy = some info discarded (case folding, stopwords removal, stemming)
Formula for Heaps’ Law?
What does the log-log plot of M vs. T suggest?
M = kT^b
M is size of vocabulary and T is # of tokens in the collection
The plot suggests that for the first documents, vocabulary size increases quickly. However, the more docs are added, the fewer new distinct terms appear, so vocabulary growth slows down (b < 1 means sub-linear growth)
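A quick sketch of the formula; the constants k = 44, b = 0.49 are assumptions here (values often quoted for the Reuters RCV1 collection), not part of the law itself:

```python
def vocab_size(T, k=44, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T**b for T tokens."""
    return k * T ** b

# Because b < 1, doubling the token count multiplies the vocabulary
# by only 2**0.49 (about 1.4x) -- the slowing growth seen in the plot.
```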
Zipf’s Law and consequences?
the ith most frequent term has frequency proportional to 1/i
- given a natural language corpus, the freq of any word is inversely proportional to its rank in the freq table
- thus the most freq word will occur approximately 2x as often as the 2nd most freq word, 3x as often as the 3rd most freq word, etc
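The proportionality above amounts to simple arithmetic (a one-line sketch):

```python
def zipf_freq(f1, rank):
    """Zipf's law: if the top term occurs f1 times,
    the rank-i term occurs about f1 / i times."""
    return f1 / rank

# top word 1000x -> 2nd word ~500x, 3rd word ~333x
```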
What are fixed-width terms in dictionary? Why is it wasteful?
Setting an array of fixed-width for a term/word entry
Wasteful bcs most terms don’t need that much space (like one-letter words), while some super-long words may need even more space than the fixed width allows
Explain what is “Dictionary as string with pointers to every term”. Problem?
Storing the dictionary as one long string of characters (instead of a table). The pointer to the next word marks the end of the current word.
But it becomes harder to locate individual words within the dictionary string
What is blocking in dictionary?
Storing dictionary as long string of characters
where a pointer is stored only to every kth term string (term lengths are stored inline to separate the different dictionary entries)
Ex. 7systile9syzygetic8syzygial…
What is blocking+front coding in dictionary?
In addition to blocking, stores consecutive entries that share a common prefix
Ex. 8automata8automate9automatic10automation becomes 8automat*a1<>e2<>ic3<>ion
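A small sketch that encodes one block exactly as in the card's example (the `*` and `<>` markers follow the card's notation; the function name is illustrative):

```python
import os

def front_code(terms):
    """Front-code one block: write the shared prefix once (ended by '*'),
    then only the length and suffix of each following term."""
    prefix = os.path.commonprefix(terms)
    out = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out += f"{len(suffix)}<>{suffix}"
    return out

front_code(["automata", "automate", "automatic", "automation"])
# -> "8automat*a1<>e2<>ic3<>ion"
```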
What is Postings Compression?
Storing gaps between consecutive docIDs instead of the docIDs themselves, bcs gaps need fewer bits
Ex. docIDs 283042, 283043 can be stored as 283042 followed by gap 1
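Encoding and decoding the gaps is a short exercise (a sketch; real systems then compress the small gap values with variable-length codes):

```python
def to_gaps(doc_ids):
    """First docID stays as-is; the rest become differences to the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover the original docIDs by a running sum over the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

to_gaps([283042, 283043, 283044, 283060])
# -> [283042, 1, 1, 16]
```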
What is Tokenization?
Give example of language issues
Breaking down a sequence of text into smaller parts (token = an instance of a sequence of characters)
French: L’ensemble (1 token or 2?)
German: noun compounds are not segmented
Chinese/Japanese: no space
Arabic: written right to left, but with numbers left to right
What are Stopwords?
Words to exclude from dictionary, the commonest words that have little semantic content
ex. the, a, and, to, be
Give examples of cases where stopwords should not be removed (3)
- Phrase queries (King of Denmark)
- Song titles.. etc (Let it be, To be or not to be)
- Relational queries (flights to London)
What is Normalization? Examples
A mapping that maps all the possible spelling permutations to the single standard (canonical) form that you are going to save in the dictionary
“Normalize” words in indexes text as well as query words into the same form
Define equivalence classes of terms:
- match U.S.A. and USA -> USA
- match résumé and resume (often best to normalize to a de-accented term)
- match 7月30日 and 7/30
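The first two equivalence classes can be sketched with the standard library (an illustration that also folds case, which strictly belongs to the next card; date normalization like 7月30日 vs. 7/30 would need extra rules):

```python
import unicodedata

def normalize(term):
    """Map spelling variants to one canonical form:
    strip periods, remove accents, lowercase."""
    term = term.replace(".", "")                     # U.S.A. -> USA
    term = unicodedata.normalize("NFKD", term)       # split accented chars
    term = "".join(c for c in term if not unicodedata.combining(c))
    return term.lower()

normalize("U.S.A.")  # -> "usa"
normalize("résumé")  # -> "resume"
```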
What is Casefolding? Issues?
Reduce all letters to lowercase
It is often best to lowercase everything but there may be some exceptions
ex. WHO vs who, Polish vs. polish
What is Lemmatization?
A type of normalization that does a ‘proper’ reduction to dictionary headword / base form
Ex. am, are, is -> be
car, cars, car’s, cars’ -> car
What is Stemming? Problem?
Crude affix chopping (reduce terms to their ‘roots’ before indexing) (language dependent)
Ex. automates, automatic, automation -> automat
It can completely flip meaning/polarity
Ex. stuffed-> stuff, slimy-> slim, flourish-> flour
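A deliberately crude suffix chopper shows both the idea and the problem; this is an illustration only (real systems use rule sets like the Porter stemmer), with the suffix list chosen to reproduce the card's examples:

```python
# Suffixes to chop, longest-ish first; an assumption for this sketch.
SUFFIXES = ("ion", "ish", "ic", "ed", "es", "y")

def crude_stem(word):
    """Chop the first matching suffix, leaving at least 3 characters."""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[: -len(s)]
    return word

# automates / automatic / automation -> "automat"  (intended merging)
# flourish -> "flour", slimy -> "slim"             (meaning flipped!)
```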
Lemmatization vs. Stemming when to use which
In general, if accuracy is still high enough, use stemming (bcs very powerful, reduces the dictionary much more)
If stemming is not good for the analysis (in cases where the POS tag is important), use lemmatization