W5 Neural IR 1 Flashcards

1
Q

Week 5 Outline

A
  • Term representation & Word Embeddings
  • BERT embeddings
  • Machine Learning for IR: learning to rank
  • Transformers for ranking (monoBERT)
2
Q

Comparison between one-hot/term vector spaces and embeddings?

A

Term Vector Spaces:
sparse, high-dimensional, observable, can be used for exact matching

Embeddings:
dense, lower-dimensional, latent, can be used for inexact matching

3
Q

masked language modelling

A

remove a word from a context and train a neural network to predict the masked word (probability distribution over all words in the vocabulary)

self-supervised: trained like a supervised model, but the labels come from the text itself, so we don’t need to provide them ourselves
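
a minimal sketch (not from the slides), assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

```python
# Predict a masked word as a probability distribution over the vocabulary.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# No manual labels are needed: the "label" is simply the word that was
# removed from the sentence (self-supervision).
for candidate in fill_mask("The inverted [MASK] maps terms to documents."):
    print(candidate["token_str"], round(candidate["score"], 3))
```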

4
Q

transformer architecture

A

sequence-to-sequence (encoder-decoder)

uses self-attention: computes strength of relation between pairs of input words (dot product)

can model long-distance term dependencies because the complete input is processed at once

quadratic complexity in the input length (every token attends to every other token)
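
a minimal NumPy sketch of the dot-product attention idea (simplified: single head, no separate query/key/value projections):

```python
import numpy as np

def self_attention(X):
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                    # dot product between every pair of tokens -> n x n
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax per row
    return weights @ X                               # each output mixes information from all inputs

X = np.random.randn(6, 8)                            # 6 tokens, 8-dimensional embeddings
print(self_attention(X).shape)                       # (6, 8); the n x n score matrix is the quadratic cost
```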

5
Q

BERT

Make sure you understand BERT and its applications in IR because the teacher really likes BERT

A

neural network model for generating contextual embeddings

Bidirectional Encoder Representations from Transformers

pre-training and fine-tuning: transfer learning

6
Q

machine learning for ranking

A

most straightforward:
1. learn a probabilistic classifier or regression model on query-document pairs with relevance labels
2. apply to unseen pairs and get a score for each query-document pair
3. rank documents per query by prediction score
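
a minimal sketch of these three steps with scikit-learn; the feature vectors and labels are made up for illustration (in practice: BM25 score, term overlap, ...):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. learn a probabilistic classifier on labelled query-document pairs
X_train = np.array([[2.1, 0.4], [0.3, 0.1], [1.8, 0.9], [0.2, 0.0]])  # features per (q, d) pair
y_train = np.array([1, 0, 1, 0])                                      # relevance labels
clf = LogisticRegression().fit(X_train, y_train)

# 2. score unseen (q, d) pairs for one query
X_test = np.array([[1.5, 0.7], [0.4, 0.2], [2.0, 0.1]])
scores = clf.predict_proba(X_test)[:, 1]

# 3. rank the candidate documents for this query by predicted score
print(np.argsort(-scores))
```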

7
Q

pointwise learning

A

learning the relevance value per query-document pair, then sorting by the predicted values

loss function: mean squared error, i.e. the average of the squared differences between the true and predicted scores

limitation: does not consider the relative ranking between items in the same list, only the absolute scores
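
a minimal sketch of the pointwise loss (scores are made up):

```python
import numpy as np

y_true = np.array([3.0, 1.0, 0.0])       # labelled relevance per (q, d) pair
y_pred = np.array([2.5, 1.5, 0.5])       # model scores
loss = np.mean((y_true - y_pred) ** 2)   # mean squared error
print(loss)                              # 0.25
```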

8
Q

pairwise learning

A

consider pairs of relevant and nonrelevant documents for the same query and minimize the number of incorrect inversions in the ranking

loss function: pairwise hinge loss
L_hinge = max(0, 1 - (score(q,d_i) - score(q,d_j)))
and sum over all pairs (d_i, d_j) with d_i more relevant than d_j

limitations: every document pair is treated as equally important, but misrankings in higher positions are more severe
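
a minimal sketch of the pairwise hinge loss summed over all mis-orderable pairs (scores and labels are made up):

```python
import numpy as np

scores = np.array([1.2, 0.9, 0.1])        # model scores for three documents of one query
labels = np.array([2, 1, 0])              # graded relevance labels

loss = 0.0
for i in range(len(scores)):
    for j in range(len(scores)):
        if labels[i] > labels[j]:         # d_i should be ranked above d_j
            loss += max(0.0, 1.0 - (scores[i] - scores[j]))
print(loss)                               # 0.7 + 0.0 + 0.2 = 0.9
```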

9
Q

ATTENTION!

Make sure you know how to calculate the pointwise and pairwise losses; it WILL be an exam question

A
10
Q

two-stage retrieval

A

use the embedding model to re-rank the top documents retrieved by a lexical IR model

step 1: lexical retrieval from the whole corpus with BM25 or LM
step 2: re-ranking of the top-n retrieved documents with a supervised BERT model
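
a minimal sketch of the two stages, assuming the rank_bm25 package for stage 1; monobert_score is a placeholder for a fine-tuned cross-encoder (see the monoBERT card):

```python
from rank_bm25 import BM25Okapi

corpus = ["neural ranking with transformers", "exact matching with bm25", "word embeddings for ir"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "transformer ranking"
top_n = bm25.get_top_n(query.split(), corpus, n=2)   # stage 1: lexical retrieval

def monobert_score(q, d):
    # placeholder scorer; in practice a BERT cross-encoder fine-tuned on relevance labels
    return float(len(set(q.split()) & set(d.split())))

reranked = sorted(top_n, key=lambda d: monobert_score(query, d), reverse=True)  # stage 2
print(reranked)
```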

11
Q

monoBERT

A

two-input classification (cross-encoder)
input: [CLS] q [SEP] d [SEP]

output is the representation of the [CLS] token
- used as input to a single-layer fully connected neural network
- followed by a softmax for the relevance classification
- this yields the probability that candidate d is relevant to q
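
a minimal sketch of monoBERT-style scoring with Hugging Face transformers; bert-base-uncased is a placeholder here, in practice the model would be fine-tuned on relevance-labelled (q, d) pairs:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder checkpoint, not fine-tuned for ranking
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

query = "what is two-stage retrieval"
doc = "two-stage retrieval first uses bm25, then re-ranks the top documents with bert"

# the tokenizer builds the [CLS] q [SEP] d [SEP] input for the cross-encoder
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
prob_relevant = torch.softmax(logits, dim=-1)[0, 1].item()  # softmax over {non-relevant, relevant}
print(prob_relevant)
```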

12
Q

BERT ranking and inference set-up

A

training:
1. start with pre-trained BERT model
2. fine-tune on query-document pairs with relevance assessments

inference:
1. retrieve 100-1000 documents with a lexical retrieval model
2. apply the trained monoBERT to all retrieved (q, d) pairs to obtain a score
3. for each query, rank the documents by this score

13
Q

BERT has a maximum input length of 512 tokens. How to deal with longer documents?

A

truncate the input texts, or split documents into passages => challenges:
- training: labels are given at the document level => unclear what to feed the model as training examples
- inference: we need to aggregate the per-passage scores into a document score
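
a minimal sketch of splitting a long document into overlapping passages (window and stride sizes are arbitrary choices for illustration):

```python
def split_into_passages(tokens, window=512, stride=256):
    # slide a fixed-size window over the token sequence with the given stride
    passages = []
    start = 0
    while True:
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return passages

doc_tokens = ["tok"] * 1200
print([len(p) for p in split_into_passages(doc_tokens)])   # [512, 512, 512, 432]
```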

14
Q

how can we aggregate passage scores?

A

BERT-MaxP: passage score aggregation
- training: treat all passages from a relevant document as relevant and all passages from a non-relevant document as not relevant
- inference: estimate the relevance of each passage, then take the maximum passage score (MaxP) as the document score

PARADE: passage representation aggregation
- training: the same as for BERT-MaxP
- inference: aggregate the representations of passages rather than aggregating the scores of individual passages (averaging the [CLS] representation from each passage)

OR use passage-level relevance labels or transformer architectures for long texts
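
a minimal sketch of the two inference-time aggregation ideas (passage scores and representations are made up):

```python
import numpy as np

passage_scores = np.array([0.2, 0.9, 0.4])     # relevance score per passage of one document
doc_score_maxp = passage_scores.max()          # BERT-MaxP: document score = maximum passage score, 0.9

passage_cls = np.random.randn(3, 768)          # [CLS] vector per passage
doc_representation = passage_cls.mean(axis=0)  # PARADE-style: aggregate representations (here: averaging)
print(doc_score_maxp, doc_representation.shape)
```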
