03 - Content-based Filtering Flashcards

1
Q

How to represent textual data?

A
  • Tabular data
  • Vectors
  • Points
2
Q

What does tabular data for Recommender Systems look like?

A
  • Every document/instance is represented as one row of a table/matrix or as a vector
  • Every column (feature) corresponds to a term
  • All documents are vectors in a vector space
  • Every term corresponds to one dimension in the vector space
  • Every instance represents one feature vector or point in an n-dimensional vector space
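
A minimal Python sketch of this representation (the toy documents below are made up for illustration, not taken from the lecture):

```python
# Minimal, self-contained sketch: build a term-document matrix by hand.
# The toy documents are illustrative only.
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Every distinct term becomes one dimension (one column) of the vector space.
vocab = sorted({term for doc in docs for term in doc.split()})

# Every document becomes one row: a term-count vector over that vocabulary.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```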
3
Q

How to measure the relevance of a document?

A
  • Euclidean Distance (L2 Norm)
  • Manhattan Distance (L1 Norm)
  • Cosine Similarity (Distance)
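
A rough Python sketch of the three measures on plain term-weight vectors (the example vectors are arbitrary):

```python
# Sketch of the three relevance measures on plain Python lists
# interpreted as term-weight vectors; the example vectors are arbitrary.
import math

def euclidean_distance(a, b):      # L2 norm of the difference
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):      # L1 norm of the difference
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):       # cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 2.0]
doc   = [2.0, 1.0, 3.0]
print(euclidean_distance(query, doc))   # smaller = more relevant
print(manhattan_distance(query, doc))   # smaller = more relevant
print(cosine_similarity(query, doc))    # larger  = more relevant
```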
4
Q

What are possible problems with the Euclidean Distance (L2 Norm)?

A
  • The length of a document strongly influences the distance
  • Totally irrelevant documents could end up close to each other (e.g. two short documents on completely different topics)
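
A small Python illustration of the length problem (made-up term counts): a document with the same term proportions as the query but ten times the length gets a large Euclidean distance, while the cosine similarity is unaffected.

```python
# Illustration of the length problem with made-up term-count vectors:
# doc_long repeats the same content as doc_short ten times over.
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query     = [1, 2, 0]
doc_short = [1, 2, 0]     # same term proportions as the query
doc_long  = [10, 20, 0]   # same content, ten times as long

print(euclidean_distance(query, doc_short), euclidean_distance(query, doc_long))  # 0.0 vs ~20.1
print(cosine_similarity(query, doc_short), cosine_similarity(query, doc_long))    # ~1.0 vs ~1.0
```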
5
Q

What is the Inverse Document Frequency (IDF)?

A

IDF(t) = log( (number of documents in corpus D) / (number of documents in D that contain the term t) )
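
A direct Python sketch of this formula (the tiny corpus of token sets is made up):

```python
# Direct sketch of the IDF formula; the tiny corpus of token sets is made up.
import math

def idf(term, corpus):
    # Number of documents in the corpus that contain the term.
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)   # breaks if doc_freq == 0 (see card 7)

corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "bird", "flew"},
]

print(idf("cat", corpus))   # occurs in 1 of 3 documents -> log(3) ≈ 1.10
print(idf("the", corpus))   # occurs in every document  -> log(1) = 0.0
```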

6
Q

What is TF-IDF?

A
  • The weight of a term is based on two factors: TF (term frequency) and IDF (inverse document frequency)
  • IDF is specific to the corpus D and does not change for new documents
  • TF is specific to each document
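
A rough Python sketch of how the two factors combine, assuming the usual convention that the weight is the product TF × IDF and that TF is the raw term count (the corpus is made up):

```python
# Rough sketch: IDF is computed once for the corpus D, TF per document,
# and the TF-IDF weight of a term is their product. Toy corpus only.
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew away",
]
tokenized = [doc.split() for doc in corpus]

# Corpus-specific factor: reused unchanged for every document.
def idf(term):
    doc_freq = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / doc_freq)

# Document-specific factor: here simply the raw term count.
def tf_idf_vector(doc_tokens):
    counts = Counter(doc_tokens)
    return {term: tf * idf(term) for term, tf in counts.items()}

print(tf_idf_vector(tokenized[0]))
```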
7
Q

Why is IDF not ideal?

A

If no document in the corpus contains the term, the document frequency is 0 and the formula requires a division by 0
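
One common workaround (not stated on the card) is to smooth the counts, e.g. by adding 1 to both numerator and denominator; a Python sketch:

```python
# Sketch of a smoothed IDF variant; the "+1" terms are one common workaround
# for the division by zero and are not taken from the card itself.
import math

def smoothed_idf(term, corpus):
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1

corpus = [{"the", "cat"}, {"the", "dog"}]
print(smoothed_idf("cat", corpus))      # term occurs in one document
print(smoothed_idf("unicorn", corpus))  # term occurs nowhere: still well-defined
```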

8
Q

What is a possible extension for IDF?

A

If two documents are equally relevant, you could define additional criteria, e.g. the age of the document
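
A hedged sketch of that idea in Python (the scores and ages below are invented): rank by relevance first and break ties by recency.

```python
# Sketch of the extension: break ties between equally relevant documents
# with a second criterion such as document age (newer first).
# Scores and ages are invented for illustration.
docs = [
    {"id": "a", "score": 0.82, "age_days": 400},
    {"id": "b", "score": 0.82, "age_days": 30},
    {"id": "c", "score": 0.75, "age_days": 5},
]

# Sort by relevance score (descending), then by age (ascending) as a tie-breaker.
ranked = sorted(docs, key=lambda d: (-d["score"], d["age_days"]))
print([d["id"] for d in ranked])   # ['b', 'a', 'c']
```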
