Embeddings & Locality sensitive hashing Flashcards

1
Q

What is the key idea of embedding?

A

Convert objects to vectors, based on some notion of similarity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is Jaccard similarity?

A

It is the ratio of elements in the intersection of two sets and the union of two sets.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the surprising property of the jaccard similarity?

A

The probability over all permutations of the rows that h(c1) = h(c2) is the same as Sim(c1,c2)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is min hashing?

A

It is an algorithm that creates a signature for a matrix based on some hash functions such that the similarity between columns is maintained

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the pseudocode for min hashing?

A

for each row r
for each column c
if c has 1 in row r
for each hash h_i
if h(r) is smaller than M(i,c) then M(i,c) = h_i(r)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is locality sensitive hashing?

A

It is using hashing functions that take locality into consideration. For example, projecting elements on the x-axis is a hashing method that keeps them ordered in order of x value.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How to find nearest neighbors using LSH?

A

Hash all data points using locality-sensitive hash

All points in the same bucket as query point are candidate near neighbors

Compute distances only to candidate points to find true NN

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How to resolve collisions and splits when using LSH?

A

To resolve same bucket problem -> compute LSH from different directions. Points are candidate if they are candidate in all hash tables

To resolve split problem -> points are candidate if they are candidates in any of the hash tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the MAJORITY algorithm?

A

Just count each item and find if there is one that appears more than half of the time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is the FREQUENT algorithm?

A

Find all elements in a sequence whose frequency exceeds 1/k fraction of the total count.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is sketching?

A

A linear transform of the input.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a bloom filter?

A

Given a set of hash functions that map all elements of the universe to a predefined range and a bit vector with the size of the range, to add an element to W, compute the hashes of the element, and then set the corresponding bits to 1.
To test whether an element is in W, compute all the hashes, and sum up the returned bits.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the probability of a false positive when using a bloom filter?

A

(1 - (1 - 1/n)^km )^k, where k is the number of hashes, and m is the number of inserts.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is count-min sketching?

A

Creates a small summary of a stream as an array of size wxd, where d is the number of hash functions that map values to [1, w]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q
A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly