Vector Search Flashcards
(111 cards)
What does FAISS stand for?
Facebook AI Similarity Search
Describe the vector search problem
We have a set of vectors, and for some query vector, we want to find the vectors that are most similar.
What is a Voronoi diagram?
It’s a division of the plane into cells, where each cell is all the space closer to one particular point than any others. It’s that point’s “territory.”
Conceptually, if some query point falls in a Voronoi cell, what does that mean?
It means you know which point is closest.
What does “probing” mean?
It means searching nearby cells, not just the one you landed on.
What does IVF stand for?
InVerted File
What is the IVF strategy?
It’s a strategy for vector search. You pick some set of key points, make a Voronoi diagram of them, and then find the Voronoi cell for your query vector and search all the points in that cell – plus maybe probe some nearby ones.
What does PQ stand for?
Product Quantization
What is the point of PQ?
It’s to make the vectors themselves smaller, so that L2 distance is easier to calculate.
How do you do Product Quantization?
First, chunk the vectors up into subvectors. Then, for a particular set of subvectors, do clustering, so you get centroids. Finally, replace each subvector with the cluster id.
PQ maps from what space size to what space size?
It maps from the original vector space, like R 100, to a space of the number of subvectors, like R4 if you’re breaking up into 4 subvectors.
What are codewords?
They’re the centroids from Product Quantization.
What is a codebook?
The set of all centroids for a particular subvector subspace in Product Quantization.
What is Indexing?
It’s the strategy you use to store your vectors in your vector space – you put them in an index. And then you need to be able to search within that index.
What is a “flat index”?
It’s one where you don’t modify the vectors at all and just store them directly.
How would you lookup in a flat index?
You would literally just do KNN with the entire index.
What does KNN stand for?
K Nearest Neighbors.
How do you do KNN?
Calculate distance between the query vector and every single other vector. Then return the top K closest.
What are the pros and cons of a flat index?
Pros: Search quality is maximized since you are literally just doing KNN. Cons are that this takes forever.
What does ANN stand for?
Approximate Nearest Neighbors
What does LSH stand for?
Locality Sensitivity Hashing
How do you do LSH?
Write some hash function that puts similar vectors in the same hash bucket. Then at lookup time, hash the query vector to get its group of similar vectors.
What does it mean “maximize collisions” in the context of LSH?
It means you want the hash function to put similar vectors in the same bucket (collide them).
What are the pros and cons of LSH?
Pros: It’s a middle ground on everything – decent performance but also decent speed and storage. Cons are that it’s not amazing at any one thing, and also susceptible to the curse of dimensionality.