Some more resources:
- A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
- Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021
- What are Embeddings, https://raw.githubusercontent.com/veekaybee/what_are_embeddi...
---
h/t: https://twitter.com/eatonphil/status/1745524630624862314 and https://twitter.com/ChristophMolnar/status/17457316026829826...
Here's my attempt at this: Embeddings: What they are and why they matter https://simonwillison.net/2023/Oct/23/embeddings/
Available as an annotated presentation + an optional 38-minute video.
Nicely written and very approachable. Might add a paragraph on why to use cosine similarity, as it gives a chance to illustrate how the n-dimensional vector embedding is used.
I still don't understand why cosine similarity is used (other than maybe being cheap to compute). I've read a bunch of hand-wavey articles but none of them add up for me.
If all your vectors have the same 2-norm, the cosine similarity of two vectors is a monotonic function of their euclidean distance (i.e. the minimum in one measure will be the minimum in the other).
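Concretely, for unit-norm vectors ||u - v||^2 = 2 - 2*cos(u, v), so one is a monotone function of the other. A quick numeric check of that (assuming numpy; the vectors here are just random stand-ins for embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 128))   # stand-ins for two embedding vectors
u /= np.linalg.norm(u)             # project both onto the unit sphere
v /= np.linalg.norm(v)

cos_sim = float(u @ v)
sq_dist = float(np.sum((u - v) ** 2))
print(sq_dist, 2 - 2 * cos_sim)    # the two values agree up to float error
```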
But if your vectors don't all have the same norm, where's the sense in projecting them onto the unit sphere and measuring the angle between them? You're throwing away important information about the embedding.
Could anyone explain?
I asked it some time ago: https://cs.stackexchange.com/questions/147713/why-word-embed...
There is another aspect: if all vectors are normalized, cosine similarity is just the dot product, which is faster to compute than Euclidean distance.
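To illustrate (a minimal sketch, assuming numpy and a made-up random corpus standing in for real embeddings): if the corpus is normalized once up front, each query is answered with a single matrix-vector product, and no norms are computed at query time.

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10_000, 384))                    # stand-in embedding matrix
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)    # normalize once, offline

def search(query, k=5):
    q = query / np.linalg.norm(query)    # normalize the query as well
    scores = corpus @ q                  # dot products == cosine similarities here
    return np.argsort(-scores)[:k]       # indices of the top-k nearest items

print(search(rng.normal(size=384)))
```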
I have to say, neither of those answers is very satisfying. Neither is wrong, but I don't come away with a clear explanation of why cosine similarity is the obvious metric to use.
For Euclidean distance, we have ||q - v||^2 = q·q + v·v - 2 q·v. The q·q and v·v terms don't tell us anything about the relationship between q and v, so they're irrelevant to search. So really we are mostly trying to maximize the inner product when we search.
Additionally, the scale of q (the query) is constant across any given search, so might as well normalize q; the results won't change.
Then the only question is whether the scale of v matters... It, again, doesn't tell us anything about the relationship between q and v. So, might as well normalize... And then we are just doing cosine similarity.
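A rough check of that argument (numpy assumed, toy random vectors standing in for embeddings): once the candidate vectors are normalized, ranking by Euclidean distance, by inner product, and by cosine similarity all give the same order.

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.normal(size=64)                          # query, deliberately left unnormalized
V = rng.normal(size=(1000, 64))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # normalize the candidates only

by_distance = np.argsort(np.linalg.norm(V - q, axis=1))    # closest first
by_inner    = np.argsort(-(V @ q))                         # largest inner product first
by_cosine   = np.argsort(-(V @ q) / np.linalg.norm(q))     # dividing by ||q|| can't change the order

print(np.array_equal(by_distance, by_inner), np.array_equal(by_inner, by_cosine))
```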
Cosine similarity is not affected by the size of the vectors, only by the angle between them. This means that vectors with large or small values will have the same cosine similarity as long as they point in the same direction.
In semantic search the magnitude of vectors isn't relevant - if you needed a measure that took magnitude into account, then Euclidean would make sense. E.g. image embeddings based on pixel intensity.
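For example (a tiny sketch, numpy assumed): rescaling one of the vectors changes the Euclidean distance a lot but leaves the cosine similarity untouched.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
a, b = rng.normal(size=(2, 32))

print(cos_sim(a, b), cos_sim(a, 10 * b))                   # same value (up to float error)
print(np.linalg.norm(a - b), np.linalg.norm(a - 10 * b))   # very different values
```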
It's not obvious to me that vector size isn't relevant in semantic search. What is it about the training process for semantic search that makes that the case?
Honestly, I use cosine similarity because that's what everyone else does and I haven't spent even a moment considering any of the alternatives.
Here's my attempt.
Think of partitioning the space by taking random hyperplanes. You can do this by computing the dot product of the plane's normal with any given point to see which side of the plane the point is on.
For two points that are very close to each other in Euclidean distance, the chance that a random hyperplane gets right in the middle of them is very low. What matters is, of course, the angle between them as measured from the origin, but that angle gets very small the closer the two points are to each other in Euclidean norm and the further they are from the origin. By bisecting the space randomly and seeing which side points are on, we get a decent hashing function, binning similar points together.
Now, a point could be very close to the origin and have a small angle to another point very far from the origin, but, in some sense, there needs to be a conspiracy for this to happen. If you assume some kind of random distribution of points in D-dimensional space, drawn from some sort of uniform-like distribution on the D-dimensional hypercube or hypersphere, the chance that a point will be very near the origin goes down, as does the chance that it will lie in a thin cone around some other point.
I believe the basic assumption here is that independent features will be randomly distributed, whereas correlated features will be clumped together (in Euclidean distance). The randomness pushes unrelated points away from the origin and gives a decreasing probability that the angle between them is low.
If the points aren't random enough - maybe many of them are very close to the origin while others aren't, or maybe there's some other dependence that causes problems - then this could foil the hyperplane trick.
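A hedged sketch of that hashing idea (numpy assumed; the names and numbers here are made up): each random hyperplane through the origin contributes one hash bit, namely which side of the plane a point falls on. Points with a small angle between them are rarely separated by a random plane, so they share most bits.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_planes = 128, 16
normals = rng.normal(size=(n_planes, dim))    # one random normal vector per hyperplane

def lsh_bits(x):
    return (normals @ x > 0).astype(int)      # sign of each dot product = side of each plane

base     = rng.normal(size=dim)
nearby   = base + 0.05 * rng.normal(size=dim)  # small Euclidean perturbation of base
far_away = rng.normal(size=dim)                # an unrelated point

print(np.sum(lsh_bits(base) != lsh_bits(nearby)))     # few (often zero) bits differ
print(np.sum(lsh_bits(base) != lsh_bits(far_away)))   # roughly half the bits differ
```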
I believe I heard this basic argument from a talk on locality-sensitive hashing [0]. I think I have the basic intuition, but maybe that talk will be more enlightening.
[0] https://youtu.be/t_8SpFV0l7A?si=pddMFZfrHsXlLgHq
Thanks for writing this one, Simon. I read it some time ago and just wanted to say thanks and recommend it to folks browsing the comments - it's really good!
Very helpful for making it clear in concrete terms.
Throwing a few more on here (mix of beginner and advanced):
- Wikipedia article: https://en.wikipedia.org/wiki/Vector_database
- Vector Database 101: https://zilliz.com/learn/introduction-to-unstructured-data
- ANN & Similarity search: https://vinija.ai/concepts/ann-similarity-search/
- Distributed database: https://15445.courses.cs.cmu.edu/fall2021/notes/21-distribut...
Throwing one more - https://www.pinecone.io/learn/vector-database/