Vector Spaces & Cosine Similarity: The Math Behind Dense Retrieval
Why two texts can mean the same thing without sharing a single word — and how a dot product turns that intuition into a retrieval engine.
HYPOTHESIS
H₀
METHOD
Worked through the math of vector embeddings, cosine similarity, and dense retrieval from first principles before writing a line of pipeline code. The goal: understand why each piece exists and what breaks if you remove it — not just how to call the API.
Tested understanding with three concrete questions at the end. This is the first entry in the Building RAG from Scratch quest — every concept gets its own note before it shows up in code.
OBSERVATIONS
- 01 An embedding model maps text → a list of ~1,500 numbers (a vector). Semantically similar texts land near each other because the model was trained on millions of sentence pairs to pull similar meanings toward the same region. You never interpret individual dimensions; the direction the vector points carries the meaning.
- 02 Cosine similarity measures the angle between two vectors, not the distance between them. This matters: a short sentence and a long paragraph covering the same topic can produce vectors of different lengths (different distances from the origin) that still point in the same direction, so cosine similarity scores them as a match.
- 03 Vectors A=[1,0,0] and B=[0,1,0] have cosine similarity 0: perpendicular, unrelated. The dot product is 0, and dividing by the product of magnitudes leaves 0.
- 04 Long documents produce diluted vectors: many topics averaged together pull the vector off the query direction. A short, focused chunk points cleanly at one concept with no noise. This is the core reason chunking exists. Try the “Diluted by noise” preset below.
- ★ Dense retrieval fails on opaque identifiers (error codes, product IDs, rare jargon) because embedding models have no semantic signal to grab. BM25 (exact token matching) fills this gap. Production retrieval is always hybrid because real query distributions contain both semantic and exact-match queries.
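Observations 02 to 04 can be checked numerically with a few lines of plain Python. This is a minimal sketch, not a real pipeline: the 3-dimensional vectors are hypothetical stand-ins for real embeddings, and `mean_vector` is an assumed, simplified model of how a long document blends several topics into one vector.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise average: a crude stand-in for topic blending (assumption)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical toy "embeddings"; real ones have ~1,500 dimensions.
query = [1.0, 0.0, 0.0]
focused_chunk = [0.9, 0.1, 0.0]
off_topic_a = [0.0, 1.0, 0.0]
off_topic_b = [0.0, 0.0, 1.0]

# Observation 02: scaling a vector changes its length, not its direction.
scaled_chunk = [5 * x for x in focused_chunk]
same_direction = math.isclose(
    cosine_similarity(query, focused_chunk),
    cosine_similarity(query, scaled_chunk),
)

# Observation 03: perpendicular vectors score 0.
perpendicular = cosine_similarity([1, 0, 0], [0, 1, 0])

# Observation 04: averaging in unrelated topics pulls the vector off-axis.
long_document = mean_vector([focused_chunk, off_topic_a, off_topic_b])
focused_score = cosine_similarity(query, focused_chunk)
diluted_score = cosine_similarity(query, long_document)

print(same_direction)            # True
print(perpendicular)             # 0.0
print(round(focused_score, 2))   # 0.99
print(round(diluted_score, 2))   # 0.52
```

The focused chunk scores about 0.99 against the query, while the same chunk averaged with two unrelated topics drops to about 0.52, even though nothing about the relevant content changed.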
THE MATH
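Cosine similarity between two vectors A and B:

$$
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}
$$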
Score interpretation:
| Score | Geometric meaning | Semantic meaning |
|---|---|---|
| 1.0 | Same direction | Semantically identical |
| 0.0 | Perpendicular | Unrelated |
| -1.0 | Opposite direction | Semantically opposite |
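To see where a score between those reference points comes from, here is a worked example with two made-up vectors, A = [1, 2, 2] and B = [2, 1, 2] (chosen only because their magnitudes come out as whole numbers):

$$
A \cdot B = (1)(2) + (2)(1) + (2)(2) = 8
$$

$$
\|A\| = \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad \|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3
$$

$$
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
$$

A score of roughly 0.89 corresponds to an angle of about 27°: close to the same direction, but not identical.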
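In practice, most text embedding models produce scores between 0 and 1 for natural-language inputs; negative scores are possible but uncommon. This is not caused by ReLU activations (individual vector components are routinely negative). It is usually attributed to learned embeddings clustering in a narrow region of the space rather than spreading evenly in all directions. Dense retrieval at query time is: embed the question → compute cosine similarity against every stored chunk vector → return top-k.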
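The query-time loop described above can be sketched as a brute-force scan. The embedding step is left out here: `query_vector` and the entries in `chunk_vectors` are hypothetical hand-written vectors standing in for model output, and the chunk IDs are invented for illustration. Real vector stores often replace the full scan with an approximate nearest-neighbor index, but the scoring idea is the same.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k=2):
    """Score every stored chunk against the query and keep the k best."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    # nlargest returns (score, chunk_id) pairs, highest score first.
    return heapq.nlargest(k, scored)

# Hypothetical store: chunk ID -> precomputed embedding (toy 3-d vectors).
chunk_vectors = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.1, 0.9, 0.1],
    "account-recovery": [0.7, 0.2, 0.1],
}

# Stands in for the embedded question.
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, chunk_vectors, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.2f}")
```

With these toy vectors the two chunks closest in direction to the query are returned first, and the unrelated chunk is dropped.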
PLAYGROUND
Interactive: move the sliders or hit a preset. Watch the angle arc and similarity score update live. The formula panel on the left shows every step of the calculation, the same math a vector store such as ChromaDB runs on every query when it is configured for cosine distance.
↑ Try “Diluted by noise” — that’s what a long document embedding looks like compared to a focused query. A and B point in vaguely the same direction, but the score drops because noise pulled B off-axis.
CONCLUSION
Cosine similarity over embedding vectors is a principled retrieval signal — geometric direction encodes semantic meaning in a way that’s measurable and interpretable. But the signal only holds for semantic queries. Exact-match queries (error codes, IDs, rare jargon) break it, because the embedding model has nothing to embed beyond surface tokens.
The practical conclusion: production retrieval is always hybrid. Dense handles “I don’t know the exact phrasing.” BM25 handles “I know the exact string.” Neither wins alone.
WHAT NEXT
BM25 — TF-IDF, the saturation parameters k₁ and b, and why keyword search still beats dense retrieval on specific query types. Then: how to merge both signals with Reciprocal Rank Fusion.
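As a preview of that merge step, here is a minimal sketch of Reciprocal Rank Fusion: each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in. The constant `k=60` is a commonly used default rather than a requirement, and the chunk IDs and both input rankings are hypothetical, not output from a real dense or BM25 retriever.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers, best match first.
dense_ranking = ["chunk-a", "chunk-b", "chunk-c"]
bm25_ranking = ["chunk-d", "chunk-a", "chunk-b"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['chunk-a', 'chunk-b', 'chunk-d', 'chunk-c']
```

Chunks that appear in both rankings rise to the top, while a chunk found by only one retriever still survives in the merged list. RRF uses only rank positions, so the two retrievers' raw scores never need to be put on a common scale.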