EXP #001 · Q-01 · Building RAG from Scratch · step 1 ✓ Achievement

Vector Spaces & Cosine Similarity: The Math Behind Dense Retrieval

Why two texts can mean the same thing without sharing a single word — and how a dot product turns that intuition into a retrieval engine.

2026-05-23 5 MIN READ COMPLETE

HYPOTHESIS

H₀ Semantic similarity between texts can be captured as geometric proximity in high-dimensional space, making cosine similarity a principled and interpretable retrieval signal.

METHOD

Worked through the math of vector embeddings, cosine similarity, and dense retrieval from first principles before writing a line of pipeline code. The goal: understand why each piece exists and what breaks if you remove it — not just how to call the API.

Tested understanding with three concrete questions at the end. This is the first entry in the Building RAG from Scratch quest — every concept gets its own note before it shows up in code.

OBSERVATIONS

01 An embedding model maps text to a list of roughly 1,500 numbers (a vector); the exact count depends on the model. Semantically similar texts land near each other because the model was trained on millions of sentence pairs to pull similar meanings toward the same region. You never interpret individual dimensions: the direction the vector points carries the meaning.
02 Cosine similarity measures the angle between two vectors, not the distance between them. This matters: a short sentence and a long paragraph covering the same topic can produce vectors at different distances from the origin that still point in the same direction.
03 Vectors A=[1,0,0] and B=[0,1,0] have cosine similarity 0: perpendicular, unrelated. The dot product is 0, and dividing by the product of magnitudes (1 × 1) leaves 0.
04 Long documents produce diluted vectors: many topics averaged together pull the vector off the query direction. A short, focused chunk points cleanly at one concept with little noise. This is the core reason chunking exists. Try the “Diluted by noise” preset below.
★ Dense retrieval fails on opaque identifiers (error codes, product IDs, rare jargon) because the embedding model has little semantic signal to grab. BM25 (lexical term matching) fills this gap. Production retrieval is typically hybrid because real query distributions contain both semantic and exact-match queries.
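
Observation 04 can be illustrated with a toy sketch. The three "topic" directions (`billing`, `shipping`, `returns`) are hypothetical, and treating a long document's vector as the mean of its topic vectors is a simplifying assumption, not how any particular embedding model works.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 3-D "topic" directions; real embeddings have far more dimensions.
billing = [1.0, 0.0, 0.0]
shipping = [0.0, 1.0, 0.0]
returns = [0.0, 0.0, 1.0]

query = [1.0, 0.0, 0.0]  # a query purely about billing

# A focused chunk covers one topic only.
focused_chunk = billing
# Assumption: a long document's vector behaves like the mean of its topics.
long_document = [sum(dims) / 3 for dims in zip(billing, shipping, returns)]

focused_score = cosine(query, focused_chunk)
diluted_score = cosine(query, long_document)
print(f"focused chunk: {focused_score:.3f}")  # 1.000
print(f"long document: {diluted_score:.3f}")  # 0.577
```

The long document still contains the billing content, but its averaged vector sits about 55° away from the query, so a focused chunk from a different source would outrank it.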

THE MATH

Cosine similarity between two vectors A and B:

$\text{cosine}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$
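
Two consequences of this formula, worked by hand as a sanity check. The positive scalar $c$ is introduced here only for illustration. First, the perpendicular pair $A = [1,0,0]$, $B = [0,1,0]$:

$A \cdot B = (1)(0) + (0)(1) + (0)(0) = 0, \quad \|A\| = \|B\| = 1, \quad \text{cosine}(A, B) = \frac{0}{1 \cdot 1} = 0$

Second, scaling a vector by any $c > 0$ changes its length but not the score, which is why only direction matters:

$\text{cosine}(cA, B) = \frac{(cA) \cdot B}{\|cA\| \cdot \|B\|} = \frac{c\,(A \cdot B)}{c\,\|A\| \cdot \|B\|} = \text{cosine}(A, B)$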

Score interpretation:

DATA TABLE n=3

Score	Geometric meaning	Semantic meaning
1.0	Same direction	Semantically identical
0.0	Perpendicular	Unrelated
-1.0	Opposite direction	Semantically opposite

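In practice, scores from most text embedding models fall mostly between 0 and 1, and strongly negative scores are rare. That is an empirical property of how trained embeddings are distributed (they tend to occupy a narrow region of the space), not a consequence of the formula: individual components can be negative, and nothing prevents a negative score. Dense retrieval at query time is: embed the question → compute cosine similarity against every stored chunk vector → return top-k.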
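
The query-time loop above can be sketched as a brute-force search. This is a minimal illustration, assuming the vectors have already been produced by some embedding model; the chunk ids, the 3-D vectors, and the helper names `normalize` and `top_k` are all hypothetical. Normalizing each vector to unit length first means the cosine score reduces to a plain dot product.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, chunk_id) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical 3-D chunk vectors standing in for real embedding output.
chunks = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.2],
    "refund-howto": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedded question

results = top_k(query, chunks, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

Scoring every stored vector is exact but scales linearly with the collection; vector stores commonly replace this scan with an approximate nearest-neighbor index.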

PLAYGROUND

interactive

Move the sliders or hit a preset. Watch the angle arc and similarity score update live. The formula panel on the left shows every step of the calculation: the same math a vector store such as ChromaDB runs on every query when its index is configured for cosine distance.

VECTOR PLAYGROUND

Vector A ("query"): A.x = 0.75, A.y = 0.55
Vector B ("document chunk"): B.x = 0.60, B.y = 0.70

A · B = (0.75)(0.60) + (0.55)(0.70)
= 0.835
‖A‖ = 0.930   ‖B‖ = 0.922
cosine = 0.835 / (0.930 × 0.922)
= 0.974 (Similar)

cosine similarity: 0.974, 13° between vectors (Similar)
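
The playground numbers above can be reproduced in a few lines of standard-library Python. This is a sketch of the same arithmetic, not the widget's actual code; the variable names are illustrative.

```python
import math

a = (0.75, 0.55)  # Vector A, the "query"
b = (0.60, 0.70)  # Vector B, the "document chunk"

dot = a[0] * b[0] + a[1] * b[1]   # A . B
norm_a = math.hypot(*a)           # length of A
norm_b = math.hypot(*b)           # length of B
cosine = dot / (norm_a * norm_b)  # cosine similarity
angle_deg = math.degrees(math.acos(cosine))

print(f"A . B  = {dot:.3f}")                         # 0.835
print(f"|A| = {norm_a:.3f}   |B| = {norm_b:.3f}")    # 0.930, 0.922
print(f"cosine = {cosine:.3f}")                      # 0.974
print(f"angle  = {angle_deg:.0f} degrees")           # 13
```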

↑ Try “Diluted by noise” — that’s what a long document embedding looks like compared to a focused query. A and B point in vaguely the same direction, but the score drops because noise pulled B off-axis.

CONCLUSION

✓ ACHIEVEMENT

Hypothesis confirmed.

Cosine similarity over embedding vectors is a principled retrieval signal — geometric direction encodes semantic meaning in a way that’s measurable and interpretable. But the signal only holds for semantic queries. Exact-match queries (error codes, IDs, rare jargon) break it, because the embedding model has nothing to embed beyond surface tokens.

The practical conclusion: production retrieval is typically hybrid. Dense handles “I don’t know the exact phrasing.” BM25 handles “I know the exact string.” Neither wins alone.

WHAT NEXT

BM25: TF-IDF, the term-frequency saturation parameter k₁ and the length-normalization parameter b, and why keyword search still beats dense retrieval on specific query types. Then: how to merge both signals with Reciprocal Rank Fusion.
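
As a preview of that merge step, here is a minimal sketch of Reciprocal Rank Fusion, which scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in. The chunk ids and both rankings are made up for illustration, and k=60 is an assumed default (a commonly used constant), not a value taken from this quest.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a dense retriever and a BM25 retriever.
dense_ranking = ["chunk-a", "chunk-b", "chunk-c"]
bm25_ranking = ["chunk-c", "chunk-a", "chunk-d"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['chunk-a', 'chunk-c', 'chunk-b', 'chunk-d']
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.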