EXP #002 · Building RAG from Scratch · step 2

HyDE: Embed the Answer You Wish You Had

When your query and its answer live in different parts of the embedding space — and how generating a hypothetical answer first bridges that gap.

2026-05-27 6 MIN READ COMPLETE

HYPOTHESIS

H₀ Embedding a hypothetical answer to a query will produce a vector closer to relevant documents than embedding the raw query directly, because two answer-shaped texts share vocabulary in a way that a question and an answer do not.

THE PROBLEM

Dense retrieval embeds your query and compares it against stored document chunks. The assumption is that similar meaning lands in the same region of the embedding space. That assumption holds — until it doesn’t.

The vocabulary gap: a user’s question can sound nothing like the answer in your knowledge base. Compare “How do I fix a segmentation fault?” with “Memory access violations occur when a process reads or writes outside its allocated address space.” The two are semantically related, but a question-shaped query and an answer-shaped document can embed into different regions. The query then scores poorly against the very chunk that answers it, and retrieval fails silently: you still get top-k results, just not the right ones.

HyDE (Hypothetical Document Embeddings) solves this with one insight: instead of embedding the question, ask the LLM to generate a hypothetical answer first — then embed that.
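The two-step flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, `stub_llm` is a canned stand-in for an LLM call, and the chunks are made up for the example. All of these names are assumptions introduced here.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model (assumption): unit-length bag of words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Both vectors are unit length, so the dot product is the cosine."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def stub_llm(query: str) -> str:
    """Stand-in for an LLM call (assumption): ignores the query and
    returns a canned, answer-shaped draft."""
    return ("A segmentation fault occurs when a process reads or writes "
            "memory outside its allocated address space.")


def dense_search(text: str, chunks: list[str], top_k: int = 1) -> list[tuple[float, str]]:
    """Baseline dense retrieval: embed `text`, rank chunks by cosine similarity."""
    vector = toy_embed(text)
    scored = [(cosine(vector, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def hyde_search(query: str, chunks: list[str],
                llm: Callable[[str], str] = stub_llm,
                top_k: int = 1) -> list[tuple[float, str]]:
    """HyDE: draft a hypothetical answer, then search with the draft, not the query."""
    hypothetical = llm(query)
    return dense_search(hypothetical, chunks, top_k)


# Made-up knowledge base for illustration.
CHUNKS = [
    "Memory access violations occur when a process reads or writes "
    "outside its allocated address space.",
    "To fix a flat bicycle tire, remove the wheel and patch the inner tube.",
]
QUERY = "How do I fix a segmentation fault?"

baseline_top = dense_search(QUERY, CHUNKS)[0][1]
hyde_top = hyde_search(QUERY, CHUNKS)[0][1]
print("raw query ->", baseline_top)
print("HyDE      ->", hyde_top)
```

With these toy stand-ins, the raw query shares the word "fix" with the bicycle chunk and ranks it first, while the drafted answer ranks the memory chunk first. A real embedding model is far less literal than word overlap, so treat this as a cartoon of the effect rather than a measurement.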

LAYMAN EXPLANATION

Imagine a library where every book is filed by what the content says, not by the questions it answers. If you walk in asking “why does my program crash?”, the librarian might not find anything — because no book is titled that. But if you instead ask “a program crashes when it tries to read memory it doesn’t own — what causes that?”, the filing system finds the right shelf immediately.

HyDE is the librarian’s trick: before searching, rephrase your question as a partial answer. The LLM drafts a plausible document. That document-shaped text gets embedded and searched — and lands in the right neighborhood of the vector space because it sounds like the kind of content your knowledge base actually contains.

The LLM might be wrong on the facts (it’s a hypothetical answer). That doesn’t matter. What matters is the shape of the text — answer-shaped language retrieves answer-shaped documents.

LIVE DEMO

Interactive demo (not reproduced here): you type a query, HyDE generates a hypothetical answer, and the demo shows what actually gets embedded, which is the answer rather than the question.

The output reads like a document, not a question. That’s the vocabulary gap closing.

THE MATH

Standard dense retrieval — embed the query directly:

$\text{score}(q, d) = \cos(E(q),\, E(d))$

HyDE replaces the raw query embedding with the embedding of the hypothetical document $\tilde{d}$:

$\tilde{d} = \text{LLM}(q) \qquad \text{score}_{\text{HyDE}}(q, d) = \cos(E(\tilde{d}),\, E(d))$

The hypothesis is that $\cos(E(\tilde{d}), E(d)) > \cos(E(q), E(d))$ for most semantically relevant documents — because $\tilde{d}$ and $d$ share answer-register vocabulary even if the LLM got the details wrong.
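One way to check the inequality on your own data is to compute the similarity gain per (query, hypothetical, document) triple and count how often it is positive. The sketch below assumes the three embeddings have already been computed by whatever model you use. The function names `hyde_gain` and `win_rate` and the 2-D vectors are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def cosine(u: Vec, v: Vec) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hyde_gain(e_query: Vec, e_hypothetical: Vec, e_doc: Vec) -> float:
    """cos(E(d~), E(d)) - cos(E(q), E(d)); positive means HyDE helped."""
    return cosine(e_hypothetical, e_doc) - cosine(e_query, e_doc)


def win_rate(triples: Sequence[tuple[Vec, Vec, Vec]]) -> float:
    """Fraction of (query, hypothetical, document) triples with a positive gain."""
    if not triples:
        return 0.0
    wins = sum(1 for q, h, d in triples if hyde_gain(q, h, d) > 0)
    return wins / len(triples)


# Made-up 2-D embeddings: (E(q), E(d~), E(d)).
TRIPLES = [
    ([1.0, 0.0], [0.2, 1.0], [0.0, 1.0]),  # draft lands near the document
    ([1.0, 1.0], [1.0, 0.0], [0.0, 1.0]),  # draft drifts away; raw query was closer
]

print(win_rate(TRIPLES))
```

The second triple is a reminder that the hypothesis says "most", not "all": a draft that drifts off topic can score worse than the raw query.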

The length of $\tilde{d}$ matters. Longer hypothetical answers cover more vocabulary, but past a point vector dilution sets in: the embedding averages over more topics, which pulls the vector away from any single concept. The simulator below illustrates the effect:

PARAMETER SIMULATOR · HYPOTHESIS LENGTH

Interactive slider (not reproduced here). Hypothesis length moves through three zones: too short, sweet spot, diluted. The captured setting is 3 sentences, in the sweet spot ("Dense, focused, maximum signal"):

Embedded text	Cosine similarity vs document
Raw query	0.38
HyDE hypothesis	0.79

Values are illustrative; real similarity depends on the embedding model and the query-document pair. The shape of the curve (rise, then plateau or drop) reflects the vector dilution effect.
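The dilution half of that curve can be reproduced with a toy model. The sketch below assumes the hypothesis embedding behaves like the mean of per-sentence vectors, which is a simplification (real models pool over tokens, not sentences), and it uses made-up 4-D sentence vectors in which the first two sentences are on topic and later ones drift. It does not model the initial rise from added vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean, standing in for pooling over a longer text."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def similarity_by_length(sentence_vectors, doc_vector):
    """Cosine between the document and the mean of the first n sentences, for each n."""
    return [
        cosine(mean_vector(sentence_vectors[:n]), doc_vector)
        for n in range(1, len(sentence_vectors) + 1)
    ]


# Made-up vectors: axis 0 is the document's topic, other axes are side topics.
DOC = [1.0, 0.0, 0.0, 0.0]
SENTENCES = [
    [0.9, 0.1, 0.0, 0.0],  # on topic
    [0.8, 0.2, 0.0, 0.0],  # on topic
    [0.1, 0.9, 0.0, 0.0],  # drifts to a side topic
    [0.0, 0.1, 0.9, 0.0],  # drifts further
    [0.0, 0.0, 0.1, 0.9],  # unrelated
]

sims = similarity_by_length(SENTENCES, DOC)
for n, sim in enumerate(sims, start=1):
    print(f"{n} sentence(s): {sim:.3f}")
```

Under these assumptions, similarity falls with every sentence that wanders from the document's topic, which is the mechanism behind the "diluted" zone.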

DATA TABLE n=3

Approach	What gets embedded	Strength
Baseline	Raw query: "how does X work?"	Fast, no extra LLM call
HyDE	Hypothetical answer: "X works by..."	Bridges vocabulary gap, better recall
★ HyDE sweet spot	2–3 sentence hypothesis	Maximum similarity gain before dilution kicks in
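If you want to stay in the 2–3 sentence range, one option is to ask for a short passage in the prompt and also truncate the draft before embedding it, since a model may not respect the length request. The prompt wording and both helper names below are assumptions made for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Naive boundary: whitespace that follows '.', '!' or '?'.
# It will mis-split abbreviations such as "e.g. this".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def build_hyde_prompt(query: str, max_sentences: int = 3) -> str:
    """Build a prompt asking for a short, answer-shaped passage (wording is an assumption)."""
    return (
        f"Write a short passage of at most {max_sentences} sentences "
        f"that answers the question.\n"
        f"Question: {query}\n"
        f"Passage:"
    )


def truncate_to_sentences(text: str, max_sentences: int = 3) -> str:
    """Keep only the first `max_sentences` sentences of a drafted answer."""
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    return " ".join(sentences[:max_sentences])


draft = "One. Two! Three? Four. Five."
print(truncate_to_sentences(draft))
```

Truncating after generation is a blunt tool. It caps dilution, but it cannot make an off-topic first sentence any better.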

REFERENCE PAPERS

DATA TABLE n=3

Paper	Year	Key contribution
Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al.)	2022	Original HyDE paper — proposes embedding hypothetical documents for zero-shot dense retrieval
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (Thakur et al.)	2021	Benchmark used to evaluate HyDE across diverse retrieval tasks
Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al.)	2020	DPR baseline that HyDE improves upon — the standard bi-encoder retrieval approach

WHAT NEXT

HyDE changes what gets embedded on the query side: a drafted answer stands in for the question. The next technique changes the abstraction level of the query instead, asking a broader, principle-level question first to retrieve foundational context before narrowing back to the specific answer. That’s Step-Back Prompting.

CONCLUSION

✓ ACHIEVEMENT

Hypothesis confirmed.

HyDE meaningfully closes the vocabulary gap between question-style queries and answer-style documents. The mechanism is simple: answers embed near answers, and a plausible-but-wrong hypothetical answer is enough to land in the right neighborhood. The LLM’s factual accuracy doesn’t matter — its register does.

The cost: one additional LLM call per query. The sweet spot is a 2–3 sentence hypothesis — long enough for dense vocabulary signal, short enough to avoid vector dilution. Worth it for knowledge bases where question phrasing diverges significantly from document phrasing.
