HyDE: Embed the Answer You Wish You Had
When your query and its answer live in different parts of the embedding space — and how generating a hypothetical answer first bridges that gap.
HYPOTHESIS H₀
THE PROBLEM
Dense retrieval embeds your query and compares it against stored document chunks. The assumption is that similar meaning lands in the same region of the embedding space. That assumption holds — until it doesn’t.
The vocabulary gap: a user’s question often sounds nothing like the answer in your knowledge base. Compare “How do I fix a segmentation fault?” with “Memory access violations occur when a process reads or writes outside its allocated address space.” The two are semantically related, but a question-shaped query and an answer-shaped document can embed into different regions. The query then scores poorly against the very chunk that answers it, and retrieval fails silently.
HyDE (Hypothetical Document Embeddings) solves this with one insight: instead of embedding the question, ask the LLM to generate a hypothetical answer first — then embed that.
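The flow can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real dense encoder, `generate_hypothetical_answer` is a hypothetical stub that returns a canned string where a real system would call an LLM, and the second chunk is invented as a distractor.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical_answer(query: str) -> str:
    """Hypothetical stub for the LLM call; a real system would prompt a model here."""
    canned = {
        "How do I fix a segmentation fault?": (
            "A segmentation fault occurs when a process reads or writes memory "
            "outside its allocated address space, often through an invalid pointer."
        ),
    }
    # Fall back to the raw query when no canned answer exists.
    return canned.get(query, query)


def hyde_search(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Embed the hypothetical answer instead of the query, then rank chunks."""
    h_vec = embed(generate_hypothetical_answer(query))
    ranked = sorted(chunks, key=lambda c: cosine(h_vec, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Memory access violations occur when a process reads or writes outside its allocated address space.",
    "How do I fix a flat tire? Remove the wheel and patch the inner tube.",  # invented distractor
]
query = "How do I fix a segmentation fault?"

# Baseline: rank by similarity to the raw query.
baseline = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)), reverse=True)[0]

print("baseline:", baseline)
print("hyde:    ", hyde_search(query, chunks)[0])
```

In this toy setup the raw query matches the question-shaped distractor, while the hypothetical answer matches the memory-access chunk. A real encoder is far less literal than word counts, so treat the size of the effect here as exaggerated.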
LAYMAN EXPLANATION
Imagine a library where every book is filed by what the content says, not by the questions it answers. If you walk in asking “why does my program crash?”, the librarian might not find anything — because no book is titled that. But if you instead ask “a program crashes when it tries to read memory it doesn’t own — what causes that?”, the filing system finds the right shelf immediately.
HyDE is the librarian’s trick: before searching, rephrase your question as a partial answer. The LLM drafts a plausible document. That document-shaped text gets embedded and searched — and lands in the right neighborhood of the vector space because it sounds like the kind of content your knowledge base actually contains.
The LLM might be wrong on the facts (it’s a hypothetical answer). That doesn’t matter. What matters is the shape of the text — answer-shaped language retrieves answer-shaped documents.
LIVE DEMO
Type any query below. HyDE generates a hypothetical answer and shows you what actually gets embedded: the answer, not the question.
↑ Notice that the output reads like a document, not a question. That’s the vocabulary gap closing.
THE MATH
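Standard dense retrieval embeds the query $q$ directly and scores each document $d$ against it, where $E$ is the embedding model and $\operatorname{sim}$ is a similarity measure such as cosine similarity:

$$s_{\text{base}}(q, d) = \operatorname{sim}\big(E(q),\, E(d)\big)$$

HyDE replaces the raw query embedding $E(q)$ with the embedding of the hypothetical document $h = \mathrm{LLM}(q)$:

$$s_{\text{HyDE}}(q, d) = \operatorname{sim}\big(E(h),\, E(d)\big)$$

The hypothesis is that $\operatorname{sim}\big(E(h), E(d)\big) > \operatorname{sim}\big(E(q), E(d)\big)$ for most semantically relevant documents $d$, because $h$ and $d$ share answer-register vocabulary even if the LLM got the details wrong.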
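A small numeric check makes the comparison concrete. Here `q` is the query, `d` a stored document, and `h` a hypothetical answer that is deliberately wrong on the facts. The sketch assumes a toy bag-of-words `embed` in place of a real encoder, and the text of `h` is invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


q = "How do I fix a segmentation fault?"
d = "Memory access violations occur when a process reads or writes outside its allocated address space."
# Invented hypothetical answer: it wrongly blames the GPU, but it is written in answer register.
h = "A segmentation fault occurs when the GPU reads or writes outside its allocated address space."

s_q = sim(embed(q), embed(d))  # baseline: query against document
s_h = sim(embed(h), embed(d))  # HyDE: hypothetical answer against document

print(f"sim(E(q), E(d)) = {s_q:.3f}")
print(f"sim(E(h), E(d)) = {s_h:.3f}")
```

The wrong-but-answer-shaped text scores well above the raw question because it shares phrasing with the document. Word counts overstate the effect compared with a learned encoder, so this shows the direction of the inequality only.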
The length of the hypothetical document matters. Longer hypothetical answers cover more vocabulary, but vector dilution sets in: the embedding averages over more topics, pulling the vector away from any single concept. The table below summarizes the trade-off:
| Approach | What gets embedded | Strength |
|---|---|---|
| Baseline | Raw query: "how does X work?" | Fast, no extra LLM call |
| HyDE | Hypothetical answer: "X works by..." | Bridges vocabulary gap, better recall |
| ★ HyDE sweet spot | 2–3 sentence hypothesis | Maximum similarity gain before dilution kicks in |
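Dilution can be reproduced in a toy setting. The sketch below grows a hypothetical answer one sentence at a time and measures its similarity to a target document. It assumes a bag-of-words `embed` in place of a real encoder, and the five sentences are invented so that the first two are on-topic and the rest drift away.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


document = "Memory access violations occur when a process reads or writes outside its allocated address space."

# Invented hypothetical answer: two on-topic sentences, then three that wander off.
sentences = [
    "A segmentation fault is a memory access violation.",
    "It occurs when a process reads or writes outside its allocated address space.",
    "Debuggers help locate the faulty pointer.",
    "Compilers also emit warnings about unused variables.",
    "Build systems cache object files to speed up compilation.",
]

doc_vec = embed(document)
scores = []
for n in range(1, len(sentences) + 1):
    hypothesis = " ".join(sentences[:n])  # first n sentences of the answer
    scores.append(cosine(embed(hypothesis), doc_vec))

for n, score in enumerate(scores, start=1):
    print(f"{n} sentence(s): similarity {score:.3f}")

best_length = scores.index(max(scores)) + 1
print("best length:", best_length)
```

Similarity rises while added sentences contribute shared vocabulary and falls once they add unrelated terms. Where the peak lands depends on the sentences and the encoder, so the exact sentence count here is an artifact of the example.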
REFERENCE PAPERS
| Paper | Year | Key contribution |
|---|---|---|
| Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al.) | 2022 | Original HyDE paper — proposes embedding hypothetical documents for zero-shot dense retrieval |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (Thakur et al.) | 2021 | Benchmark used to evaluate HyDE across diverse retrieval tasks |
| Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al.) | 2020 | Introduces DPR, the standard bi-encoder retrieval approach that embeds the raw query, which is the setup HyDE modifies |
WHAT NEXT
HyDE changes what gets embedded on the query side. The next technique operates on the abstraction level of the query: it asks a broader, principle-level question first to retrieve foundational context before narrowing back to the specific answer. That’s Step-Back Prompting.
CONCLUSION
HyDE narrows the vocabulary gap between question-style queries and answer-style documents. The mechanism is simple: answers embed near answers, so a plausible-but-wrong hypothetical answer is often enough to land in the right neighborhood. For retrieval, the LLM’s register matters more than its factual accuracy.
The cost: one additional LLM call per query. The sweet spot is a 2–3 sentence hypothesis — long enough for dense vocabulary signal, short enough to avoid vector dilution. Worth it for knowledge bases where question phrasing diverges significantly from document phrasing.
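One way to keep the hypothesis in the 2–3 sentence range is to ask for it in the prompt. The sketch below is an assumption throughout: the function name `build_hyde_prompt` and the template wording are illustrative and are not taken from the HyDE paper or any library.

```python
def build_hyde_prompt(query: str, min_sentences: int = 2, max_sentences: int = 3) -> str:
    """Build a prompt asking an LLM for a short, answer-shaped passage.

    Hypothetical helper: the template wording is illustrative only.
    """
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must be non-empty")
    if not 1 <= min_sentences <= max_sentences:
        raise ValueError("need 1 <= min_sentences <= max_sentences")
    return (
        f"Write a short passage of {min_sentences} to {max_sentences} sentences "
        "that answers the question below.\n"
        "Write it as a statement from reference documentation, not as a question.\n"
        f"Question: {cleaned}\n"
        "Passage:"
    )


prompt = build_hyde_prompt("How do I fix a segmentation fault?")
print(prompt)
```

The returned string would be sent to whichever LLM the pipeline uses, and the model's completion is what gets embedded.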