01

Why “just embeddings” tips over

The first demo bots of 2023 were seductively simple: parse a document, chunk it, run BGE or OpenAI-ada, cosine similarity on a vector DB — done. For a polished FAQ that's enough. The moment real documents come in — policies with “Section 3.2 applies only when …”, tables with code columns, three-page PDFs full of cross-references — search becomes a dice roll.

The reason is that dense embeddings know exactly one signal: semantic closeness. They can't tell “returns policy for digital goods” from “returns policy for perishable goods” when the prose is similar. And they know nothing of the four words above the paragraph that carry its entire meaning.

02

The four-stage pipeline

ChatFlow retrieval runs four stages deep, never fewer. Each stage has a clear job and a clear reason it is not skipped.

  • Stage 1 — Encode: the query becomes a 1024-dim dense vector via BGE-M3.
  • Stage 2 — Hybrid search: Qdrant returns dense hits and server-side BM25 hits, fused by Reciprocal Rank Fusion (RRF).
  • Stage 3 — Rerank: a cross-encoder (Cohere rerank-v4.0-pro or Qwen3-Reranker) scores the top 100 pairwise.
  • Stage 4 — Format: the final 15 chunks are passed to the agent with citation metadata.

Figure: Query → BGE-M3 → Qdrant (dense + BM25) → Reranker → 15 chunks. Four stages, one path, reproducible outcome.

03

The context prefix that changes everything

The one technical trick we are reluctant to call a trick is the Anthropic-style context prefix. Before embedding, a cheap model (GPT-4o-mini, ~256 tokens) generates one or two sentences per chunk that locate the chunk inside its document: “This section from the Returns Policy document describes the conditions under which digital products can be refunded within 14 days of purchase.”

The prefix is used only for embedding and reranking — never as part of the answer context. It costs once at ingestion and makes retrieval hits robust against ambiguous phrasing. In blind tests it moves retrieval precision noticeably — with zero effect on answer latency.
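
The split between "text used for retrieval" and "text used for the answer" can be sketched as a small data structure. This is a minimal illustration, not the ChatFlow implementation: the class name `IndexedChunk` and both method names are hypothetical, and joining prefix and chunk with a blank line is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    """Hypothetical chunk record; field and method names are illustrative."""

    text: str            # original chunk text, shown to the agent
    context_prefix: str  # one or two sentences generated once at ingestion

    def retrieval_text(self) -> str:
        """Text sent to the embedder and the reranker: prefix plus chunk."""
        return f"{self.context_prefix}\n\n{self.text}"

    def answer_text(self) -> str:
        """Text placed in the answer context: the prefix is left out."""
        return self.text


chunk = IndexedChunk(
    text="Refunds are possible within 14 days of purchase.",
    context_prefix=(
        "This section from the Returns Policy document describes the "
        "conditions under which digital products can be refunded."
    ),
)
```

Because the prefix only ever appears in `retrieval_text()`, it can change how a chunk is found without changing what the agent reads.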

04

BM25, because not everything is semantic

Dense embeddings correctly link “refund” and “return” — wonderful, as long as users write at the prose layer. They go blind the moment a query carries exact codes, SKUs, or technical tokens. “Error E-4102” has no semantic twin. BM25 spots it immediately.

Qdrant has had server-side BM25 since 1.15. On ingestion we register a Document object with model Qdrant/bm25 and skip a separate sparse-vector pipeline. We fuse the two rankings with Reciprocal Rank Fusion: no weighted scores, only summed reciprocal ranks, which stays robust when dense and BM25 scores live on different scales.
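
To make the fusion step concrete, here is a minimal sketch of the RRF formula in plain Python. In the pipeline described here the fusion happens inside Qdrant; this block only illustrates the arithmetic. The function name, the chunk ids, and the constant `k=60` (a commonly used value) are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each id earns 1 / (k + rank) for every list it appears in, with
    ranks starting at 1. Raw scores are never compared, only positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["refund-policy", "returns-faq", "shipping"]
bm25_hits = ["error-e4102", "refund-policy", "returns-faq"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
# fused == ["refund-policy", "returns-faq", "error-e4102", "shipping"]
```

Chunks that appear in both lists rise to the top, while a chunk that only BM25 found (the exact error code) still outranks a dense-only hit further down its list.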

The product instinct “we only need embeddings” dies in every real knowledge base by the third day.

05

Reranker — the quality gate

The first 100 candidates from hybrid search are noise with a signal thread. The reranker decides. Cohere rerank-v4.0-pro is our cloud default: ~600ms, well-calibrated 0–1 scores, no GPU headaches. Self-hosted, we run Qwen3-Reranker — a causal LM that emits Yes/No logits per pair and normalises with a sigmoid.
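
The Yes/No scoring can be sketched as a single function. This assumes the sigmoid is applied to the difference between the "Yes" and "No" logits, which is equivalent to a softmax over the two; the function name is hypothetical and the logits would come from the model in practice.

```python
import math


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Map a Yes/No logit pair to a relevance score between 0 and 1."""
    margin = logit_yes - logit_no
    # Branch on the sign so math.exp never receives a large positive value.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


undecided = yes_no_score(0.0, 0.0)    # equal logits give 0.5
relevant = yes_no_score(5.0, -5.0)    # strong "Yes" lands close to 1
```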

Both are cross-lingual: a German chunk is fairly scored against an English query. For tenants that need an “offline-capable” fallback we use cohere_fallback — Cohere first, automatic handoff to the local instance on rate limits. Customers notice nothing.

python
# Cohere first, automatic handoff to the self-hosted reranker on rate limits
settings.RERANKER_BACKEND = "cohere_fallback"

scored = await reranker.rerank(
    query=query,
    documents=candidates,           # top-100 from hybrid search
    top_k=15,
    score_threshold=0.15,           # quiet noise filter
)
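
How such a fallback backend might be wired is sketched below. Everything here is hypothetical: the `RateLimited` exception, the `FallbackReranker` wrapper, and the stub backends stand in for the real Cohere client and the local instance, whose interfaces the text does not show.

```python
import asyncio


class RateLimited(Exception):
    """Hypothetical error raised by the cloud client on rate limits."""


class FallbackReranker:
    """Try the primary reranker first; hand off to the fallback on rate limits."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    async def rerank(self, query, documents, **kwargs):
        try:
            return await self.primary.rerank(query, documents, **kwargs)
        except RateLimited:
            # Only rate limits trigger the handoff; other errors propagate.
            return await self.fallback.rerank(query, documents, **kwargs)


class StubReranker:
    """Stand-in backend for the demo: tags each document with its name."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    async def rerank(self, query, documents, **kwargs):
        if self.fail:
            raise RateLimited(self.name)
        return [(self.name, doc) for doc in documents]


fallback_reranker = FallbackReranker(
    primary=StubReranker("cohere", fail=True),
    fallback=StubReranker("local"),
)
result = asyncio.run(
    fallback_reranker.rerank("refund policy", ["chunk-1", "chunk-2"])
)
```

One thing worth checking in a setup like this is whether both backends produce scores on a comparable scale, since a single score_threshold is applied to whichever one answers.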
06

The one knob that matters: score_threshold

If there is one slider that changes retrieval quality more than any other, it's this one. Reranker scores typically sit between 0 and 1. The default of 0.0 always returns 15 chunks — even from a dumpster. Lift the threshold to 0.1–0.2 and the worst noise disappears. 0.3–0.5 gives you high precision at the price of more empty result sets. Above 0.5 you routinely risk “no answer found” — which in support is often the honest answer.

Reasonable starting points:

  • General concierge / FAQ: 0.10–0.15
  • Support with precise policy binding: 0.25–0.35
  • Medical, legal, financial: 0.35+ with explicit “no match” behaviour

What to avoid:

  • Raising it blindly without checking logs
  • Letting the prompt vary the threshold per question
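
The effect of the threshold is easy to see on a toy result set. This is a minimal sketch: the function name, the scores, and the choice of an inclusive comparison (`>=`) are assumptions for illustration.

```python
def select_chunks(scored, top_k=15, score_threshold=0.0):
    """Keep the best top_k (chunk, score) pairs that meet the threshold."""
    kept = [(chunk, score) for chunk, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


scored = [("a", 0.82), ("b", 0.31), ("c", 0.12), ("d", 0.04)]

everything = select_chunks(scored, score_threshold=0.0)   # all four survive
concierge = select_chunks(scored, score_threshold=0.15)   # "a" and "b"
strict = select_chunks(scored, score_threshold=0.35)      # only "a"
no_match = select_chunks(scored, score_threshold=0.9)     # empty list
```

An empty list is a legitimate outcome here, and the calling agent needs an explicit path for it rather than treating it as an error.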
07

The boring things that matter anyway

MongoDB is ground truth — raw text, chunks, embeddings, change log. Qdrant is a disposable projection: we can rebuild the index in minutes without re-scraping or re-embedding. That makes incident recovery trivial and lets you experiment with chunking without putting production at risk.

Embeddings run through a micro-batcher: instead of an asyncio.Lock with naive serialisation, we collect concurrent requests into GPU-efficient batches (max 64, max 50ms wait). N concurrent users wait ~1× batch time instead of ~N× — in practice the difference between “the KB feels slow” and “the KB is invisible”.
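
A micro-batcher along these lines can be sketched with asyncio alone. This is an illustration under assumptions, not the production code: the class name, the `embed_batch` callable, and the fake embedder in the demo are hypothetical, and only the limits (64 items, 50ms) come from the text.

```python
import asyncio


class MicroBatcher:
    """Collect concurrent embed requests into batches (illustrative sketch)."""

    def __init__(self, embed_batch, max_batch=64, max_wait=0.05):
        self.embed_batch = embed_batch  # async callable: list of texts -> list of vectors
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue = asyncio.Queue()
        self.worker = None

    async def embed(self, text):
        if self.worker is None:  # start the background worker lazily
            self.worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                vectors = await self.embed_batch([text for text, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), vector in zip(batch, vectors):
                if not future.done():
                    future.set_result(vector)


async def demo():
    calls = []

    async def fake_embed_batch(texts):
        calls.append(len(texts))  # record how many texts each call received
        return [[float(len(text))] for text in texts]

    batcher = MicroBatcher(fake_embed_batch)
    vectors = await asyncio.gather(*(batcher.embed(f"q{i}") for i in range(10)))
    batcher.worker.cancel()
    return vectors, calls


vectors, calls = asyncio.run(demo())
```

In the demo, ten concurrent callers end up in a single call to the embedder, which is the behaviour the paragraph above describes: each caller waits roughly one batch time rather than queuing behind the others.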

  • Reranker: ~600ms (Cohere rerank-v4.0-pro)
  • Dense vector: 1024d (BGE-M3)
  • Rerank window: 100 → 15 (candidates → final)
  • End-to-end p50: < 1s (incl. network)

08

Language is a retrieval decision

Every knowledge base has an optional content_language field. When set, the agent graph instructs the tool to formulate queries in that language, even when the user is writing in English and the chunks are in German. The reason is BM25: keyword matching benefits massively when the query and the corpus share a language. The answer still comes back in the user's language; the agent translates internally.
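
The rule itself fits in a few lines. This is a sketch only: the function name and the language codes are hypothetical, and the real decision is made inside the agent graph.

```python
def plan_languages(user_language, content_language=None):
    """Return (query_language, answer_language) for one search call."""
    # Queries follow the knowledge base when content_language is set;
    # the answer always follows the user.
    query_language = content_language or user_language
    return query_language, user_language


german_kb = plan_languages("en", content_language="de")  # ("de", "en")
unset_kb = plan_languages("en")                          # ("en", "en")
```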

09

Appendix: what a search call looks like

To close with something concrete, here is how the agent queries the knowledge base and what it receives back:

python
@register_builtin("search_kb")
class SearchKnowledgeBase(FunctionTool):
    """Hybrid search against the tenant KB with contextual retrieval."""

    async def __call__(
        self,
        query: str,
        top_k: int = 15,
        score_threshold: float = 0.15,
    ) -> list[Chunk]:
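        # Stage 1: encode the query as a dense vector.
        dense = await bge.embed_query(query)
        # Stage 2: hybrid search, dense and BM25 hits fused server-side.
        candidates = await qdrant.hybrid_search(
            tenant_id=ctx.tenant_id,
            dense=dense,
            bm25_text=query,
            limit=100,
        )
        # Stage 3: the cross-encoder rescores the candidates and drops
        # anything below score_threshold.
        reranked = await reranker.rerank(
            query=query,
            documents=candidates,
            top_k=top_k,
            score_threshold=score_threshold,
        )
        # Stage 4: hand the surviving chunks back to the agent.
        return reranked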
No magic. One contract, four stages, one result.