ChatFlow Admin

Why “just embeddings” tips over

The first demo bots of 2023 were seductively simple: parse a document, chunk it, run a standard embedding model, cosine similarity on a vector DB — done. For a polished FAQ that's enough. The moment real documents come in — policies with “Section 3.2 applies only when …”, tables with code columns, three-page PDFs full of cross-references — search becomes a dice roll.

The reason is that dense embeddings know exactly one signal: semantic closeness. They can't tell “returns policy for digital goods” from “returns policy for perishable goods” when the prose is similar. And they know nothing of the four words above the paragraph that carry its entire meaning.

The four-stage pipeline

ChatFlow retrieval runs through four stages, no fewer. Each stage has a clear job and a clear reason it is not skipped.

Stage 1 — Understand: the query is analyzed and encoded — meaning and keywords both count.
Stage 2 — Hybrid search: semantic search and keyword search run side by side; both rankings are fused by rank position.
Stage 3 — Distill: a cross-encoder reads the finalists pairwise against the question and re-sorts them.
Stage 4 — Format: only the strongest passages reach the agent — with citation metadata.

Figure: Query → Encoder → Semantic and Keyword search (in parallel) → Reranker → Best evidence. Four stages, one path, reproducible outcome.
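The four stages can be read as a single function that passes its result from one step to the next. The following is a minimal sketch, not ChatFlow's implementation: the function name `retrieve`, its parameters, and the toy stand-ins are all hypothetical, and citation metadata is left out for brevity.

```python
def retrieve(query, *, encode, semantic_search, keyword_search, fuse, rerank, top_k=3):
    """Run the four stages in order; every stage is passed in as a callable."""
    vector = encode(query)                                   # Stage 1: understand
    candidates = fuse([semantic_search(vector),              # Stage 2: hybrid search,
                       keyword_search(query)])               #          fused into one list
    ranked = rerank(query, candidates)                       # Stage 3: distill
    return ranked[:top_k]                                    # Stage 4: keep only the strongest


# Toy stand-ins so the sketch runs end to end (all assumptions).
evidence = retrieve(
    "refund for Error E-4102",
    encode=lambda q: [float(len(q))],
    semantic_search=lambda v: ["refund-policy", "faq-refunds"],
    keyword_search=lambda q: ["error-e4102", "refund-policy"],
    fuse=lambda rankings: list(dict.fromkeys(d for r in rankings for d in r)),
    rerank=lambda q, docs: sorted(docs),
    top_k=2,
)
```

Passing the stages in as callables keeps the path fixed while each stage stays replaceable, which is one way to get the reproducible outcome the figure describes.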

The context prefix that changes everything

The one technical trick we are reluctant to call a trick is the context prefix. Before embedding, a small, inexpensive model generates one or two sentences per chunk that locate the chunk inside its document: “This section from the Returns Policy document describes the conditions under which digital products can be refunded within 14 days of purchase.”

The prefix is used only for embedding and reranking, never as part of the answer context. Its cost is paid once at ingestion, and it makes retrieval hits robust against ambiguous phrasing. In blind tests it noticeably improves retrieval precision, with zero effect on answer latency.
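One way to keep that separation explicit is to store the prefix next to the chunk and expose two different views of it. This is a sketch under assumptions: the `Chunk` record and its field and method names are hypothetical, and the sample text is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; field names are assumptions."""
    text: str            # original passage from the document
    context_prefix: str  # one or two generated sentences locating the chunk

    def retrieval_text(self) -> str:
        # Text handed to the embedding model and the reranker.
        return f"{self.context_prefix}\n\n{self.text}"

    def answer_text(self) -> str:
        # Text placed in the answer context: the prefix is left out.
        return self.text


chunk = Chunk(
    text="Digital products can be refunded within 14 days of purchase.",
    context_prefix=(
        "This section from the Returns Policy document describes the "
        "conditions under which digital products can be refunded."
    ),
)
```

Because the prefix is generated at ingestion, both views are available at query time without an extra model call.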

Keywords, because not everything is semantic

Dense embeddings correctly link “refund” and “return” — wonderful, as long as users write at the prose layer. They go blind the moment a query carries exact codes, SKUs, or technical tokens. “Error E-4102” has no semantic twin. Keyword search spots it immediately.

Keyword search runs server-side, directly inside the search index — no separate side pipeline to maintain. We fuse the two rankings by rank position rather than weighted scores: robust across scale differences, with no fragile tuning to babysit.
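A common way to fuse by rank position is reciprocal rank fusion (RRF). The article does not name the exact formula, so the sketch below assumes RRF with the conventional constant k = 60; the function name and document ids are hypothetical.

```python
from collections import defaultdict


def fuse_by_rank(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["refund-policy", "faq-refunds", "shipping"]
keyword = ["error-e4102", "refund-policy", "faq-refunds"]
fused = fuse_by_rank([semantic, keyword])
# fused == ["refund-policy", "faq-refunds", "error-e4102", "shipping"]
```

Only positions enter the formula, so it does not matter that cosine similarities and keyword scores live on different scales. Documents found by both searches rise to the top, while an exact-match hit such as the error code still makes the list.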

The product instinct “we only need embeddings” dies in every real knowledge base by the third day.

Reranker — the quality gate

The first candidates from hybrid search are mostly noise with a thread of signal. The reranker decides what survives: a cross-encoder reads question and passage together and re-scores every pair, far more precisely than an embedding comparison can. We run it either as a cloud reranker or as an open model on your own hardware.

Both variants are cross-lingual: a German chunk is fairly scored against an English query. For tenants that need an offline-capable setup there is a fallback mode — cloud first, automatic handoff to the local instance under pressure. Customers notice nothing.
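The cloud-first handoff can be sketched as a try/except around two interchangeable scorers. This is an illustration under assumptions: it treats "pressure" as a timeout or connection error, and the function names and the toy word-overlap scorer are hypothetical stand-ins for real cross-encoders.

```python
def rerank_with_fallback(query, passages, cloud, local):
    """Score with the cloud reranker first; hand off to the local one on failure."""
    try:
        scores = cloud(query, passages)
    except (TimeoutError, ConnectionError):
        scores = local(query, passages)
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked]


def overloaded_cloud(query, passages):
    # Simulates a cloud reranker that is under pressure.
    raise TimeoutError("cloud reranker did not answer in time")


def local_scorer(query, passages):
    # Toy stand-in for a cross-encoder: counts shared lowercase words.
    q = set(query.lower().split())
    return [len(q & set(p.lower().split())) for p in passages]


ranked = rerank_with_fallback(
    "refund digital goods",
    ["Shipping times for parcels", "Refund rules for digital goods"],
    cloud=overloaded_cloud,
    local=local_scorer,
)
```

Both scorers share one signature, so the caller gets the same kind of result whichever one answered.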

The one knob that matters: the relevance threshold

If there is one slider that changes retrieval quality more than any other, it's the reranker's relevance threshold. Set it loose and the system always answers, even from a dumpster. Set it strict and precision climbs, along with the number of “I have nothing solid on that” replies, which in support is often the honest answer. That trade-off is a product decision, not a purely technical one.

Recommended settings by use case:

General concierge / FAQ: loose, better to help than to stay silent
Support with precise policy binding: strict, better to ask than to guess
Medical, legal, financial: very strict, with explicit “no match” behaviour

What to avoid:

Tightening it blindly without checking logs
Letting the prompt vary the threshold per question
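As a sketch, the threshold can be a fixed setting per knowledge base that is applied after reranking. The preset names and numeric values below are hypothetical, and the scores are assumed to lie between 0 and 1; real values should come from reviewing retrieval logs.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    passage: str
    score: float  # reranker relevance, assumed to lie in [0, 1]


# Hypothetical presets, set per knowledge base and never per question.
THRESHOLDS = {"concierge": 0.3, "support": 0.5, "regulated": 0.7}


def gate(candidates: list[Scored], profile: str) -> list[Scored]:
    """Keep passages at or above the profile's threshold, best first."""
    threshold = THRESHOLDS[profile]
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [Scored("a", 0.45), Scored("b", 0.62), Scored("c", 0.20)]
loose = gate(candidates, "concierge")
strict = gate(candidates, "support")
very_strict = gate(candidates, "regulated")
```

An empty result is the signal for explicit "no match" behaviour: with these sample scores the regulated profile returns nothing, and the agent should say so rather than guess.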

The boring things that matter anyway

The raw material — text, chunks, processing state — lives apart from the search index. The index itself is deliberately disposable: it can be rebuilt in minutes without re-scraping or re-processing. That makes incident recovery trivial and lets you experiment with chunking without putting production at risk.

Processing requests are batched instead of naively serialised: concurrent requests travel through the models in efficient batches. N concurrent users wait ~1× batch time instead of ~N× — in practice the difference between “the KB feels slow” and “the KB is invisible”.
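The batching idea can be sketched with `asyncio`: requests that arrive within a short window are collected and sent through the model in one call. The class name `MicroBatcher`, the 10 ms window, and the fake model are assumptions for illustration, not ChatFlow's implementation.

```python
import asyncio


class MicroBatcher:
    """Collects concurrent requests and runs them through the model as one batch."""

    def __init__(self, run_batch, max_wait=0.01):
        self._run_batch = run_batch  # callable: list of inputs -> list of outputs
        self._max_wait = max_wait    # seconds to wait for more requests to arrive
        self._pending = []           # (input, future) pairs waiting for the next flush
        self._flush_task = None

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        self._pending.append((item, future))
        if self._flush_task is None:
            self._flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self._max_wait)
        batch, self._pending = self._pending, []
        self._flush_task = None
        try:
            outputs = self._run_batch([item for item, _ in batch])
        except Exception as exc:
            # Propagate a model failure to every waiting caller.
            for _, future in batch:
                future.set_exception(exc)
            return
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


batch_sizes = []


def fake_model(batch):
    batch_sizes.append(len(batch))  # record how many inputs arrived together
    return [text.upper() for text in batch]


async def main():
    batcher = MicroBatcher(fake_model)
    return await asyncio.gather(*(batcher.submit(q) for q in ["a", "b", "c"]))


results = asyncio.run(main())
```

Three concurrent callers trigger a single model call, so each waits roughly one batch time instead of queuing behind the others.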

Search signals: semantic + keyword
Quality gate: cross-encoder rerank
End-to-end p50: < 1s (incl. network)
Index rebuild: minutes (no re-processing)

Language is a retrieval decision

Every knowledge base has an optional content_language field. When set, the agent graph instructs the tool to formulate queries in that language — even when the user is writing English and the chunks are in German. The reason is keyword matching: it benefits massively when the query and the corpus share a language. The answer still comes back in the user's language — the agent translates internally.
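A minimal sketch of how that setting could steer the query language. Only `content_language` comes from the text above; the `KnowledgeBase` record, the helper names, the instruction wording, and the fallback to the user's language when the field is unset are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnowledgeBase:
    name: str
    content_language: Optional[str] = None  # e.g. "de"; optional, as described above


def query_language(kb: KnowledgeBase, user_language: str) -> str:
    """Language for search queries: the corpus language if set, else the user's (assumed)."""
    return kb.content_language or user_language


def tool_instruction(kb: KnowledgeBase, user_language: str) -> str:
    # Hypothetical wording of the instruction passed to the retrieval tool.
    lang = query_language(kb, user_language)
    return f"Formulate search queries in '{lang}'. Answer the user in '{user_language}'."


german_kb = KnowledgeBase(name="policies", content_language="de")
instruction = tool_instruction(german_kb, user_language="en")
```

Search language and answer language are kept as two separate values, so keyword matching runs in the corpus language while the reply stays in the user's language.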

Appendix: what the agent receives

To close with something concrete, here is the result the agent works with:

```json
{
  "results": [
    {
      "source": "returns-policy.pdf · section 4",
      "context": "This section describes refund conditions for digital products.",
      "passage": "Digital products can be refunded within 14 days of purchase if …",
      "relevance": "high"
    },
    {
      "source": "faq.md · refunds",
      "context": "FAQ entry on how refunds are paid out.",
      "passage": "Refunds are returned to the original payment method within …",
      "relevance": "medium"
    }
  ],
  "citations": true
}
```
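On the consuming side, a result of this shape can be turned into numbered, cited evidence lines. The helper name `format_evidence` and its `keep` parameter are hypothetical; the sample payload is a shortened copy of the result above, with the elided passages left as they are.

```python
import json

payload = """
{
  "results": [
    {"source": "returns-policy.pdf · section 4",
     "passage": "Digital products can be refunded within 14 days of purchase if …",
     "relevance": "high"},
    {"source": "faq.md · refunds",
     "passage": "Refunds are returned to the original payment method within …",
     "relevance": "medium"}
  ],
  "citations": true
}
"""


def format_evidence(raw: str, keep=("high", "medium")) -> list[str]:
    """Turn the tool result into numbered evidence lines with their source."""
    data = json.loads(raw)
    kept = [r for r in data["results"] if r["relevance"] in keep]
    return [f"[{i}] {r['passage']} ({r['source']})" for i, r in enumerate(kept, start=1)]


all_lines = format_evidence(payload)
high_only = format_evidence(payload, keep=("high",))
```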

No magic. One contract: a question in, cited evidence out.
