01

The checkbox trap

When a product has been built exclusively against the OpenAI API, “self-hosting” is a euphemism for “we'll park the binary somewhere else”. The intelligent layer — the actual work — still sits in the cloud. The customer ends up with an expensive container and no real sovereignty. That isn't self-hosting, that's theatre.

Real self-hosting means: every model, every component that sees data, can be swapped for a locally running alternative without the product falling over. That's an architectural decision made up-front, not a feature retrofitted later.

02

What actually has to move

Five components see sensitive data when ChatFlow works: the embedding model, the reranker, the main LLM, ASR (voice in), TTS (voice out). For each of them we operate a self-hosted option in production.

  • Embeddings: BGE-M3, 1024-dim, CPU-capable, GPU recommended — dedicated micro-service with batcher
  • Reranker: Qwen3-Reranker, TEI-compatible API, 0.6B model for dev, 4B on CUDA for prod
  • LLM: anything that speaks OpenAI-compatible — vLLM, Ollama, Azure OpenAI in the EU
  • ASR: Qwen3-ASR via MLX (Mac) or vLLM (Linux) — WebSocket stream, Silero VAD in front
  • TTS: Qwen3-TTS, sentence-wise streaming — the caller hears sentence one while the LLM produces sentence two
03

What's cheap, what's expensive

Embeddings and rerankers are affordable to self-host: a single A10 / L4 / RTX 4090 handles thousands of requests per day. The cost curve bends at the main LLM: a Qwen3-32B wants an H100 or two A100s — and then your hosting price competes with OpenAI's list, where they ship the model at half your cost. That's the uncomfortable truth of self-hosting foundation models.

So the pragmatic recommendation is: embeddings + reranker + small models (context prefix, classification, triage) locally. The large main model, as long as it's defensible, via Azure OpenAI EU or an equivalent managed offering with EU data residency. That's not a compromise, that's economic honesty.

GPU for embeddings
A10 / L4 / 4090
GPU for reranker
Qwen3-Reranker 4B
GPU for large LLM
A100 / H100
~EU
Data residency
stays in house
04

The hybrid that works in practice

For most mid-market customers the sweet spot is a hybrid: datastores and inference layers that see customer data live locally (Postgres, Mongo, Qdrant, embeddings, reranker, optional voice). Calls to a large foundation model leave the building — but only the question plus the selected chunks. No raw documents, no full conversations, no tenant context beyond what's needed.

Self-hosting isn't “everything at our place”. Self-hosting is “what has to stay with us, stays — and we know exactly what leaves”.

05

The pricing conversation

Self-hosting forces a different pricing conversation. Cloud SaaS is “per conversation” or “per token”. On-prem is usually a licence plus an operational fee. Neither is hardwired into ChatFlow — the platform supports both paths. That's intentional: if your customers need sovereignty, you cannot insist that the invoice look like a volume SaaS invoice.

But it also means: the self-hosted option cannot be a surcharge on top of the cloud version; it has to be the actual substance. Self-hosted customers get less convenient updates and more control. Cloud customers get more convenience and accept documented cross-border transfers. Both contracts are honest.