The checkbox trap
When a product has been built exclusively against the OpenAI API, “self-hosting” is a euphemism for “we'll park the binary somewhere else”. The intelligent layer — the actual work — still sits in the cloud. The customer ends up with an expensive container and no real sovereignty. That isn't self-hosting, that's theatre.
Real self-hosting means: every model, every component that sees data, can be swapped for a locally running alternative without the product falling over. That's an architectural decision made up-front, not a feature retrofitted later.
What actually has to move
Five components see sensitive data when ChatFlow works: the embedding model, the reranker, the main LLM, ASR (voice in), TTS (voice out). For each of them we operate a self-hosted option in production.
- Embeddings: BGE-M3, 1024-dim, CPU-capable, GPU recommended — dedicated micro-service with batcher
- Reranker: Qwen3-Reranker, TEI-compatible API, 0.6B model for dev, 4B on CUDA for prod
- LLM: anything that speaks OpenAI-compatible — vLLM, Ollama, Azure OpenAI in the EU
- ASR: Qwen3-ASR via MLX (Mac) or vLLM (Linux) — WebSocket stream, Silero VAD in front
- TTS: Qwen3-TTS, sentence-wise streaming — the caller hears sentence one while the LLM produces sentence two
What's cheap, what's expensive
Embeddings and rerankers are affordable to self-host: a single A10 / L4 / RTX 4090 handles thousands of requests per day. The cost curve bends at the main LLM: a Qwen3-32B wants an H100 or two A100s — and then your hosting price competes with OpenAI's list, where they ship the model at half your cost. That's the uncomfortable truth of self-hosting foundation models.
So the pragmatic recommendation is: embeddings + reranker + small models (context prefix, classification, triage) locally. The large main model, as long as it's defensible, via Azure OpenAI EU or an equivalent managed offering with EU data residency. That's not a compromise, that's economic honesty.
The hybrid that works in practice
For most mid-market customers the sweet spot is a hybrid: datastores and inference layers that see customer data live locally (Postgres, Mongo, Qdrant, embeddings, reranker, optional voice). Calls to a large foundation model leave the building — but only the question plus the selected chunks. No raw documents, no full conversations, no tenant context beyond what's needed.
Self-hosting isn't “everything at our place”. Self-hosting is “what has to stay with us, stays — and we know exactly what leaves”.
The pricing conversation
Self-hosting forces a different pricing conversation. Cloud SaaS is “per conversation” or “per token”. On-prem is usually a licence plus an operational fee. Neither is hardwired into ChatFlow — the platform supports both paths. That's intentional: if your customers need sovereignty, you cannot insist that the invoice look like a volume SaaS invoice.
But it also means: the self-hosted option cannot be a surcharge on top of the cloud version; it has to be the actual substance. Self-hosted customers get less convenient updates and more control. Cloud customers get more convenience and accept documented cross-border transfers. Both contracts are honest.