ChatFlow Admin

01

The easy part

Building a voice demo today is ridiculously easy. Ten lines of code, an OpenAI Realtime key, a WebRTC handshake, and the browser holds a fluent conversation in practically any language you want. That's the part that seduces the product instinct: “let's roll out voice everywhere”.

Most honest articles stop there. We'll keep going, because the problems a glittering demo runs into in production are not the ones the API docs warn you about.

02

What voice actually is

Voice is not chat with audio. It is its own modality with its own constraints. There is no scrollback — whatever was said two seconds ago is gone unless the user holds it in their head. There are no lists, no code blocks, no links. And there is no natural pause in which a handoff can happen without sounding like “one moment, let me transfer you”.

The consequences: answers must be short (as a rule of thumb, 200 tokens ≈ 30 seconds of speech), tool calls must be fast (every second of silence feels unnatural), and routing must be decided up front. Voice rewards simple graphs and punishes complexity mercilessly.
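To make the length constraint concrete, here is a minimal sketch that turns the rule of thumb above into a token budget. The function names are ours and purely illustrative, and the conversion rate is only the 200-tokens-per-30-seconds estimate from the text; the real rate depends on tokenizer, language, and speaking speed.

```python
# Rule of thumb from the text: 200 tokens is roughly 30 seconds of speech.
# Assumption: treat this as a linear rate; measure your own voice stack.
RULE_TOKENS = 200
RULE_SECONDS = 30


def max_tokens_for(seconds: float) -> int:
    """Largest whole token budget that fits into `seconds` of speech."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return int(seconds * RULE_TOKENS / RULE_SECONDS)


def spoken_seconds(tokens: int) -> float:
    """Estimated speaking time for an answer of `tokens` tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens * RULE_SECONDS / RULE_TOKENS


if __name__ == "__main__":
    # A 15-second answer leaves room for about 100 tokens.
    print(max_tokens_for(15))
    print(spoken_seconds(200))
```

A budget like this can be passed as the output-token cap of the voice agent, so an over-long answer is cut at generation time rather than read out in full.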

In text an agent may “think”. In voice, the user listens to you being silent.

03

Where voice works

The cases where voice carries its weight share a pattern: short, identity-bound tasks where the user's hands are busy anyway. Think appointment changes, simple status checks, re-ordering a product with a saved address, or identifying the caller before a human support agent takes over.

Voice fits when:
The user speaks with a clear intent (booking, status, re-order)
The answer fits in two sentences
A real tool does something, so it's not a FAQ lookup
The channel is phone or a screen-less device

Voice does not fit when:
The answer includes comparison tables or more than three options
The user has to type numbers, emails, or serials
It's a multi-step walkthrough with confirmations per step
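The checklist above can be sketched as a small pre-flight check for a candidate use case. The `UseCase` fields and the `voice_blockers` function are hypothetical names chosen for illustration; the thresholds (two sentences, three options) come straight from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    # Hypothetical description of a candidate use case.
    clear_intent: bool
    answer_sentences: int
    calls_real_tool: bool
    screenless_channel: bool
    option_count: int = 1
    needs_table: bool = False
    needs_typed_input: bool = False
    confirmed_steps: int = 0  # steps that each need a confirmation


def voice_blockers(case: UseCase) -> list[str]:
    """Return the reasons a use case is a poor fit for voice (empty = fits)."""
    blockers = []
    if not case.clear_intent:
        blockers.append("no clear intent")
    if case.answer_sentences > 2:
        blockers.append("answer longer than two sentences")
    if not case.calls_real_tool:
        blockers.append("no real tool call, just a FAQ lookup")
    if not case.screenless_channel:
        blockers.append("channel has a screen")
    if case.needs_table or case.option_count > 3:
        blockers.append("comparison table or more than three options")
    if case.needs_typed_input:
        blockers.append("user has to type numbers, emails, or serials")
    if case.confirmed_steps > 1:
        blockers.append("multi-step walkthrough with confirmations")
    return blockers


if __name__ == "__main__":
    reorder = UseCase(True, 2, True, True)
    advice = UseCase(True, 6, False, False, option_count=5)
    print(voice_blockers(reorder))
    print(voice_blockers(advice))
```

Returning the list of reasons instead of a bare yes/no keeps the discussion with product owners concrete: each blocker maps to one line of the checklist.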

04

Where voice tips over

On the flip side, we've seen voice projects fail because the conversation isn't a fit for voice. Advisory dialogues with five product options are trivial in text and brutal in voice. Technical troubleshooting flows with screenshots work in text in five minutes and don't work on voice at all. Anything the user has to re-read, save, or share does not belong on the phone.

The honest rule: voice is a second channel, not a replacement. The same agent backend can serve both a text widget and a phone call, but the skills, the answer length, and the level of detail should be configured per channel. Whoever just “puts the text persona on voice” ends up with an agent reading its own blog post to the caller.
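One way to picture per-channel configuration is a profile object that the shared backend looks up per request. This is a sketch under assumptions: `ChannelProfile`, the skill names, and the numbers are invented for illustration and are not the platform's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelProfile:
    # Hypothetical per-channel settings for one shared agent backend.
    max_answer_tokens: int
    detail_level: str
    skills: frozenset


PROFILES = {
    # Text can afford long answers, lists, and advisory skills.
    "text": ChannelProfile(
        max_answer_tokens=800,
        detail_level="full",
        skills=frozenset({"status", "rebook", "reorder", "product_advice"}),
    ),
    # Voice gets short answers and only the skills that survive without a screen.
    "voice": ChannelProfile(
        max_answer_tokens=200,
        detail_level="brief",
        skills=frozenset({"status", "rebook", "reorder", "handoff_to_human"}),
    ),
}


def profile_for(channel: str) -> ChannelProfile:
    """Look up the profile for a channel, failing loudly on unknown channels."""
    try:
        return PROFILES[channel]
    except KeyError:
        raise ValueError(f"unknown channel: {channel!r}") from None


if __name__ == "__main__":
    print(profile_for("voice"))
```

The point of the design is that the voice profile is a deliberate subset, not a copy: advisory skills stay on the text side.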

05

Telephony is a different planet

Browser voice over WebRTC is a walk in the park compared to classic telephony. On the phone we speak through codec chains that have been compressing human speech since the 1970s and that drop anything outside the voice band. Semantic turn detection struggles because the frequencies are clipped. Robust, conservatively tuned server-side detection is the only reliable option here.
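As a rough sketch of what "conservatively tuned" can mean, the helper below picks a turn-detection setting per transport. The field names follow the `turn_detection` object of the OpenAI Realtime session configuration as we understand it; treat them as an assumption and check the current API reference. The numeric values are illustrative, not recommendations.

```python
def turn_detection_for(transport: str) -> dict:
    """Pick a turn-detection config for a transport (illustrative values)."""
    if transport == "phone":
        # Band-limited audio: rely on server-side VAD, with a higher
        # threshold and a longer silence window than we would use in a browser.
        return {
            "type": "server_vad",
            "threshold": 0.6,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 700,
        }
    if transport == "browser":
        # Wideband audio over WebRTC: semantic detection has enough signal.
        return {"type": "semantic_vad"}
    raise ValueError(f"unknown transport: {transport!r}")


if __name__ == "__main__":
    print(turn_detection_for("phone"))
```

A longer silence window costs a little responsiveness, which is the trade we accept on the phone in exchange for fewer false interruptions.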

We support several roads into the phone network: established cloud providers like Twilio and Vonage (the latter with EU data residency), plus a self-hosted SIP option that connects to your own phone system with full control. Each road brings its own audio chain; the platform takes care of the uncomfortable adaptation details. We never apply gain: on the phone, gain amplifies noise, not voice.

06

Handoffs — the open problem

Time to be honest: cloud realtime APIs are single-agent at heart. The audio stream offers no natural break at which one agent could be swapped for another. The only strategy that holds up is distilling multiple roles into a single voice prompt — with one tool that hands off to a human when it gets complicated.
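The single-prompt strategy can be sketched as follows: several roles are folded into one instruction text, and the only tool is the handoff to a human. The role names, the `handoff_to_human` tool, and the `session_config` helper are hypothetical; the tool dictionary is modeled on the common JSON-schema function-tool shape and should be adapted to whatever your realtime API expects.

```python
# Hypothetical roles that would be separate agents in a text graph.
ROLES = {
    "booking": "Change or cancel appointments.",
    "status": "Report the status of an order.",
    "reorder": "Re-order a product to the saved address.",
}

# The single escape hatch: a tool that hands the call to a person.
HANDOFF_TOOL = {
    "type": "function",
    "name": "handoff_to_human",
    "description": "Transfer the caller to a human agent.",
    "parameters": {
        "type": "object",
        "properties": {"reason": {"type": "string"}},
        "required": ["reason"],
    },
}


def build_voice_prompt(roles: dict) -> str:
    """Distill several roles into one instruction text for a single agent."""
    lines = ["You are one voice assistant with these responsibilities:"]
    for name, duty in roles.items():
        lines.append(f"- {name}: {duty}")
    lines.append("Answer in at most two sentences.")
    lines.append(
        "If a request falls outside these responsibilities or gets "
        "complicated, call handoff_to_human."
    )
    return "\n".join(lines)


def session_config(roles: dict) -> dict:
    """Bundle the distilled prompt with the one handoff tool."""
    return {"instructions": build_voice_prompt(roles), "tools": [HANDOFF_TOOL]}


if __name__ == "__main__":
    print(session_config(ROLES)["instructions"])
```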

For tenants who want a real multi-agent setup on the phone, we offer the self-hosted path: voice-activity detection, speech recognition, the full agent graph with handoffs and tools, then speech synthesis — open models on your hardware, end to end. Latency sits somewhat above the cloud path, but you get the same graph as in text. Pros and cons on the table — we don't sell an illusion.
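The self-hosted path is a chain of four stages, and a minimal sketch makes clear where the extra latency accumulates. The stage functions are passed in as plain callables; their names and signatures are our assumption for illustration, and any real VAD, recognition, or synthesis model would sit behind them.

```python
import time
from typing import Callable, Optional


def run_turn(
    audio: bytes,
    *,
    detect_speech: Callable[[bytes], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    run_graph: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple:
    """Run one caller turn through VAD -> ASR -> agent graph -> TTS.

    Returns (reply_audio, timings); reply_audio is None if no speech was found.
    """
    timings = {}

    def timed(name, stage, arg):
        # Measure each stage separately so slow links are visible.
        start = time.perf_counter()
        result = stage(arg)
        timings[name] = time.perf_counter() - start
        return result

    speech = timed("vad", detect_speech, audio)
    if speech is None:
        return None, timings
    text = timed("asr", transcribe, speech)
    reply = timed("graph", run_graph, text)
    return timed("tts", synthesize, reply), timings


if __name__ == "__main__":
    # Stub stages stand in for real models.
    out, timings = run_turn(
        b"hello",
        detect_speech=lambda a: a or None,
        transcribe=lambda a: a.decode(),
        run_graph=lambda t: f"You said {t}.",
        synthesize=lambda t: t.encode(),
    )
    print(out, sorted(timings))
```

Because the graph stage is just a function from text to text, it can be the same agent graph, handoffs included, that serves the text channel.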

When voice is actually useful — and when it isn't.
