01

The easy part

Building a voice demo today is ridiculously easy. Ten lines of code, an OpenAI Realtime key, a WebRTC handshake, and the browser holds a fluent conversation in any language you want. That's the part that seduces the product instinct: “let's roll out voice everywhere”.

Most articles stop there. We'll keep going, because what glitters in the demo hits problems in production that are not the ones the API docs warn you about.

02

What voice actually is

Voice is not chat with audio. It is its own modality with its own constraints. There is no scrollback — whatever was said two seconds ago is gone unless the user holds it in their head. There are no lists, no code blocks, no links. And there is no natural pause in which a handoff can happen without sounding like “one moment, let me transfer you”.

The consequences: answers must be short (200 tokens ≈ 30 seconds of speech), tool calls must be fast (every second of silence feels unnatural), and routing must be decided up-front. Voice rewards simple graphs and punishes complexity mercilessly.
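The length constraint can be checked before an answer is ever spoken. A minimal sketch, assuming the rule of thumb above (200 tokens ≈ 30 seconds) holds for your voice and language; the function names and the 30-second default are illustrative, not part of any API:

```python
# Rule of thumb taken from the text: 200 tokens is roughly 30 seconds of speech.
# Real speaking rates vary by voice, language, and content, so treat this as a
# coarse guard, not a measurement.
TOKENS_PER_WINDOW = 200
SECONDS_PER_WINDOW = 30.0


def estimated_speech_seconds(token_count: int) -> float:
    """Estimate how long a reply of `token_count` tokens takes to speak."""
    return token_count * SECONDS_PER_WINDOW / TOKENS_PER_WINDOW


def fits_voice_budget(token_count: int, max_seconds: float = 30.0) -> bool:
    """Return True if the reply is short enough to be spoken within the budget."""
    return estimated_speech_seconds(token_count) <= max_seconds
```

A guard like this is most useful as a cap on the model's maximum output tokens per channel, so that long answers never get generated in the first place.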

In text an agent may “think”. In voice, the user listens to you being silent.

03

Where voice works

The cases where voice carries its weight share a pattern: short, identity-bound tasks where the user's hands are tied up anyway. Appointment changes, simple status checks, re-ordering a product with a saved address, support identification before a human agent takes over.

Voice fits when:

  • The user speaks with a clear intent (booking, status, re-order)
  • The answer fits in two sentences
  • A real tool does something (it's not a FAQ lookup)
  • The channel is phone or a screen-less device

Voice does not fit when:

  • The answer includes comparison tables or more than three options
  • The user has to type numbers, emails, or serials
  • It's a multi-step walkthrough with confirmations per step

04

Where voice tips over

On the flip side, we've seen voice projects fail because the conversation isn't a fit for voice. Advisory dialogues with five product options are trivial in text and brutal in voice. Technical troubleshooting flows with screenshots work in text in five minutes and don't work on voice at all. Anything the user has to re-read, save, or share does not belong on the phone.

The honest rule: voice is a second channel, not a replacement. The same agent backend can serve both a text widget and a phone call, but the skills, the answer length, and the level of detail should be configured per channel. Whoever just “puts the text persona on voice” ends up with an agent reading its own blog post to the caller.
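One way to picture per-channel configuration is a small profile object that the shared backend looks up per request. This is a sketch only: the `ChannelProfile` type, the skill names, and all numbers are hypothetical and not taken from any real product configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelProfile:
    """Per-channel limits for one shared agent backend (all fields illustrative)."""
    channel: str
    max_answer_tokens: int
    allow_lists_and_links: bool
    skills: tuple[str, ...]


# Hypothetical values: voice gets fewer skills and much shorter answers.
PROFILES = {
    "text": ChannelProfile("text", 800, True,
                           ("booking", "status", "reorder", "product_advice")),
    "voice": ChannelProfile("voice", 200, False,
                            ("booking", "status", "reorder")),
}


def profile_for(channel: str) -> ChannelProfile:
    """Look up the profile for a channel, failing loudly on unknown channels."""
    try:
        return PROFILES[channel]
    except KeyError:
        raise ValueError(f"no profile configured for channel {channel!r}")
```

The point of the design is that the persona and tools stay in one backend, while the channel decides what is switched on and how long an answer may run.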

05

Telephony is a different planet

Browser voice over WebRTC is a walk in the park compared to classic telephony. On the phone we speak G.711 mulaw at 8 kHz, a codec that has been compressing human speech since the 70s and discards most of what lies outside the narrow speech band. Semantic VAD struggles because the frequencies are clipped. In our experience, server VAD with fixed thresholds (800 ms silence, threshold 0.5) is the only robust option here.
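The thresholds above map onto the turn-detection settings of a Realtime session. The sketch below builds a `session.update` event with `server_vad`, using field names as documented for the Realtime API at the time of writing; the event layout has changed between API versions, so check it against the current reference. The helper name `build_phone_session_update` is hypothetical.

```python
import json


def build_phone_session_update(silence_ms: int = 800, threshold: float = 0.5) -> dict:
    """Build a session.update event that pins turn detection to server VAD.

    Assumption: `turn_detection` sits directly under `session`, as in the
    earlier Realtime API event layout. Newer versions may nest it differently.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,             # speech probability cutoff
                "silence_duration_ms": silence_ms,  # silence that ends a turn
            },
        },
    }


if __name__ == "__main__":
    # The event would be sent as a JSON text frame over the Realtime connection.
    print(json.dumps(build_phone_session_update(), indent=2))
```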

We support three providers: Twilio (mulaw, easy on-ramp), Vonage (L16 PCM at 16 kHz, EU data residency, highest audio quality), and FreeSWITCH (self-hosted, PBX extensions, full control). Each provider has its own audio pipeline in the bridge service: resampling from 8 kHz or 16 kHz up to 24 kHz for OpenAI Realtime, and back down on the return trip. No gain is applied, ever: on the phone you amplify noise, not voice.
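To make the inbound leg concrete, here is a minimal pure-Python sketch of the 8 kHz mulaw path: decode G.711 mulaw bytes to 16-bit linear PCM, then upsample by a factor of three to 24 kHz. It is not the bridge service's actual code. Linear interpolation stands in for the resampler, where a production pipeline would more likely use a proper low-pass or polyphase filter, and the function names are illustrative.

```python
MULAW_BIAS = 0x84  # bias constant from the G.711 mulaw expansion


def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mulaw byte to a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF                 # mulaw bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if sign else magnitude


def upsample_linear(samples: list[int], factor: int) -> list[int]:
    """Upsample by an integer factor using linear interpolation (no filtering)."""
    out = []
    for i, current in enumerate(samples):
        # Hold the last sample at the end of the buffer.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        for k in range(factor):
            out.append(current + (nxt - current) * k // factor)
    return out


def phone_frame_to_24k(mulaw_frame: bytes) -> list[int]:
    """Decode an 8 kHz mulaw frame and upsample it to 24 kHz PCM samples."""
    pcm_8k = [mulaw_to_linear(b) for b in mulaw_frame]
    return upsample_linear(pcm_8k, 3)  # 8 kHz * 3 = 24 kHz, no gain applied
```

Note that no gain stage appears anywhere in the chain, in line with the rule above: samples are expanded and interpolated, never scaled.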

06

Handoffs — the open problem

Time to be honest: OpenAI Realtime is single-agent. The audio stream offers no natural break at which one agent could be swapped for another. The only strategy that holds up is distilling multiple roles into a single voice prompt — with one tool that hands off to a human when it gets complicated.
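The single-prompt strategy can be sketched as two pieces: a prompt that folds several roles into one persona, and one function tool for escalation. The tool name `handoff_to_human`, the role names, and the prompt wording are all hypothetical. The tool dictionary follows the flat function-tool shape used by the Realtime API as far as I know it, so verify it against the current reference before relying on it.

```python
def build_voice_prompt(roles: dict[str, str]) -> str:
    """Fold several agent roles into one voice prompt (wording is illustrative)."""
    lines = ["You are a single phone assistant covering these responsibilities:"]
    for name, duty in roles.items():
        lines.append(f"- {name}: {duty}")
    lines.append(
        "Keep answers to two sentences. If a request falls outside these "
        "responsibilities or gets complicated, call handoff_to_human."
    )
    return "\n".join(lines)


# Hypothetical escalation tool; the only way out of the single-agent session.
HANDOFF_TOOL = {
    "type": "function",
    "name": "handoff_to_human",
    "description": "Transfer the caller to a human agent when the request is too complex.",
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {
                "type": "string",
                "description": "Short summary of why the handoff is needed.",
            },
        },
        "required": ["reason"],
    },
}
```

For example, `build_voice_prompt({"booking": "change appointments", "status": "report order status"})` yields one prompt that covers both roles and names the escalation path.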

For tenants who want a real multi-agent setup on the phone, we offer the self-hosted path: Silero VAD → Qwen3-ASR → Agents SDK Runner (full graphs, handoffs, MCP) → Qwen3-TTS. Latency is 200–400 ms worse than with OpenAI Realtime, but you get the same graph as in text. The pros and cons are on the table; we don't sell an illusion.
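The shape of that cascaded pipeline can be sketched with plain callables for each stage. None of the names below are the real APIs of Silero VAD, Qwen3-ASR, the Agents SDK, or Qwen3-TTS. They are hypothetical stand-ins that only show how one turn flows through the stages, and why each stage adds latency that an end-to-end speech model avoids.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoicePipeline:
    """One turn through a cascaded voice stack; every stage is a plain callable."""
    detect_speech: Callable[[bytes], bool]   # stand-in for the VAD stage
    transcribe: Callable[[bytes], str]       # stand-in for the ASR stage
    run_agent: Callable[[str], str]          # stand-in for the agent graph runner
    synthesize: Callable[[str], bytes]       # stand-in for the TTS stage

    def handle_turn(self, audio: bytes) -> Optional[bytes]:
        """Return reply audio for a chunk of caller audio, or None for silence."""
        if not self.detect_speech(audio):
            return None
        text = self.transcribe(audio)
        reply = self.run_agent(text)   # handoffs between agents happen in here
        return self.synthesize(reply)


if __name__ == "__main__":
    # Stub stages for illustration only; real stages would wrap the models.
    pipeline = VoicePipeline(
        detect_speech=lambda audio: len(audio) > 0,
        transcribe=lambda audio: "where is my order",
        run_agent=lambda text: f"You asked: {text}",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
    print(pipeline.handle_turn(b"\x00\x01"))
```

Because the agent stage receives plain text, it can run the same graph as the text channel, which is the trade the paragraph above describes: more latency, full graph.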