Multi-language IVR systems with TVoice

A telecom or bank IVR running across India needs to handle dozens of language variants without making the caller wait. Three engineering challenges show up again and again:

1. Language detection on the first utterance

Asking the caller to "press 1 for English, 2 for Hindi" works but adds 4-6 seconds of friction. A better pattern: stream the first 1-2 seconds of audio through a lightweight language identifier, pick the language, then continue the conversation in it. TVoice's language ID runs in ~80ms and supports the twelve most-spoken Indian languages.
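
The detect-then-continue pattern can be sketched roughly as below. This is a minimal illustration, not TVoice's actual API: the `identify` callable, the `LanguageGuess` type, the audio format (8 kHz, 16-bit mono PCM) and the 0.7 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

SAMPLE_RATE_HZ = 8000    # assumed: narrowband telephony audio
BYTES_PER_SAMPLE = 2     # assumed: 16-bit linear PCM, mono
DETECT_WINDOW_S = 1.5    # inside the 1-2 second window described above
MIN_CONFIDENCE = 0.7     # hypothetical threshold; tune on real call data


@dataclass
class LanguageGuess:
    code: str          # e.g. "hi", "ta", "en"
    confidence: float  # 0.0 to 1.0


def detect_call_language(
    chunks: Iterable[bytes],
    identify: Callable[[bytes], LanguageGuess],
) -> Optional[str]:
    """Buffer the start of the call, then query the identifier once.

    Returns a language code, or None when the caller should be
    offered the touch-tone language menu instead.
    """
    needed = int(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * DETECT_WINDOW_S)
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= needed:
            break  # enough audio; stop consuming the stream
    if len(buffer) < needed:
        return None  # caller said too little to classify
    guess = identify(bytes(buffer[:needed]))
    return guess.code if guess.confidence >= MIN_CONFIDENCE else None
```

Returning `None` rather than a best guess keeps the "press 1 for English" menu available as a safety net for the calls where the identifier is unsure.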

2. Keeping total latency under 200ms

For an IVR to feel like a conversation rather than a form, the round-trip from "caller stops speaking" to "agent starts responding" has to stay below ~200ms. That budget gets eaten quickly:

  • STT streaming partial: 50-80ms
  • LLM/dialog logic: 40-80ms
  • TTS first byte: 30-60ms
  • Network: 20-40ms

Summed, the stages come to roughly 140-260ms, so the worst case already overshoots the budget. Cutting any of these in half is worth more than adding a feature elsewhere.
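
To make the budget arithmetic concrete, here is a small sketch that totals the stage ranges listed above and shows what halving one stage's worst case does to the total. The figures are copied from the list; the function names are just for illustration.

```python
BUDGET_MS = 200

# (best case, worst case) in milliseconds, taken from the list above
STAGES_MS = {
    "STT streaming partial": (50, 80),
    "LLM/dialog logic": (40, 80),
    "TTS first byte": (30, 60),
    "Network": (20, 40),
}


def best_case(stages):
    """Total latency if every stage hits the low end of its range."""
    return sum(low for low, _ in stages.values())


def worst_case(stages):
    """Total latency if every stage hits the high end of its range."""
    return sum(high for _, high in stages.values())


def worst_case_with_stage_halved(stages, name):
    """Worst-case total after halving one stage's worst-case latency."""
    _, high = stages[name]
    return worst_case(stages) - high / 2


if __name__ == "__main__":
    print(f"best case:  {best_case(STAGES_MS)} ms")
    print(f"worst case: {worst_case(STAGES_MS)} ms (budget {BUDGET_MS} ms)")
    for name in STAGES_MS:
        halved = worst_case_with_stage_halved(STAGES_MS, name)
        print(f"halve {name}: worst case {halved:.0f} ms")
```

With these numbers the best case leaves 60ms of headroom and the worst case runs 60ms over. Halving any single stage still leaves the worst case at 220ms or more, so holding the budget on a bad call takes savings in more than one stage.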

3. Graceful failover when STT confidence drops

A noisy line or a rare dialect will push STT confidence down. A good IVR detects this and falls back to a clarifying question ("kya aap dobara bolenge?", Hindi for "will you say that again?"), to a touch-tone menu, or to a human agent, rather than silently misrouting the call.
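
One way to wire up this fallback is as a small escalation policy, sketched below. The post only lists the three fallbacks; treating them as a ladder (clarify first, then touch-tone, then a human) is an assumption of this example, as are the 0.6 confidence floor and all of the names.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"          # trust the transcript and route the call
    CLARIFY = "clarify"          # ask the caller to repeat
    DTMF_MENU = "dtmf_menu"      # switch to a touch-tone menu
    HUMAN_AGENT = "human_agent"  # hand over to a person


CONFIDENCE_FLOOR = 0.6  # hypothetical; calibrate per language and line type


def next_action(confidence: float, earlier_low_turns: int) -> Action:
    """Pick the next step for one caller turn.

    `earlier_low_turns` counts the consecutive low-confidence turns
    that came immediately before this one.
    """
    if confidence >= CONFIDENCE_FLOOR:
        return Action.PROCEED
    if earlier_low_turns == 0:
        return Action.CLARIFY      # first miss: ask again
    if earlier_low_turns == 1:
        return Action.DTMF_MENU    # second miss: stop relying on speech
    return Action.HUMAN_AGENT      # still stuck: escalate
```

The caller of `next_action` would reset the counter after any `PROCEED`, so one bad turn in the middle of a long call does not push the caller straight to the menu.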

These three patterns are what most of our IVR deployments share. If you're designing one, we'd love to help.