Architecture Deep-Dive · 12 min read · Feb 2026

Building a Voice AI Agent: From Speech Recognition to Production Calls

Voice AI isn't a chatbot with a microphone. Inside the latency budgets, interruption handling, accent robustness, and the telephony stack behind 200+ real business calls daily.

Voice AI · Speech Recognition · Twilio · Vapi · OpenAI Realtime · Telephony

Dhruv Tomar

AI Solutions Architect

Tech Stack

Vapi · OpenAI Realtime · Twilio · Plivo · n8n · Zoho CRM

Architecture

Phone call -> Twilio/Plivo SIP -> Vapi Voice Agent (WebSocket) -> OpenAI Realtime (sub-1s STT+LLM+TTS) -> Tool calls (CRM lookup, appointment booking, FAQ retrieval) -> Response audio stream back to caller. Fallback: confidence < 0.7 -> human handoff with context.
200+ calls/day
Sub-1s latency
85% resolution without human
7 department coverage

Text chatbots are forgiving — users wait 2-3 seconds for a response without noticing. Voice AI has a 500ms budget before silence feels awkward. That constraint changes everything.

The Latency Budget: A natural conversation has ~200ms gaps between turns. Your voice AI pipeline: speech-to-text (100ms) + LLM inference (200-400ms) + text-to-speech (100ms) + network (50ms). Total: 450-650ms. Any slower and users start saying "hello? are you there?"
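The budget math above can be spelled out in a few lines. This is a minimal sketch of the arithmetic, not production code; the stage timings are the ones quoted in the paragraph.

```python
# Per-stage timings from the latency budget above (milliseconds).
# A single int is a fixed cost; a tuple is a (best, worst) range.
PIPELINE_MS = {
    "speech_to_text": 100,
    "llm_inference": (200, 400),
    "text_to_speech": 100,
    "network": 50,
}

def total_latency_ms(stages: dict) -> tuple[int, int]:
    """Sum best-case and worst-case latency across all pipeline stages."""
    best = worst = 0
    for t in stages.values():
        lo, hi = (t, t) if isinstance(t, int) else t
        best += lo
        worst += hi
    return best, worst

best, worst = total_latency_ms(PIPELINE_MS)
print(best, worst)  # 450 650
```

The worst case already blows past the 500ms comfort threshold, which is exactly why shaving the LLM stage matters more than anything else in the stack.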

OpenAI Realtime API: This is the breakthrough. Instead of separate STT -> LLM -> TTS calls, Realtime processes speech-to-speech in one pipeline. Latency drops from 2-3 seconds (traditional stack) to under 1 second. It also handles interruptions natively — if the user starts talking mid-response, the AI stops and listens.

The Telephony Layer: Twilio handles incoming/outgoing calls and SIP integration. Vapi sits between Twilio and the LLM — it manages the voice agent session, tool calling, and conversation state. Why Vapi? Because building a production voice pipeline from scratch takes 3 months. Vapi gets you there in a day.

Handling Real-World Audio: Construction site background noise. Indian English accents with Hindi code-switching. Speaker talking while driving. Bad phone connections dropping syllables. You need: noise cancellation preprocessing, accent-robust STT models, and confidence scoring on every transcription.

The Confidence Threshold: Every AI response gets a confidence score. Above 0.7: respond normally. Between 0.5-0.7: add a confirmation ("Just to confirm, you're asking about invoice #4523?"). Below 0.5: "Let me connect you with a team member who can help." The handoff includes full conversation transcript so the human doesn't re-ask everything.
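The threshold logic reduces to a small routing function. A minimal sketch — the function name, signature, and the `summary` parameter are illustrative, not part of any vendor API; only the thresholds and the three behaviors come from the description above.

```python
def route_response(confidence: float, reply: str, summary: str) -> str:
    """Route a turn by confidence score, per the 0.7 / 0.5 thresholds."""
    if confidence >= 0.7:
        return reply                                    # respond normally
    if confidence >= 0.5:
        return f"Just to confirm, {summary}?"           # confirmation loop
    # Below 0.5: hand off; the full transcript travels with the call
    # so the human doesn't re-ask everything.
    return "Let me connect you with a team member who can help."

print(route_response(0.6, "It's $120.", "you're asking about invoice #4523"))
```

In practice you'd attach the conversation transcript to the handoff payload rather than return a bare string, but the branching is the whole idea.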

Tool Integration: During a call, the voice AI can: look up customer records in CRM, check appointment availability, create support tickets, send follow-up emails, and transfer to specific departments. Each tool call happens in the background while the AI fills with natural speech ("Let me check that for you...").
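The "fill with natural speech while the tool runs" trick is just concurrency. A hedged sketch using asyncio — `crm_lookup` and `speak` are illustrative stand-ins (here a simulated delay and a print), not Vapi's or any CRM's actual API:

```python
import asyncio

async def crm_lookup(customer_id: str) -> dict:
    """Stand-in for a real CRM call; the sleep simulates network latency."""
    await asyncio.sleep(0.3)
    return {"id": customer_id, "plan": "pro"}

async def speak(text: str) -> None:
    """Stand-in for streaming TTS back to the caller."""
    print(f"[tts] {text}")

async def answer_with_lookup(customer_id: str) -> dict:
    # Kick off the tool call, then speak filler so the line never goes silent.
    lookup = asyncio.create_task(crm_lookup(customer_id))
    await speak("Let me check that for you...")
    record = await lookup
    await speak(f"Found it -- you're on the {record['plan']} plan.")
    return record

record = asyncio.run(answer_with_lookup("cust_42"))
```

The filler sentence takes about as long to say as a typical API round-trip, so the caller never perceives the lookup at all.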

What Goes Wrong in Production:

1. Users speak over the AI — need barge-in detection
2. Long silence after the AI asks a question — need a silence timeout with a gentle re-prompt
3. Call drops mid-conversation — need state recovery
4. Accented speech misunderstood — need verification loops for critical data (phone numbers, names, amounts)
5. AI gets stuck in a loop — need max-turn limits and escalation paths
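Failure modes 2 and 5 share a shape: a per-turn guard that decides whether to continue, re-prompt, or escalate. A minimal sketch — the limits (20 turns, 6 seconds) and every name here are assumptions for illustration, not values from production:

```python
# Assumed guardrail limits, for illustration only.
MAX_TURNS = 20
SILENCE_TIMEOUT_S = 6.0

def next_action(turn: int, silence_s: float, reprompted: bool) -> str:
    """Decide the next move after each turn: continue, re-prompt, or escalate."""
    if turn >= MAX_TURNS:
        return "escalate_to_human"          # loop guard: cap total turns
    if silence_s > SILENCE_TIMEOUT_S:
        # One gentle re-prompt; if the caller stays silent, hand off.
        return "escalate_to_human" if reprompted else "gentle_reprompt"
    return "continue"
```

Barge-in detection and state recovery live lower in the stack (the audio pipeline and session store), but this turn-level guard is what keeps a stuck agent from talking to dead air for ten minutes.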

The Business Model: Voice AI agent setup: $3,000-8,000. Monthly management: $500-1,500. For a business handling 200+ calls/day with 5 support reps ($2,000/month each), replacing 3 reps with AI saves $6,000/month. The ROI case sells itself.
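The arithmetic behind the $6,000/month figure, plus a payback estimate on the setup fee (the payback calculation is my addition, derived from the numbers above, not a claim from the original):

```python
reps_replaced = 3
rep_cost = 2_000                # per rep, per month, from the post
setup_cost = (3_000, 8_000)     # one-time setup range from the post

monthly_savings = reps_replaced * rep_cost
payback_months = tuple(round(c / monthly_savings, 2) for c in setup_cost)
print(monthly_savings, payback_months)  # 6000 (0.5, 1.33)
```

Even at the top of the setup range, the project pays for itself in well under two months, before counting the monthly management fee against the savings.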

Running Angelina in production for Onsite — handling real customer calls for a construction SaaS with 7 department coverage — has been the most complex and rewarding AI project I've built.

Want to build something like this?

I architect and deploy end-to-end AI systems — from MVP to revenue.

Let's Talk