Architecture Deep-Dive · 12 min read · Feb 2026

Building a Voice AI Agent: From Speech Recognition to Production Calls

Voice AI isn't a chatbot with a microphone. Inside the latency budgets, interruption handling, accent robustness, and the telephony stack behind 200+ real business calls daily.

Voice AI · Speech Recognition · Twilio · Vapi · OpenAI Realtime · Telephony

Dhruv Tomar

AI Solutions Architect

Tech Stack

Vapi · OpenAI Realtime · Twilio · Plivo · n8n · Zoho CRM

Architecture

Phone call -> Twilio/Plivo SIP -> Vapi Voice Agent (WebSocket) -> OpenAI Realtime (sub-1s STT+LLM+TTS) -> Tool calls (CRM lookup, appointment booking, FAQ retrieval) -> Response audio stream back to caller. Fallback: confidence < 0.7 -> human handoff with context.
200+ calls/day
Sub-1s latency
85% resolution without human
7 department coverage

Text chatbots are forgiving — users wait 2-3 seconds for a response without noticing. Voice AI has a 500ms budget before silence feels awkward. That constraint changes everything.

The Latency Budget: A natural conversation has ~200ms gaps between turns. Your voice AI pipeline: speech-to-text (100ms) + LLM inference (200-400ms) + text-to-speech (100ms) + network (50ms). Total: 450-650ms. Any slower and users start saying "hello? are you there?"
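The budget math above can be spelled out in a few lines. This is a minimal sketch of the arithmetic, not production code; the stage timings are the ones quoted in the paragraph.

```python
# Per-stage timings from the latency budget above (milliseconds).
# A single int is a fixed cost; a tuple is a (best, worst) range.
PIPELINE_MS = {
    "speech_to_text": 100,
    "llm_inference": (200, 400),
    "text_to_speech": 100,
    "network": 50,
}

def total_latency_ms(stages: dict) -> tuple[int, int]:
    """Sum best-case and worst-case latency across all pipeline stages."""
    best = worst = 0
    for t in stages.values():
        lo, hi = (t, t) if isinstance(t, int) else t
        best += lo
        worst += hi
    return best, worst

best, worst = total_latency_ms(PIPELINE_MS)
print(best, worst)  # 450 650
```

The worst case already blows past the 500ms comfort threshold, which is exactly why shaving the LLM stage matters more than anything else in the stack.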

OpenAI Realtime API: This is the breakthrough. Instead of separate STT -> LLM -> TTS calls, Realtime processes speech-to-speech in one pipeline. Latency drops from 2-3 seconds (traditional stack) to under 1 second. It also handles interruptions natively — if the user starts talking mid-response, the AI stops and listens.

The Telephony Layer: Twilio handles incoming/outgoing calls and SIP integration. Vapi sits between Twilio and the LLM — it manages the voice agent session, tool calling, and conversation state. Why Vapi? Because building a production voice pipeline from scratch takes 3 months. Vapi gets you there in a day.

Handling Real-World Audio: Construction site background noise. Indian English accents with Hindi code-switching. Speaker talking while driving. Bad phone connections dropping syllables. You need: noise cancellation preprocessing, accent-robust STT models, and confidence scoring on every transcription.

The Confidence Threshold: Every AI response gets a confidence score. Above 0.7: respond normally. Between 0.5-0.7: add a confirmation ("Just to confirm, you're asking about invoice #4523?"). Below 0.5: "Let me connect you with a team member who can help." The handoff includes full conversation transcript so the human doesn't re-ask everything.
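The threshold logic reduces to a small routing function. A minimal sketch — the function name, signature, and the `summary` parameter are illustrative, not part of any vendor API; only the thresholds and the three behaviors come from the description above.

```python
def route_response(confidence: float, reply: str, summary: str) -> str:
    """Route a turn by confidence score, per the 0.7 / 0.5 thresholds."""
    if confidence >= 0.7:
        return reply                                    # respond normally
    if confidence >= 0.5:
        return f"Just to confirm, {summary}?"           # confirmation loop
    # Below 0.5: hand off; the full transcript travels with the call
    # so the human doesn't re-ask everything.
    return "Let me connect you with a team member who can help."

print(route_response(0.6, "It's $120.", "you're asking about invoice #4523"))
```

In practice you'd attach the conversation transcript to the handoff payload rather than return a bare string, but the branching is the whole idea.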

Tool Integration: During a call, the voice AI can: look up customer records in CRM, check appointment availability, create support tickets, send follow-up emails, and transfer to specific departments. Each tool call happens in the background while the AI fills with natural speech ("Let me check that for you...").
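The "fill with natural speech while the tool runs" trick is just concurrency. A hedged sketch using asyncio — `crm_lookup` and `speak` are illustrative stand-ins (here a simulated delay and a print), not Vapi's or any CRM's actual API:

```python
import asyncio

async def crm_lookup(customer_id: str) -> dict:
    """Stand-in for a real CRM call; the sleep simulates network latency."""
    await asyncio.sleep(0.3)
    return {"id": customer_id, "plan": "pro"}

async def speak(text: str) -> None:
    """Stand-in for streaming TTS back to the caller."""
    print(f"[tts] {text}")

async def answer_with_lookup(customer_id: str) -> dict:
    # Kick off the tool call, then speak filler so the line never goes silent.
    lookup = asyncio.create_task(crm_lookup(customer_id))
    await speak("Let me check that for you...")
    record = await lookup
    await speak(f"Found it -- you're on the {record['plan']} plan.")
    return record

record = asyncio.run(answer_with_lookup("cust_42"))
```

The filler sentence takes about as long to say as a typical API round-trip, so the caller never perceives the lookup at all.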

What Goes Wrong in Production:

1. Users speak over the AI — need barge-in detection
2. Long silence after the AI asks a question — need a silence timeout with a gentle re-prompt
3. Call drops mid-conversation — need state recovery
4. Accented speech misunderstood — need verification loops for critical data (phone numbers, names, amounts)
5. AI gets stuck in a loop — need max-turn limits and escalation paths
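Failure modes 2 and 5 share a shape: a per-turn guard that decides whether to continue, re-prompt, or escalate. A minimal sketch — the limits (20 turns, 6 seconds) and every name here are assumptions for illustration, not values from production:

```python
# Assumed guardrail limits, for illustration only.
MAX_TURNS = 20
SILENCE_TIMEOUT_S = 6.0

def next_action(turn: int, silence_s: float, reprompted: bool) -> str:
    """Decide the next move after each turn: continue, re-prompt, or escalate."""
    if turn >= MAX_TURNS:
        return "escalate_to_human"          # loop guard: cap total turns
    if silence_s > SILENCE_TIMEOUT_S:
        # One gentle re-prompt; if the caller stays silent, hand off.
        return "escalate_to_human" if reprompted else "gentle_reprompt"
    return "continue"
```

Barge-in detection and state recovery live lower in the stack (the audio pipeline and session store), but this turn-level guard is what keeps a stuck agent from talking to dead air for ten minutes.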

The Business Model: Voice AI agent setup: $3,000-8,000. Monthly management: $500-1,500. For a business handling 200+ calls/day with 5 support reps ($2,000/month each), replacing 3 reps with AI saves $6,000/month. The ROI case sells itself.
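The arithmetic behind the $6,000/month figure, plus a payback estimate on the setup fee (the payback calculation is my addition, derived from the numbers above, not a claim from the original):

```python
reps_replaced = 3
rep_cost = 2_000                # per rep, per month, from the post
setup_cost = (3_000, 8_000)     # one-time setup range from the post

monthly_savings = reps_replaced * rep_cost
payback_months = tuple(round(c / monthly_savings, 2) for c in setup_cost)
print(monthly_savings, payback_months)  # 6000 (0.5, 1.33)
```

Even at the top of the setup range, the project pays for itself in well under two months, before counting the monthly management fee against the savings.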

Running Angelina in production for Onsite — handling real customer calls for a construction SaaS with 7 department coverage — has been the most complex and rewarding AI project I've built.

Want to build something like this?

I architect and deploy end-to-end AI systems — from MVP to revenue.

Let's Talk