Case Study

Bilingual Voice Agent on LiveKit

LiveKitPythonFastAPIWhisperElevenLabsDeepgramVAD

The Challenge

The client needed a production voice assistant that felt natural in both English and Urdu — two languages with very different phonetic patterns, speech rhythms, and user expectations. Off-the-shelf voice pipelines either lacked Urdu support, had unacceptable latency, or produced robotic TTS output that broke trust with end users.

The Solution

  • LiveKit Agents Framework: Built the entire voice pipeline on LiveKit Agents, handling real-time audio streaming, room management, and agent dispatch. This eliminated the need for custom WebRTC infrastructure.
  • STT Pipeline: Used Whisper large-v3 for transcription, fine-tuned on domain-specific vocabulary. Deepgram served as a low-latency fallback for English-only flows where speed was prioritized over accuracy.
  • VAD-Based Turn Detection: Implemented Silero VAD for natural end-of-turn detection, eliminating the awkward fixed-timeout pauses of most voice bots. The agent only responds when the user has genuinely finished speaking.
  • TTS Output: ElevenLabs for English (natural, expressive voice) with a custom Urdu TTS pipeline for the bilingual requirement. Both voices were configured to match the client's brand tone.
  • LLM Orchestration: FastAPI backend orchestrates the STT → LLM → TTS pipeline with async processing to minimize compounding latency across each stage.

Key Results

  • Sub-200ms end-to-end voice response latency (STT + LLM inference + TTS) in production.
  • Natural bilingual conversation in English and Urdu with consistent persona and tone.
  • VAD-based turn detection eliminated the robotic pause patterns common in fixed-timeout systems.
  • Deployed on LiveKit Cloud with automatic room scaling for concurrent sessions.
  • Zero custom WebRTC infrastructure required — all media handling via LiveKit Agents SDK.

Project Details

CategoryVoice AI
RoleLead Developer