Prerequisites
Before building your AI voice agent, you’ll need:

| Requirement | Description |
|---|---|
| Plivo Account | Sign up and get your Auth ID and Auth Token |
| Phone Number | Purchase a voice-enabled number to receive/make calls. India numbers require KYC verification; see Rent India Numbers. |
| WebSocket Server | A publicly accessible server to handle audio streams (use ngrok for development) |
| AI Service Credentials | API keys for your chosen providers: Speech-to-Text (STT) such as Deepgram, Google Speech, or AWS Transcribe; an LLM such as OpenAI, Anthropic, or Google Gemini; Text-to-Speech (TTS) such as ElevenLabs, Google TTS, or Amazon Polly |
Voice API Basics
Audio Streaming builds on Plivo’s Voice API. The core workflow is:

- Make or receive a call using the Call API
- Control the call using Plivo XML responses
- Stream audio using the <Stream> XML element
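As a sketch of how these pieces fit together, the answer URL for your Plivo application can return a `<Stream>` XML response like the one built below. The WebSocket URL is a placeholder, and attribute names follow the `<Stream>` element described later in this guide; confirm them against the XML reference:

```python
# Build the Plivo XML returned from an answer URL to start a bidirectional
# audio stream. The wss:// URL is a placeholder for your own server.

def stream_answer_xml(ws_url: str) -> str:
    """Return Plivo XML that streams call audio to ws_url in both directions."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        '<Stream bidirectional="true" keepCallAlive="true" '
        'contentType="audio/x-mulaw;rate=8000">'
        f"{ws_url}"
        "</Stream>"
        "</Response>"
    )

print(stream_answer_xml("wss://example.com/media"))
```

Serving this from a small web framework (Flask, Express, etc.) as the application’s answer URL is all Plivo needs to begin streaming.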
What is Audio Streaming?
Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:

- AI Voice Assistants - Natural conversations with speech recognition and synthesis
- Real-time Transcription - Live call transcription for analytics
- Voice Bots - Automated IVR systems with intelligent responses
- Sentiment Analysis - Real-time audio analysis during calls
How It Works
- Caller dials your Plivo number (or you make an outbound call)
- Plivo connects to your WebSocket endpoint and starts streaming audio
- Your app sends audio to STT for transcription
- Transcribed text goes to LLM for response generation
- LLM response is converted to speech via TTS
- Audio is sent back through WebSocket to the caller
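The six steps above can be sketched as a single message handler. Here `transcribe()`, `generate_reply()`, and `synthesize()` are hypothetical stand-ins for your STT, LLM, and TTS providers, and the `media`/`playAudio` message shapes mirror Plivo’s streaming protocol but should be checked against the Audio Streaming guide:

```python
import base64
import json

def transcribe(audio: bytes) -> str:
    return "hello"  # stub: call your STT provider here

def generate_reply(text: str) -> str:
    return "You said: " + text  # stub: call your LLM here

def synthesize(text: str) -> bytes:
    return text.encode()  # stub: call your TTS provider here

def handle_message(raw: str, send) -> None:
    """Process one WebSocket frame from Plivo and answer with agent audio."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return  # ignore start/stop and other non-media events
    audio_in = base64.b64decode(msg["media"]["payload"])  # step 2: caller audio
    text = transcribe(audio_in)                           # step 3: STT
    if not text:
        return
    reply = generate_reply(text)                          # step 4: LLM
    audio_out = synthesize(reply)                         # step 5: TTS
    send(json.dumps({                                     # step 6: back to caller
        "event": "playAudio",
        "media": {
            "contentType": "audio/x-mulaw",
            "sampleRate": 8000,
            "payload": base64.b64encode(audio_out).decode(),
        },
    }))
```

In production each stage would stream incrementally rather than run request/response, but the data flow is the same.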
Stream Directions
Inbound Stream (Unidirectional)
Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).

Bidirectional Stream
Audio flows in both directions: from the caller to your server AND from your server back to the caller. Use this for AI voice agents that need to respond.

For AI voice agents, always use bidirectional="true" and keepCallAlive="true" to maintain the call while your agent processes and responds.

Supported Audio Formats
Choose the audio codec and sample rate based on your use case:

| Content Type | Codec | Sample Rate | Description | Use Case |
|---|---|---|---|---|
| audio/x-mulaw;rate=8000 | μ-law (PCMU) | 8 kHz | Compressed 8-bit audio | Recommended for Voice AI. Native telephony format with lowest latency and best compatibility. |
| audio/x-l16;rate=8000 | Linear PCM | 8 kHz | Uncompressed 16-bit audio | Higher quality audio when bandwidth is not a concern. |
| audio/x-l16;rate=16000 | Linear PCM | 16 kHz | Uncompressed 16-bit audio | High-fidelity speech recognition requiring wideband audio. |
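One reason μ-law adds near-zero codec overhead: decoding a G.711 μ-law byte to 16-bit linear PCM is just a few integer operations per sample. A pure-Python illustration (use an audio library in production):

```python
def ulaw_to_linear(u: int) -> int:
    """Decode one G.711 mu-law byte to a 16-bit signed PCM sample."""
    u = ~u & 0xFF                # mu-law bytes are stored inverted
    sign = u & 0x80              # top bit: sample polarity
    exponent = (u & 0x70) >> 4   # 3-bit segment (shift amount)
    mantissa = u & 0x0F          # 4-bit step within the segment
    magnitude = ((mantissa << 3) + 0x84) << exponent  # re-add the 0x84 bias
    return (0x84 - magnitude) if sign else (magnitude - 0x84)

# 0xFF is digital silence; 0x80/0x00 are the positive/negative extremes.
print(ulaw_to_linear(0xFF), ulaw_to_linear(0x80), ulaw_to_linear(0x00))
# → 0 32124 -32124
```

Linear PCM (audio/x-l16) skips even this step but doubles the payload size, which is the bandwidth trade-off the table describes.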
Latency Considerations
For responsive voice AI, understanding and minimizing latency is critical.

Latency Sources
| Component | Description | Target |
|---|---|---|
| Codec Processing | Audio encoding/decoding overhead | μ-law has near-zero overhead (native format) |
| Network (WebSocket) | Round-trip time between Plivo and your server | < 100ms (deploy server near caller regions) |
| Speech-to-Text | Time to transcribe audio to text | < 200ms |
| LLM Processing | Time for AI to generate response | < 500ms |
| Text-to-Speech | Time to convert text to audio | < 200ms |
| Total | End-to-end response time | < 1 second |
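A simple way to keep yourself honest about the budget above is to time each stage and compare the sum against the 1-second target. The stage names and budget come from the table; the instrumentation approach is just one illustrative option:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(name: str, timings: dict):
    """Record the wall-clock duration of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    yield
    timings[name] = (time.perf_counter() - start) * 1000

timings = {}
with stage("stt", timings):
    pass  # call your STT provider here
with stage("llm", timings):
    pass  # call your LLM here
with stage("tts", timings):
    pass  # call your TTS provider here

total_ms = sum(timings.values())
if total_ms > 1000:  # the end-to-end target from the table above
    print(f"over budget: {total_ms:.0f} ms", timings)
```

Logging these per-call makes it easy to see which stage is eating the budget when responses feel sluggish.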
Codec Impact on Latency
| Codec | Latency Impact | Notes |
|---|---|---|
| audio/x-mulaw;rate=8000 | Lowest | No transcoding required; native telephony format |
| audio/x-l16;rate=8000 | Low | Minimal processing, but larger payload size |
| audio/x-l16;rate=16000 | Moderate | Larger payloads; use only if your STT model specifically benefits |
Best Practices for Low-Latency Voice AI
- Use μ-law 8kHz - Avoid unnecessary transcoding
- Co-locate your server - Deploy near your expected caller regions (e.g., US East for US traffic)
- Use streaming APIs - Choose STT/TTS providers with streaming support
- Implement interruption - Use clearAudio to stop playback when the user speaks
- Optimize LLM calls - Use streaming responses and appropriate model sizes
Plivo routes calls through edge locations closest to the caller. A caller in London connects to Plivo’s London edge, so position your WebSocket server near your expected caller locations.
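For the interruption point above, barge-in amounts to sending a clearAudio event on the WebSocket the moment your STT detects the caller speaking, so queued agent audio stops immediately. The event name comes from this guide; the streamId field here is an assumption, so check the exact message shape against the WebSocket protocol reference:

```python
import json

def make_clear_audio(stream_id: str) -> str:
    """Build a clearAudio message to flush any queued playback audio."""
    return json.dumps({"event": "clearAudio", "streamId": stream_id})

# usage (with your WebSocket connection): ws.send(make_clear_audio(current_stream_id))
```

Pair this with stopping your own TTS generation, otherwise the agent keeps queuing audio the caller will never hear.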
Basic Implementation
1. Configure Plivo to Stream Audio
Create an XML application that streams audio to your WebSocket.

2. Handle WebSocket Connection
Your server receives the WebSocket connection and processes events.

Next Steps
Audio Streaming Guide
Complete documentation: XML configuration, WebSocket protocol, APIs, callbacks, signature validation, and code examples
Best Practices
Troubleshooting tips and optimization recommendations
Plivo Stream SDK
Official SDKs for Python, Node.js, and Java with built-in audio handling
Pipecat Integration
Build with Pipecat framework for higher-level abstraction
Related
- Voice API Overview - Core voice platform concepts
- Voice API Reference - Complete API documentation
- XML Reference - All XML elements for call control