Build AI-powered voice agents that have natural conversations with callers using Plivo’s Audio Streaming. Stream live call audio to your AI services (STT, LLM, TTS) over a WebSocket and respond in real time.

Prerequisites

Before building your AI voice agent, you’ll need:
| Requirement | Description |
| --- | --- |
| Plivo Account | Sign up and get your Auth ID and Auth Token |
| Phone Number | Purchase a voice-enabled number to receive/make calls. India numbers require KYC verification; see Rent India Numbers. |
| WebSocket Server | A publicly accessible server to handle audio streams (use ngrok for development) |
| AI Service Credentials | API keys for your chosen providers: Speech-to-Text (STT) such as Deepgram, Google Speech, or AWS Transcribe; LLM such as OpenAI, Anthropic, or Google Gemini; Text-to-Speech (TTS) such as ElevenLabs, Google TTS, or Amazon Polly |

Voice API Basics

Audio Streaming builds on Plivo’s Voice API. The core workflow is:
  1. Make or receive a call using the Call API
  2. Control the call using Plivo XML responses
  3. Stream audio using the <Stream> XML element
For complete Voice API documentation, see Voice API Overview.
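For step 1, a call can be placed with the Plivo Python SDK. A minimal sketch; the phone numbers and answer_url below are placeholders for your own values:
import plivo

client = plivo.RestClient(auth_id="YOUR_AUTH_ID", auth_token="YOUR_AUTH_TOKEN")

# Start an outbound call; Plivo fetches Plivo XML from answer_url to decide
# what to do with the call (for example, the <Stream> XML shown later).
call = client.calls.create(
    from_="+14151234567",                          # your Plivo number
    to_="+14157654321",                            # the callee
    answer_url="https://your-domain.com/answer",   # must return Plivo XML
    answer_method="GET",
)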

What is Audio Streaming?

Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:
  • AI Voice Assistants - Natural conversations with speech recognition and synthesis
  • Real-time Transcription - Live call transcription for analytics
  • Voice Bots - Automated IVR systems with intelligent responses
  • Sentiment Analysis - Real-time audio analysis during calls

How It Works

       ┌─────────────────┐
       │     Caller      │
       │    (Phone)      │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │   Plivo Voice   │
       │      API        │
       └────────┬────────┘
                │ Audio Stream (WebSocket)
                ▼
       ┌─────────────────┐
       │   Your App      │
       │  (WebSocket)    │
       └────────┬────────┘
                │ API Calls
                ▼
       ┌─────────────────┐
       │  AI Services    │
       │  STT/LLM/TTS    │
       └─────────────────┘
Flow:
  1. Caller dials your Plivo number (or you make an outbound call)
  2. Plivo connects to your WebSocket endpoint and starts streaming audio
  3. Your app sends audio to STT for transcription
  4. Transcribed text goes to LLM for response generation
  5. LLM response is converted to speech via TTS
  6. Audio is sent back through WebSocket to the caller
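One way to keep this loop responsive is to decouple the stages with asyncio queues so that receiving audio, transcription, and synthesis do not block one another. A rough sketch, where speech_to_text, get_llm_response, and text_to_speech are placeholders for your provider integrations (the same names used in the implementation example later in this guide):
import asyncio

async def stt_stage(audio_q: asyncio.Queue, text_q: asyncio.Queue):
    # Steps 2-3: consume raw audio chunks from the WebSocket and transcribe them.
    while True:
        chunk = await audio_q.get()
        transcript = await speech_to_text(chunk)
        if transcript:
            await text_q.put(transcript)

async def reply_stage(text_q: asyncio.Queue, out_q: asyncio.Queue):
    # Steps 4-6: generate a reply, synthesize it, and queue the audio to be
    # sent back over the WebSocket as a playAudio message.
    while True:
        transcript = await text_q.get()
        reply = await get_llm_response(transcript)
        audio = await text_to_speech(reply)
        await out_q.put(audio)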

Stream Directions

Inbound Stream (Unidirectional)

Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).
<Stream bidirectional="false">
    wss://your-server.com/stream
</Stream>
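
A receive-only handler for this case just decodes the incoming media frames and forwards them; a minimal sketch, with speech_to_text standing in for your STT client:
import base64
import json

async def inbound_handler(websocket):
    # Unidirectional stream: only consume messages, never send audio back.
    async for message in websocket:
        event = json.loads(message)
        if event["event"] == "media":
            audio = base64.b64decode(event["media"]["payload"])  # raw μ-law bytes
            transcript = await speech_to_text(audio)             # store or analyze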

Bidirectional Stream

Audio flows in both directions: from the caller to your server and from your server back to the caller. Use this for AI voice agents that need to respond.
<Stream bidirectional="true" keepCallAlive="true">
    wss://your-server.com/stream
</Stream>
For AI voice agents, always use bidirectional="true" and keepCallAlive="true" to maintain the call while your agent processes and responds.

Supported Audio Formats

Choose the audio codec and sample rate based on your use case:
| Content Type | Codec | Sample Rate | Description | Use Case |
| --- | --- | --- | --- | --- |
| audio/x-mulaw;rate=8000 | μ-law (PCMU) | 8 kHz | Compressed 8-bit audio | Recommended for Voice AI. Native telephony format with lowest latency and best compatibility. |
| audio/x-l16;rate=8000 | Linear PCM | 8 kHz | Uncompressed 16-bit audio | Higher-quality audio when bandwidth is not a concern. |
| audio/x-l16;rate=16000 | Linear PCM | 16 kHz | Uncompressed 16-bit audio | High-fidelity speech recognition requiring wideband audio. |
Why μ-law 8kHz? It’s the native telephony codec, so no transcoding is required. This means lower latency, reduced bandwidth (50% smaller than Linear PCM), and universal compatibility with STT/TTS services.
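If one of your providers expects linear PCM instead of μ-law, the conversion is inexpensive. One option in Python is the standard-library audioop module (deprecated and removed in Python 3.13, so pin an older Python or use an equivalent library there):
import audioop  # in the standard library through Python 3.12

def mulaw_to_pcm16(mulaw_bytes: bytes) -> bytes:
    # Decode 8-bit μ-law frames from Plivo into 16-bit linear PCM for STT.
    return audioop.ulaw2lin(mulaw_bytes, 2)  # 2 = output sample width in bytes

def pcm16_to_mulaw(pcm_bytes: bytes) -> bytes:
    # Encode 16-bit linear PCM from TTS back to μ-law before sending playAudio.
    return audioop.lin2ulaw(pcm_bytes, 2)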

Latency Considerations

For responsive voice AI, understanding and minimizing latency is critical.

Latency Sources

| Component | Description | Target |
| --- | --- | --- |
| Codec Processing | Audio encoding/decoding overhead | μ-law has near-zero overhead (native format) |
| Network (WebSocket) | Round-trip time between Plivo and your server | < 100 ms (deploy your server near caller regions) |
| Speech-to-Text | Time to transcribe audio to text | < 200 ms |
| LLM Processing | Time for the AI to generate a response | < 500 ms |
| Text-to-Speech | Time to convert text to audio | < 200 ms |
| Total | End-to-end response time | < 1 second |
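
To check whether you stay inside this budget, time each stage per conversational turn. A simple sketch using the placeholder integrations from this guide:
import time

async def timed_turn(audio_bytes: bytes):
    # Measure per-stage latency for one turn; speech_to_text, get_llm_response
    # and text_to_speech are your STT/LLM/TTS integrations.
    t0 = time.perf_counter()
    transcript = await speech_to_text(audio_bytes)
    t1 = time.perf_counter()
    reply = await get_llm_response(transcript)
    t2 = time.perf_counter()
    audio = await text_to_speech(reply)
    t3 = time.perf_counter()
    print(f"STT {t1 - t0:.3f}s  LLM {t2 - t1:.3f}s  TTS {t3 - t2:.3f}s  total {t3 - t0:.3f}s")
    return audio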

Codec Impact on Latency

| Codec | Latency Impact | Notes |
| --- | --- | --- |
| audio/x-mulaw;rate=8000 | Lowest | No transcoding required; native telephony format |
| audio/x-l16;rate=8000 | Low | Minimal processing, but larger payload size |
| audio/x-l16;rate=16000 | Moderate | Larger payloads; use only if your STT model specifically benefits |
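
The payload-size difference is easy to quantify: μ-law uses 1 byte per sample and Linear PCM uses 2, so one-way bandwidth (excluding WebSocket framing) scales with sample width and rate:
# Approximate audio bytes per second for each supported format (mono).
formats = {
    "audio/x-mulaw;rate=8000": 8000 * 1,    #  8,000 B/s (~64 kbit/s)
    "audio/x-l16;rate=8000": 8000 * 2,      # 16,000 B/s (~128 kbit/s)
    "audio/x-l16;rate=16000": 16000 * 2,    # 32,000 B/s (~256 kbit/s)
}
for name, bytes_per_second in formats.items():
    print(f"{name}: {bytes_per_second} bytes/s")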

Best Practices for Low-Latency Voice AI

  1. Use μ-law 8kHz - Avoid unnecessary transcoding
  2. Co-locate your server - Deploy near your expected caller regions (e.g., US East for US traffic)
  3. Use streaming APIs - Choose STT/TTS providers with streaming support
  4. Implement interruption - Use clearAudio to stop playback when the user speaks (see the sketch below)
  5. Optimize LLM calls - Use streaming responses and appropriate model sizes
Plivo routes calls through edge locations closest to the caller. A caller in London connects to Plivo’s London edge, so position your WebSocket server near your expected caller locations.
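For interruption handling (item 4 above), the idea is to send a clearAudio message on the same WebSocket as soon as your STT detects the caller speaking during playback. A hedged sketch; the exact message shape (event name plus streamId) is an assumption to confirm against Plivo’s audio-streaming message reference:
import json

async def handle_barge_in(websocket, stream_id: str):
    # Ask Plivo to drop any buffered playback audio so the agent stops talking.
    # Payload shape is assumed; verify the field names in Plivo's stream docs.
    await websocket.send(json.dumps({
        "event": "clearAudio",
        "streamId": stream_id,
    }))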

Basic Implementation

1. Configure Plivo to Stream Audio

Create an XML application that streams audio to your WebSocket:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>
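
This is the XML your answer URL should return when Plivo requests it. A minimal sketch using Flask (the framework choice and /answer route are illustrative, not required):
from flask import Flask, Response

app = Flask(__name__)

ANSWER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>"""

@app.route("/answer", methods=["GET", "POST"])
def answer():
    # Plivo calls this URL when the call is answered and expects Plivo XML back.
    return Response(ANSWER_XML, mimetype="text/xml")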

2. Handle WebSocket Connection

Your server receives the WebSocket connection and processes events:
# Simplified example
import base64
import json

# speech_to_text, get_llm_response, and text_to_speech are placeholders for
# your STT, LLM, and TTS provider integrations.
async def handle_websocket(websocket):
    async for message in websocket:
        event = json.loads(message)

        if event["event"] == "start":
            # Stream started - initialize AI services
            stream_id = event["start"]["streamId"]

        elif event["event"] == "media":
            # Audio received - send to STT
            audio_bytes = base64.b64decode(event["media"]["payload"])
            transcript = await speech_to_text(audio_bytes)

            if transcript:
                # Get AI response
                response = await get_llm_response(transcript)

                # Convert to speech and send back to the caller
                audio = await text_to_speech(response)
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(audio).decode()
                    }
                }))
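
To run this handler during development, one option is the third-party websockets package (a recent version that accepts single-argument handlers), exposed publicly with ngrok; the port below is a placeholder:
import asyncio
import websockets

async def main():
    # Serve the handler above; expose it with e.g. `ngrok http 8765` and use
    # the resulting wss:// URL in your <Stream> element.
    async with websockets.serve(handle_websocket, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())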

Next Steps