Build AI-powered voice agents that have natural conversations with callers using Plivo’s Audio Streaming. Stream live call audio to your AI services (STT, LLM, TTS) over a WebSocket and respond in real time.

Prerequisites

Before building your AI voice agent, you’ll need:
| Requirement | Description |
| --- | --- |
| Plivo Account | Sign up and get your Auth ID and Auth Token |
| Phone Number | Purchase a voice-enabled number to receive/make calls. India numbers require KYC verification; see Rent India Numbers. |
| WebSocket Server | A publicly accessible server to handle audio streams (use ngrok for development) |
| AI Service Credentials | API keys for your chosen providers: Speech-to-Text (STT) such as Deepgram, Google Speech, or AWS Transcribe; LLM such as OpenAI, Anthropic, or Google Gemini; Text-to-Speech (TTS) such as ElevenLabs, Google TTS, or Amazon Polly |

Voice API Basics

Audio Streaming builds on Plivo’s Voice API. The core workflow is:
  1. Make or receive a call using the Call API
  2. Control the call using Plivo XML responses
  3. Stream audio using the <Stream> XML element
For complete Voice API documentation, see Voice API Overview.
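For step 1, a call can be placed with the Plivo Python SDK. A minimal sketch; the phone numbers and answer_url below are placeholders for your own values:
import plivo

client = plivo.RestClient(auth_id="YOUR_AUTH_ID", auth_token="YOUR_AUTH_TOKEN")

# Start an outbound call; Plivo fetches Plivo XML from answer_url to decide
# what to do with the call (for example, the <Stream> XML shown later).
call = client.calls.create(
    from_="+14151234567",                          # your Plivo number
    to_="+14157654321",                            # the callee
    answer_url="https://your-domain.com/answer",   # must return Plivo XML
    answer_method="GET",
)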

What is Audio Streaming?

Audio Streaming gives you access to the raw audio of voice calls in real time via WebSockets. This enables:
  • AI Voice Assistants - Natural conversations with speech recognition and synthesis
  • Real-time Transcription - Live call transcription for analytics
  • Voice Bots - Automated IVR systems with intelligent responses
  • Sentiment Analysis - Real-time audio analysis during calls

How It Works

       ┌─────────────────┐
       │     Caller      │
       │    (Phone)      │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │   Plivo Voice   │
       │      API        │
       └────────┬────────┘
                │ Audio Stream (WebSocket)
                ▼
       ┌─────────────────┐
       │   Your App      │
       │  (WebSocket)    │
       └────────┬────────┘
                │ API Calls
                ▼
       ┌─────────────────┐
       │  AI Services    │
       │  STT/LLM/TTS    │
       └─────────────────┘
Flow:
  1. Caller dials your Plivo number (or you make an outbound call)
  2. Plivo connects to your WebSocket endpoint and starts streaming audio
  3. Your app sends audio to STT for transcription
  4. Transcribed text goes to LLM for response generation
  5. LLM response is converted to speech via TTS
  6. Audio is sent back through WebSocket to the caller
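One way to keep this loop responsive is to decouple the stages with asyncio queues so that receiving audio, transcription, and synthesis do not block one another. A rough sketch, where speech_to_text, get_llm_response, and text_to_speech are placeholders for your provider integrations (the same names used in the implementation example later in this guide):
import asyncio

async def stt_stage(audio_q: asyncio.Queue, text_q: asyncio.Queue):
    # Steps 2-3: consume raw audio chunks from the WebSocket and transcribe them.
    while True:
        chunk = await audio_q.get()
        transcript = await speech_to_text(chunk)
        if transcript:
            await text_q.put(transcript)

async def reply_stage(text_q: asyncio.Queue, out_q: asyncio.Queue):
    # Steps 4-6: generate a reply, synthesize it, and queue the audio to be
    # sent back over the WebSocket as a playAudio message.
    while True:
        transcript = await text_q.get()
        reply = await get_llm_response(transcript)
        audio = await text_to_speech(reply)
        await out_q.put(audio)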

Stream Directions

Inbound Stream (Unidirectional)

Audio flows from the caller to your server. Use this when you only need to receive audio (e.g., transcription, call analytics).
<Stream bidirectional="false">
    wss://your-server.com/stream
</Stream>
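
A receive-only handler for this case just decodes the incoming media frames and forwards them; a minimal sketch, with speech_to_text standing in for your STT client:
import base64
import json

async def inbound_handler(websocket):
    # Unidirectional stream: only consume messages, never send audio back.
    async for message in websocket:
        event = json.loads(message)
        if event["event"] == "media":
            audio = base64.b64decode(event["media"]["payload"])  # raw μ-law bytes
            transcript = await speech_to_text(audio)             # store or analyze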

Bidirectional Stream

Audio flows in both directions: from the caller to your server and from your server back to the caller. Use this for AI voice agents that need to respond.
<Stream bidirectional="true" keepCallAlive="true">
    wss://your-server.com/stream
</Stream>
For AI voice agents, always use bidirectional="true" and keepCallAlive="true" to maintain the call while your agent processes and responds.

Supported Audio Formats

Choose the audio codec and sample rate based on your use case:
| Content Type | Codec | Sample Rate | Description | Use Case |
| --- | --- | --- | --- | --- |
| audio/x-mulaw;rate=8000 | μ-law (PCMU) | 8 kHz | Compressed 8-bit audio | Recommended for Voice AI. Native telephony format with lowest latency and best compatibility. |
| audio/x-l16;rate=8000 | Linear PCM | 8 kHz | Uncompressed 16-bit audio | Higher-quality audio when bandwidth is not a concern. |
| audio/x-l16;rate=16000 | Linear PCM | 16 kHz | Uncompressed 16-bit audio | High-fidelity speech recognition requiring wideband audio. |
Why μ-law 8kHz? It’s the native telephony codec, so no transcoding is required. This means lower latency, reduced bandwidth (50% smaller than Linear PCM), and universal compatibility with STT/TTS services.
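If one of your providers expects linear PCM instead of μ-law, the conversion is inexpensive. One option in Python is the standard-library audioop module (deprecated and removed in Python 3.13, so pin an older Python or use an equivalent library there):
import audioop  # in the standard library through Python 3.12

def mulaw_to_pcm16(mulaw_bytes: bytes) -> bytes:
    # Decode 8-bit μ-law frames from Plivo into 16-bit linear PCM for STT.
    return audioop.ulaw2lin(mulaw_bytes, 2)  # 2 = output sample width in bytes

def pcm16_to_mulaw(pcm_bytes: bytes) -> bytes:
    # Encode 16-bit linear PCM from TTS back to μ-law before sending playAudio.
    return audioop.lin2ulaw(pcm_bytes, 2)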

Latency Considerations

For responsive voice AI, understanding and minimizing latency is critical.

Latency Sources

| Component | Description | Target |
| --- | --- | --- |
| Codec Processing | Audio encoding/decoding overhead | μ-law has near-zero overhead (native format) |
| Network (WebSocket) | Round-trip time between Plivo and your server | < 100 ms (deploy your server near caller regions) |
| Speech-to-Text | Time to transcribe audio to text | < 200 ms |
| LLM Processing | Time for the AI to generate a response | < 500 ms |
| Text-to-Speech | Time to convert text to audio | < 200 ms |
| Total | End-to-end response time | < 1 second |
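
To check whether you stay inside this budget, time each stage per conversational turn. A simple sketch using the placeholder integrations from this guide:
import time

async def timed_turn(audio_bytes: bytes):
    # Measure per-stage latency for one turn; speech_to_text, get_llm_response
    # and text_to_speech are your STT/LLM/TTS integrations.
    t0 = time.perf_counter()
    transcript = await speech_to_text(audio_bytes)
    t1 = time.perf_counter()
    reply = await get_llm_response(transcript)
    t2 = time.perf_counter()
    audio = await text_to_speech(reply)
    t3 = time.perf_counter()
    print(f"STT {t1 - t0:.3f}s  LLM {t2 - t1:.3f}s  TTS {t3 - t2:.3f}s  total {t3 - t0:.3f}s")
    return audio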

Codec Impact on Latency

| Codec | Latency Impact | Notes |
| --- | --- | --- |
| audio/x-mulaw;rate=8000 | Lowest | No transcoding required; native telephony format |
| audio/x-l16;rate=8000 | Low | Minimal processing, but larger payload size |
| audio/x-l16;rate=16000 | Moderate | Larger payloads; use only if your STT model specifically benefits |
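
The payload-size difference is easy to quantify: μ-law uses 1 byte per sample and Linear PCM uses 2, so one-way bandwidth (excluding WebSocket framing) scales with sample width and rate:
# Approximate audio bytes per second for each supported format (mono).
formats = {
    "audio/x-mulaw;rate=8000": 8000 * 1,    #  8,000 B/s (~64 kbit/s)
    "audio/x-l16;rate=8000": 8000 * 2,      # 16,000 B/s (~128 kbit/s)
    "audio/x-l16;rate=16000": 16000 * 2,    # 32,000 B/s (~256 kbit/s)
}
for name, bytes_per_second in formats.items():
    print(f"{name}: {bytes_per_second} bytes/s")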

Best Practices for Low-Latency Voice AI

  1. Use μ-law 8kHz - Avoid unnecessary transcoding
  2. Co-locate your server - Deploy near your expected caller regions (e.g., US East for US traffic)
  3. Use streaming APIs - Choose STT/TTS providers with streaming support
  4. Implement interruption - Use clearAudio to stop playback when the user speaks (see the sketch below)
  5. Optimize LLM calls - Use streaming responses and appropriate model sizes
Plivo routes calls through edge locations closest to the caller. A caller in London connects to Plivo’s London edge, so position your WebSocket server near your expected caller locations.
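For interruption handling (item 4 above), the idea is to send a clearAudio message on the same WebSocket as soon as your STT detects the caller speaking during playback. A hedged sketch; the exact message shape (event name plus streamId) is an assumption to confirm against Plivo’s audio-streaming message reference:
import json

async def handle_barge_in(websocket, stream_id: str):
    # Ask Plivo to drop any buffered playback audio so the agent stops talking.
    # Payload shape is assumed; verify the field names in Plivo's stream docs.
    await websocket.send(json.dumps({
        "event": "clearAudio",
        "streamId": stream_id,
    }))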

Basic Implementation

1. Configure Plivo to Stream Audio

Create an XML application that streams audio to your WebSocket:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>
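
This is the XML your answer URL should return when Plivo requests it. A minimal sketch using Flask (the framework choice and /answer route are illustrative, not required):
from flask import Flask, Response

app = Flask(__name__)

ANSWER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Connected to AI Assistant.</Speak>
    <Stream
        keepCallAlive="true"
        bidirectional="true"
        contentType="audio/x-mulaw;rate=8000"
        statusCallbackUrl="https://your-domain.com/stream-status">
        wss://your-domain.com/stream
    </Stream>
</Response>"""

@app.route("/answer", methods=["GET", "POST"])
def answer():
    # Plivo calls this URL when the call is answered and expects Plivo XML back.
    return Response(ANSWER_XML, mimetype="text/xml")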

2. Handle WebSocket Connection

Your server receives the WebSocket connection and processes events:
# Simplified example
import base64
import json

# speech_to_text, get_llm_response, and text_to_speech are placeholders for
# your STT, LLM, and TTS provider integrations.
async def handle_websocket(websocket):
    async for message in websocket:
        event = json.loads(message)

        if event["event"] == "start":
            # Stream started - initialize AI services
            stream_id = event["start"]["streamId"]

        elif event["event"] == "media":
            # Audio received - send to STT
            audio_bytes = base64.b64decode(event["media"]["payload"])
            transcript = await speech_to_text(audio_bytes)

            if transcript:
                # Get AI response
                response = await get_llm_response(transcript)

                # Convert to speech and send back to the caller
                audio = await text_to_speech(response)
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(audio).decode()
                    }
                }))
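
To run this handler during development, one option is the third-party websockets package (a recent version that accepts single-argument handlers), exposed publicly with ngrok; the port below is a placeholder:
import asyncio
import websockets

async def main():
    # Serve the handler above; expose it with e.g. `ngrok http 8765` and use
    # the resulting wss:// URL in your <Stream> element.
    async with websockets.serve(handle_websocket, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())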

Next Steps