Building a Completely Local Voice AI Agent: LiveKit, VAPI, and the Voice-Native Revolution
Voice AI is having a moment. VAPI raised $20M to build voice agents as a service. LiveKit open-sourced their entire agent framework. And NVIDIA just dropped PersonaPlex, a model that makes traditional STT→LLM→TTS pipelines look like antiques.
If you want to build voice AI that runs entirely on your hardware—no cloud API calls, full privacy, telephony-capable—this guide covers everything from turn-key Docker setups to the cutting edge of voice-native models.
The Local Voice AI Stack
local-voice-ai is the fastest way to get a fully local voice assistant running. It’s a Docker Compose setup that wires together:
- LiveKit for WebRTC real-time audio and rooms
- LiveKit Agents (Python) to orchestrate the STT→LLM→TTS pipeline
- Nemotron Speech (default) or Whisper for speech-to-text
- llama.cpp running Qwen3-4B for the LLM
- Kokoro for text-to-speech synthesis
- Next.js frontend UI
Getting Started
```bash
# Clone the repo
git clone https://github.com/ShayneP/local-voice-ai.git
cd local-voice-ai

# Start everything (will prompt for CPU or GPU)
./compose-up.sh   # Mac/Linux
./compose-up.ps1  # Windows
```
Visit http://localhost:3000 and start talking. First run downloads several GB of models—expect 10+ minutes on decent hardware.
Requirements:
- Docker + Docker Compose
- No GPU required (CPU works, but slower)
- 12GB+ RAM recommended
The architecture is modular. Each component runs in its own container and communicates over OpenAI-compatible APIs, so you can swap out any piece:
| Component | Default | Alternatives |
|---|---|---|
| STT | Nemotron Speech | Whisper, Deepgram |
| LLM | Qwen3-4B via llama.cpp | Any GGUF model |
| TTS | Kokoro | Cartesia, ElevenLabs |
| Transport | LiveKit | — |
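Because every hop speaks the OpenAI wire format, you can also talk to the pieces directly. Here's a minimal sketch that queries the llama.cpp container with the standard `openai` client — the port (8080) and model name are assumptions, so check `docker-compose.yml` in local-voice-ai for the actual values:

```python
# Minimal sketch: hit the local llama.cpp server over its OpenAI-compatible API.
# The port and model name are assumptions; check docker-compose.yml for the real ones.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3-4b",  # llama.cpp typically ignores/echoes this field
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```

The same trick works in reverse: point the agent's LLM (or STT/TTS) client at any other OpenAI-compatible endpoint and the rest of the stack doesn't notice.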
LiveKit vs VAPI: The Core Trade-off
Both platforms build voice AI agents. The difference is philosophy.
LiveKit is open-source LEGOs. You get primitives—rooms, participants, tracks—and build whatever you want. Full customization, self-hostable, but requires more engineering.
VAPI is turnkey Playmobil. Opinionated, closed-source, faster to deploy common patterns (appointment booking, customer service), but less flexible for edge cases.
Feature Comparison
| Feature | LiveKit | VAPI |
|---|---|---|
| Open Source | ✅ Fully OSS | ❌ Closed source |
| Self-hosting | ✅ Run your own servers | ❌ Cloud only |
| Video support | ✅ Full WebRTC video | ❌ Audio only |
| Telephony | ✅ SIP/PSTN integration | ✅ Strong focus |
| Turn detection | ✅ Semantic transformer model | ✅ “Smart endpointing” |
| Multi-participant | ✅ Full room support | ❌ 1:1 focused |
| Pricing | Free tier + usage | $0.05/min + providers |
Pricing Reality Check
VAPI’s $0.05/min base fee sounds cheap until you add provider costs. Real-world estimates put total VAPI costs at $0.13–0.33/min once you include LLM, STT, and TTS.
LiveKit Cloud’s free tier includes 1,000 minutes. Beyond that, you’re paying for compute—but you can self-host entirely if you want to own the infrastructure.
For ~3,000 min/month, community estimates land roughly as follows (quick math below the list):
- Retell AI: ~$275–320/mo (transparent pricing)
- VAPI: ~$370–500+/mo (add-ons unpredictable)
- LiveKit Cloud: Varies by usage pattern
- Self-hosted: Just your infrastructure costs
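To sanity-check those figures, the arithmetic is straightforward. Per-minute rates are taken from the estimates above; your actual bill depends on which STT/LLM/TTS providers you attach:

```python
# Back-of-the-envelope monthly cost at ~3,000 minutes, using the per-minute
# figures quoted above. Provider mix determines where in the range you land.
minutes_per_month = 3_000

vapi_base = 0.05 * minutes_per_month         # platform fee only: $150
vapi_all_in_low = 0.13 * minutes_per_month   # ~$390 with cheaper providers
vapi_all_in_high = 0.33 * minutes_per_month  # ~$990 with premium providers

print(f"VAPI base fee: ${vapi_base:,.0f}/mo")
print(f"VAPI all-in:   ${vapi_all_in_low:,.0f}–${vapi_all_in_high:,.0f}/mo")
```

The platform fee is the smallest line item, which is exactly why the "add-ons unpredictable" caveat matters.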
LiveKit Telephony: The Local VAPI
LiveKit isn’t just WebRTC—it’s a full telephony stack. You can:
- Receive inbound calls via SIP trunks
- Make outbound calls programmatically
- Route calls to AI agents automatically
- Transfer calls to humans when needed
How It Works
- Get a phone number from LiveKit Phone Numbers or a SIP provider (Twilio, Telnyx, Plivo)
- Create an inbound trunk to receive calls
- Define dispatch rules that route callers to LiveKit rooms
- Run your agent that joins the room and handles the conversation
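Steps 2 and 3 can be scripted against the livekit-api Python SDK instead of clicked through a dashboard. A rough sketch, assuming current SIP request and field names (double-check them against your SDK version; the phone number and room prefix are placeholders):

```python
import asyncio
from livekit import api

async def provision():
    # Reads LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment.
    lkapi = api.LiveKitAPI()

    # Step 2: an inbound trunk that accepts calls to your number (placeholder).
    trunk = await lkapi.sip.create_sip_inbound_trunk(
        api.CreateSIPInboundTrunkRequest(
            trunk=api.SIPInboundTrunkInfo(
                name="inbound-trunk",
                numbers=["+15105550100"],
            )
        )
    )

    # Step 3: a dispatch rule that drops each caller into their own room.
    await lkapi.sip.create_sip_dispatch_rule(
        api.CreateSIPDispatchRuleRequest(
            rule=api.SIPDispatchRule(
                dispatch_rule_individual=api.SIPDispatchRuleIndividual(
                    room_prefix="call-",
                )
            ),
            trunk_ids=[trunk.sip_trunk_id],
        )
    )
    await lkapi.aclose()

asyncio.run(provision())
```

Step 4 is the agent itself, which picks up callers in the rooms those dispatch rules create: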
```python
from livekit.agents import Agent, AgentSession, JobContext

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    # Cascaded pipeline: STT, LLM, and TTS models selected via LiveKit Inference.
    session = AgentSession(
        stt=inference.STT("deepgram/nova-3"),
        llm=inference.LLM("openai/gpt-4.1-mini"),
        tts=inference.TTS("cartesia/sonic-3"),
    )
    agent = Agent(
        instructions="You are a helpful phone assistant.",
        # transfer_to_human and lookup_account are function tools defined elsewhere.
        tools=[transfer_to_human, lookup_account],
    )
    await session.start(agent=agent, room=ctx.room)
```
LiveKit handles DTMF tones, call transfers (cold and warm), SIP REFER, and integrates Krisp noise cancellation for noisy environments.
Supported SIP providers: Twilio, Telnyx, Exotel, Plivo, Wavix
This is genuinely “local VAPI”—the same capabilities, but you control the infrastructure.
The Latency Problem (And How to Solve It)
Traditional voice AI has a fundamental latency problem:
User speaks → STT (200-400ms) → LLM (500-2000ms) → TTS (200-400ms) → User hears
Total: 900-2800ms
That’s noticeable. It’s why conversations with voice assistants feel robotic—awkward pauses, no natural interruptions, weird turn-taking.
There are two approaches to solving this: optimize the cascade or skip it entirely.
Flux: Optimizing the Cascade
Deepgram Flux is the first STT model built specifically for voice agents. Instead of just transcribing words, Flux understands conversational flow:
Key features:
- ~260ms end-of-turn detection — Knows when speakers finish talking
- EagerEndOfTurn events — Start LLM processing before the user fully finishes speaking
- Built-in barge-in handling — Natural interruptions without VAD hacks
- Turn-based transcripts — Clean conversation structure, not word soup
- Nova-3 accuracy — Best-in-class transcription
The clever trick is EagerEndOfTurn: Flux detects when a user is probably done (based on prosody and semantics) and fires an early event. Your agent can start generating a response speculatively. If the user keeps talking, Flux sends a TurnResumed event and you cancel the draft.
```python
# Flux with eager turn detection; `client` is assumed to be an initialized
# Deepgram SDK client.
async with client.listen.v2.connect(
    model="flux-general-en",
    eager_eot_threshold=0.5,  # fire EagerEndOfTurn events at 50% confidence
    eot_threshold=0.7,        # confirm the turn at 70% confidence
) as connection:
    # Handle EagerEndOfTurn → start the LLM call speculatively
    # Handle TurnResumed    → cancel the draft
    # Handle EndOfTurn      → send the final response
    ...
```
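Here's what that event handling looks like as a framework-agnostic asyncio sketch. `generate_reply()` and `speak()` are hypothetical stand-ins for your LLM call and TTS playback, not Flux SDK functions:

```python
import asyncio

# Hypothetical stand-ins for your LLM call and TTS playback.
async def generate_reply(transcript: str) -> str:
    await asyncio.sleep(0.3)  # pretend this is an LLM request
    return f"(reply to: {transcript})"

async def speak(text: str) -> None:
    print("agent:", text)  # pretend this streams TTS audio to the caller

draft_task: asyncio.Task | None = None

async def on_eager_end_of_turn(transcript: str) -> None:
    """User is *probably* done: start drafting a reply in the background."""
    global draft_task
    draft_task = asyncio.create_task(generate_reply(transcript))

async def on_turn_resumed() -> None:
    """User kept talking: throw the draft away."""
    global draft_task
    if draft_task is not None:
        draft_task.cancel()
        draft_task = None

async def on_end_of_turn(transcript: str) -> None:
    """Turn confirmed: reuse the surviving draft, otherwise generate now."""
    global draft_task
    task = draft_task or asyncio.create_task(generate_reply(transcript))
    draft_task = None
    await speak(await task)
```

Every cancelled draft is a wasted LLM call, which is where the extra API cost noted below comes from.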
This shaves 200-400ms off the cascade by parallelizing STT completion with LLM generation.
Trade-off: EagerEndOfTurn can increase LLM API calls by 50-70% due to speculative generation. Worth it for latency-critical applications.
Voice-Native Models: Skip the Cascade Entirely
PersonaPlex: NVIDIA’s Audio-Native Model
PersonaPlex works directly with audio tokens. No ASR→LLM→TTS pipeline. The model listens and speaks simultaneously in a dual-stream configuration.
Results:
- Turn-taking latency: 170ms
- Interruption latency: 240ms
- Full-duplex (listens while speaking)
- Natural backchanneling (“uh-huh”, “okay”)
Built on the Moshi architecture with 7B parameters, PersonaPlex can:
- Voice prompt: Clone any voice from a sample
- Text prompt: Define any persona or role
- Handle interruptions: Like a human conversation
The API is available now. Think of it as “conversation as a service” rather than “voice pipeline as a service.”
Moshi: Open-Source Full-Duplex
Moshi from Kyutai Labs is the open-source predecessor to PersonaPlex:
- 7.6B parameters, runs on-device
- 160ms theoretical latency, 200ms practical
- Full-duplex: listens and speaks simultaneously
- Uses Mimi, a neural audio codec for streaming
- Fully open-source (Apache 2.0)
```bash
# Run Moshi locally
pip install moshi
python -m moshi.server
```
This is the foundation PersonaPlex builds on (Moshi itself sits on top of Helium, Kyutai’s text LLM backbone).
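Mimi is also published as a standalone checkpoint, and recent versions of transformers ship a `MimiModel` class, so you can see what the codec does with a quick encode/decode round-trip. A minimal sketch — the silent waveform is just a placeholder for real 24 kHz audio, and the printed shapes are illustrative:

```python
import numpy as np
from transformers import AutoFeatureExtractor, MimiModel

# Load the Mimi codec that Moshi streams over.
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence as a stand-in for real 24 kHz mono audio.
waveform = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=waveform,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

# Encode to discrete audio tokens, then decode back to a waveform.
codes = model.encode(inputs["input_values"]).audio_codes
reconstructed = model.decode(codes).audio_values
print(codes.shape, reconstructed.shape)
```

Those discrete audio tokens are what Moshi's language model actually predicts, instead of text.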
Ultravox: Audio-Native LLM
Ultravox from Fixie takes a different approach: extend any open-weight LLM with a multimodal projector that converts audio directly into the LLM’s embedding space.
- Built on Llama 3, Mistral, or Gemma
- Audio goes directly to the LLM—no separate ASR
- Can understand paralinguistic cues (timing, emotion)
- Available on HuggingFace in multiple sizes (8B, 70B)
Think of it as “an LLM that can hear.” It processes speech natively rather than converting to text first.
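If you want to poke at it locally, the published checkpoints load through the standard transformers pipeline. A minimal sketch — the model id is assumed to be one of the released Llama-3.1-8B-based checkpoints, so check the fixie-ai HuggingFace org for current versions:

```python
import librosa
import transformers

# Load an Ultravox checkpoint (assumed id; see the fixie-ai HF org for the latest).
# trust_remote_code pulls in the custom audio projector.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
    trust_remote_code=True,
)

# 16 kHz mono audio in, text out — no separate ASR step.
audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a concise voice assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```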
Architecture Comparison
| Approach | Latency | Customization | Self-hostable | Open Source |
|---|---|---|---|---|
| Traditional (STT→LLM→TTS) | 900-2800ms | Full control over each component | ✅ Yes | ✅ Yes |
| LiveKit local-voice-ai | 500-1500ms | Swap any component | ✅ Yes | ✅ Yes |
| Deepgram Flux + cascade | 400-800ms | Optimized turn detection | ❌ Cloud STT | ❌ No |
| VAPI | 400-1000ms | Limited | ❌ No | ❌ No |
| PersonaPlex | ~170ms | Voice + persona prompting | ❌ API only | ❌ No |
| Moshi | ~200ms | Full model access | ✅ Yes | ✅ Yes |
| Ultravox | ~300ms | Fine-tune the model | ✅ Yes | ✅ Yes |
Which Should You Use?
Use local-voice-ai / LiveKit if:
- You need full control and self-hosting
- Privacy is critical (healthcare, finance)
- You want telephony integration
- You’re building something custom
Use VAPI if:
- You want fast deployment for common patterns
- You don’t have engineering resources for infrastructure
- Telephony is your primary use case
Use PersonaPlex/Moshi if:
- Sub-200ms latency is critical
- Natural conversation dynamics matter
- You want full-duplex (no turn-taking artifacts)
Use Ultravox if:
- You want an audio-native open model
- You need to fine-tune on your domain
- You want to understand speech paralinguistics
The Future: Voice-Native Is Inevitable
The STT→LLM→TTS cascade is a historical artifact. We built it because we had good ASR, good LLMs, and good TTS—but not good speech-to-speech models.
That’s changing fast. PersonaPlex, Moshi, and Ultravox prove that models can work directly with audio, eliminating the latency and unnaturalness of cascaded systems.
In 2-3 years, the cascade approach will feel as outdated as rule-based chatbots feel today. The voice-native models will just be better.
But for now, if you need production voice AI that runs on your hardware, supports telephony, and gives you full control—LiveKit’s stack is the way to go. And local-voice-ai gets you there in a single docker compose up.
Links:
- local-voice-ai — Docker-based local voice assistant
- LiveKit Agents — Open-source voice AI framework
- LiveKit Telephony — SIP/PSTN integration
- Deepgram Flux — Conversational STT with turn detection
- PersonaPlex — NVIDIA’s 170ms latency voice model
- Moshi — Open-source full-duplex model
- Ultravox — Audio-native LLM from Fixie