Building a Completely Local Voice AI Agent: LiveKit, VAPI, and the Voice-Native Revolution

By Prahlad Menon

Voice AI is having a moment. VAPI raised $20M to build voice agents as a service. LiveKit open-sourced their entire agent framework. And NVIDIA just dropped PersonaPlex, a model that makes traditional STT→LLM→TTS pipelines look like antiques.

If you want to build voice AI that runs entirely on your hardware—no cloud API calls, full privacy, telephony-capable—this guide covers everything from turn-key Docker setups to the cutting edge of voice-native models.

The Local Voice AI Stack

local-voice-ai is the fastest way to get a fully local voice assistant running. It’s a Docker Compose setup that wires together:

  • LiveKit for WebRTC real-time audio and rooms
  • LiveKit Agents (Python) to orchestrate the STT→LLM→TTS pipeline
  • Nemotron Speech (default) or Whisper for speech-to-text
  • llama.cpp running Qwen3-4B for the LLM
  • Kokoro for text-to-speech synthesis
  • Next.js frontend UI

Getting Started

# Clone the repo
git clone https://github.com/ShayneP/local-voice-ai.git
cd local-voice-ai

# Start everything (will prompt for CPU or GPU)
./compose-up.sh  # Mac/Linux
./compose-up.ps1  # Windows

Visit http://localhost:3000 and start talking. First run downloads several GB of models—expect 10+ minutes on decent hardware.

Requirements:

  • Docker + Docker Compose
  • No GPU required (CPU works, but slower)
  • 12GB+ RAM recommended

The architecture is modular. Each component runs in its own container and communicates over OpenAI-compatible APIs, so you can swap out any piece:

Component | Default | Alternatives
STT | Nemotron Speech | Whisper, Deepgram
LLM | Qwen3-4B via llama.cpp | Any GGUF model
TTS | Kokoro | Cartesia, ElevenLabs
Transport | LiveKit
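
Because each container exposes an OpenAI-compatible API, swapping a piece mostly means pointing a client at a different base URL. As a minimal sketch, here is a direct chat call to the local llama.cpp container; the port and model name are assumptions, so check the compose file for the values your setup actually uses.

# Minimal sketch: talk to the local llama.cpp container over its
# OpenAI-compatible endpoint. The port and model name below are
# assumptions -- check docker-compose.yml for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",   # llama.cpp server (assumed port)
    api_key="not-needed-locally",          # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen3-4b",  # whichever GGUF model the container loaded
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)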

LiveKit vs VAPI: The Core Trade-off

Both platforms build voice AI agents. The difference is philosophy.

LiveKit is open-source LEGOs. You get primitives—rooms, participants, tracks—and build whatever you want. Full customization, self-hostable, but requires more engineering.

VAPI is turnkey Playmobil. Opinionated, closed-source, faster to deploy common patterns (appointment booking, customer service), but less flexible for edge cases.

Feature Comparison

Feature | LiveKit | VAPI
Open source | ✅ Fully OSS | ❌ Closed source
Self-hosting | ✅ Run your own servers | ❌ Cloud only
Video support | ✅ Full WebRTC video | ❌ Audio only
Telephony | ✅ SIP/PSTN integration | ✅ Strong focus
Turn detection | ✅ Semantic transformer model | ✅ “Smart endpointing”
Multi-participant | ✅ Full room support | ❌ 1:1 focused
Pricing | Free tier + usage | $0.05/min + providers

Pricing Reality Check

VAPI’s $0.05/min base fee sounds cheap until you add provider costs. Real-world estimates put total VAPI costs at $0.13–0.33/min once you include LLM, STT, and TTS.

LiveKit Cloud’s free tier includes 1,000 minutes. Beyond that, you’re paying for compute—but you can self-host entirely if you want to own the infrastructure.

For ~3,000 minutes/month, community estimates put costs at roughly:

  • Retell AI: ~$275–320/mo (transparent pricing)
  • VAPI: ~$370–500+/mo (add-ons unpredictable)
  • LiveKit Cloud: Varies by usage pattern
  • Self-hosted: Just your infrastructure costs
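
As a quick sanity check on those figures, here's a back-of-the-envelope calculation using the per-minute estimates from earlier in this section; the rates are rough community numbers, not quotes from any provider.

# Back-of-the-envelope: what $0.13-0.33/min means at ~3,000 minutes/month.
# Rates are the rough estimates quoted above, not official pricing.
minutes_per_month = 3_000
low_rate, high_rate = 0.13, 0.33  # estimated all-in cost per minute (VAPI)

print(f"${low_rate * minutes_per_month:,.0f} - ${high_rate * minutes_per_month:,.0f} per month")
# -> $390 - $990 per month

The ~$370-500+ community figure above corresponds to usage that stays toward the cheaper end of that per-minute range.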

LiveKit Telephony: The Local VAPI

LiveKit isn’t just WebRTC—it’s a full telephony stack. You can:

  • Receive inbound calls via SIP trunks
  • Make outbound calls programmatically
  • Route calls to AI agents automatically
  • Transfer calls to humans when needed

How It Works

  1. Get a phone number from LiveKit Phone Numbers or a SIP provider (Twilio, Telnyx, Plivo)
  2. Create an inbound trunk to receive calls
  3. Define dispatch rules that route callers to LiveKit rooms
  4. Run your agent that joins the room and handles the conversation

A minimal agent entrypoint looks like this (the tool functions are sketched after the snippet):

from livekit.agents import Agent, AgentSession, JobContext, inference

@server.rtc_session()  # `server` is the app's agent server object, created at startup
async def entrypoint(ctx: JobContext):
    session = AgentSession(
        stt=inference.STT("deepgram/nova-3"),
        llm=inference.LLM("openai/gpt-4.1-mini"),
        tts=inference.TTS("cartesia/sonic-3"),
    )

    agent = Agent(
        instructions="You are a helpful phone assistant.",
        tools=[transfer_to_human, lookup_account],  # defined below
    )

    await session.start(agent=agent, room=ctx.room)
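
The transfer_to_human and lookup_account tools aren't part of any SDK; they're functions you define yourself. Here's a hedged sketch of what they might look like using the function_tool decorator from recent livekit-agents releases; the bodies are placeholders for your own CRM lookup and call-transfer logic, and the exact decorator signature may differ by version.

# Placeholder implementations for the two tools referenced above.
# The @function_tool pattern follows recent livekit-agents releases;
# adjust to match the version you have installed.
from livekit.agents import RunContext, function_tool

@function_tool()
async def lookup_account(context: RunContext, phone_number: str) -> dict:
    """Look up the caller's account by phone number."""
    # Replace with a real CRM or database query.
    return {"phone_number": phone_number, "status": "active"}

@function_tool()
async def transfer_to_human(context: RunContext) -> str:
    """Hand the call off to a human agent."""
    # In production this would trigger a SIP transfer or a warm handoff.
    return "Transferring you to a human agent now."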

LiveKit handles DTMF tones, call transfers (cold and warm), SIP REFER, and integrates Krisp noise cancellation for noisy environments.

Supported SIP providers: Twilio, Telnyx, Exotel, Plivo, Wavix

This is genuinely “local VAPI”—the same capabilities, but you control the infrastructure.

The Latency Problem (And How to Solve It)

Traditional voice AI has a fundamental latency problem:

User speaks → STT (200-400ms) → LLM (500-2000ms) → TTS (200-400ms) → User hears
                                Total: 900-2800ms

That’s noticeable. It’s why conversations with voice assistants feel robotic—awkward pauses, no natural interruptions, weird turn-taking.

There are two approaches to solving this: optimize the cascade or skip it entirely.

Flux: Optimizing the Cascade

Deepgram Flux is the first STT model built specifically for voice agents. Instead of just transcribing words, Flux understands conversational flow:

Key features:

  • ~260ms end-of-turn detection — Knows when speakers finish talking
  • EagerEndOfTurn events — Start LLM processing before the user fully finishes speaking
  • Built-in barge-in handling — Natural interruptions without VAD hacks
  • Turn-based transcripts — Clean conversation structure, not word soup
  • Nova-3 accuracy — Best-in-class transcription

The clever trick is EagerEndOfTurn: Flux detects when a user is probably done (based on prosody and semantics) and fires an early event. Your agent can start generating a response speculatively. If the user keeps talking, Flux sends a TurnResumed event and you cancel the draft.

# Flux with eager turn detection (schematic -- event handlers sketched as comments)
from deepgram import DeepgramClient

client = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment

async with client.listen.v2.connect(
    model="flux-general-en",
    eager_eot_threshold=0.5,  # Fire early events at 50% confidence
    eot_threshold=0.7,        # Confirm turn at 70% confidence
) as connection:
    # Handle EagerEndOfTurn → start LLM call
    # Handle TurnResumed → cancel draft
    # Handle EndOfTurn → send final response
    ...

This shaves 200-400ms off the cascade by parallelizing STT completion with LLM generation.

Trade-off: EagerEndOfTurn can increase LLM API calls by 50-70% due to speculative generation. Worth it for latency-critical applications.
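
The speculative pattern itself is just careful task management. Below is a provider-agnostic asyncio sketch of the EagerEndOfTurn / TurnResumed / EndOfTurn flow; the event names mirror Flux's, but _draft_response is a stand-in for whatever LLM client you actually use.

# Provider-agnostic sketch of speculative response generation.
import asyncio


class SpeculativeResponder:
    """Drafts a reply on EagerEndOfTurn, discards it on TurnResumed."""

    def __init__(self) -> None:
        self._draft_task: asyncio.Task | None = None

    async def _draft_response(self, transcript: str) -> str:
        # Stand-in for a real LLM call; swap in your client of choice.
        await asyncio.sleep(0.5)
        return f"(reply to: {transcript})"

    def on_eager_end_of_turn(self, transcript: str) -> None:
        # Start generating before the turn is confirmed.
        self._draft_task = asyncio.create_task(self._draft_response(transcript))

    def on_turn_resumed(self) -> None:
        # The user kept talking: throw the draft away.
        if self._draft_task and not self._draft_task.done():
            self._draft_task.cancel()
        self._draft_task = None

    async def on_end_of_turn(self, transcript: str) -> str:
        # Turn confirmed: reuse the draft if it survived, else generate fresh.
        task, self._draft_task = self._draft_task, None
        if task is not None:
            try:
                return await task
            except asyncio.CancelledError:
                pass
        return await self._draft_response(transcript)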

Voice-Native Models: Skip the Cascade Entirely

PersonaPlex: NVIDIA’s Audio-Native Model

PersonaPlex works directly with audio tokens. No ASR→LLM→TTS pipeline. The model listens and speaks simultaneously in a dual-stream configuration.

Results:

  • Turn-taking latency: 170ms
  • Interruption latency: 240ms
  • Full-duplex (listens while speaking)
  • Natural backchanneling (“uh-huh”, “okay”)

Built on the Moshi architecture with 7B parameters, PersonaPlex can:

  • Voice prompt: Clone any voice from a sample
  • Text prompt: Define any persona or role
  • Handle interruptions: Like a human conversation

The API is available now. Think of it as “conversation as a service” rather than “voice pipeline as a service.”

Moshi: Open-Source Full-Duplex

Moshi from Kyutai Labs is the open-source predecessor to PersonaPlex:

  • 7.6B parameters, runs on-device
  • 160ms theoretical latency, 200ms practical
  • Full-duplex: listens and speaks simultaneously
  • Uses Mimi, a neural audio codec for streaming
  • Fully open-source (Apache 2.0)

# Run Moshi locally
pip install moshi
python -m moshi.server

This is the architecture that PersonaPlex builds on; Moshi itself pairs the Mimi codec with Helium, Kyutai's text LLM backbone.

Ultravox: Audio-Native LLM

Ultravox from Fixie takes a different approach: extend any open-weight LLM with a multimodal projector that converts audio directly into the LLM’s embedding space.

  • Built on Llama 3, Mistral, or Gemma
  • Audio goes directly to the LLM—no separate ASR
  • Can understand paralinguistic cues (timing, emotion)
  • Available on HuggingFace in multiple sizes (8B, 70B)

Think of it as “an LLM that can hear.” It processes speech natively rather than converting to text first.
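
If you want to try it, the HuggingFace model cards document a transformers pipeline pattern roughly like the sketch below; treat the checkpoint name, audio handling, and argument names as assumptions to verify against the card for the model you pick.

# Rough sketch of running an Ultravox checkpoint via transformers,
# following the pattern on the HuggingFace model cards. The checkpoint
# name and argument names are assumptions -- verify against the card.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4",  # assumed checkpoint name
    trust_remote_code=True,          # Ultravox ships custom model code
)

audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a helpful voice assistant."}]

result = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sr},
    max_new_tokens=64,
)
print(result)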

Architecture Comparison

Approach | Latency | Customization | Self-hostable | Open Source
Traditional (STT→LLM→TTS) | 900-2800ms | Full control over each component | ✅ Yes | ✅ Yes
LiveKit local-voice-ai | 500-1500ms | Swap any component | ✅ Yes | ✅ Yes
Deepgram Flux + cascade | 400-800ms | Optimized turn detection | ❌ Cloud STT | ❌ No
VAPI | 400-1000ms | Limited | ❌ No | ❌ No
PersonaPlex | ~170ms | Voice + persona prompting | ❌ API only | ❌ No
Moshi | ~200ms | Full model access | ✅ Yes | ✅ Yes
Ultravox | ~300ms | Fine-tune the model | ✅ Yes | ✅ Yes

Which Should You Use?

Use local-voice-ai / LiveKit if:

  • You need full control and self-hosting
  • Privacy is critical (healthcare, finance)
  • You want telephony integration
  • You’re building something custom

Use VAPI if:

  • You want fast deployment for common patterns
  • You don’t have engineering resources for infrastructure
  • Telephony is your primary use case

Use PersonaPlex/Moshi if:

  • Sub-200ms latency is critical
  • Natural conversation dynamics matter
  • You want full-duplex (no turn-taking artifacts)

Use Ultravox if:

  • You want an audio-native open model
  • You need to fine-tune on your domain
  • You want to understand speech paralinguistics

The Future: Voice-Native Is Inevitable

The STT→LLM→TTS cascade is a historical artifact. We built it because we had good ASR, good LLMs, and good TTS—but not good speech-to-speech models.

That’s changing fast. PersonaPlex, Moshi, and Ultravox prove that models can work directly with audio, eliminating the latency and unnaturalness of cascaded systems.

In 2-3 years, the cascade approach will feel as outdated as rule-based chatbots feel today. The voice-native models will just be better.

But for now, if you need production voice AI that runs on your hardware, supports telephony, and gives you full control—LiveKit’s stack is the way to go. And local-voice-ai gets you there in a single docker compose up.
