How to Run NVIDIA PersonaPlex Locally: Full-Duplex Voice AI with Character Control

By Prahlad Menon · 5 min read

NVIDIA just open-sourced PersonaPlex, a 7B speech-to-speech model that does something no commercial voice API can match: it holds a consistent character while having a real-time, full-duplex conversation. You talk over it, it adapts. You give it a persona, it stays in character. You give it a voice sample, it sounds like that person.

MIT licensed. Runs on a single GPU. Here’s how to set it up.

What You’re Getting

PersonaPlex isn’t a TTS engine or a voice assistant wrapper. It’s a single model that simultaneously:

  • Listens to your speech in real-time
  • Speaks back while you’re still talking (full-duplex)
  • Maintains a persona defined by a text prompt
  • Clones a voice from an audio sample
  • Handles interruptions, barge-ins, and overlapping speech naturally

It’s built on the Moshi architecture from Kyutai and fine-tuned by NVIDIA on synthetic + real conversation data. The key insight: rather than chaining ASR → LLM → TTS (the way most voice assistants work), PersonaPlex does everything in one pass through a single 7B model. Lower latency, more natural flow.

Prerequisites

| Requirement | Details |
| --- | --- |
| GPU | NVIDIA GPU with 16GB+ VRAM (RTX 4090, A100, etc.) |
| CPU fallback | --cpu-offload flag for lower VRAM; pure CPU for offline only |
| OS | Linux (Ubuntu/Debian or Fedora/RHEL) |
| Python | 3.10+ |
| HuggingFace account | Free; needed to accept the model license |
| Disk | ~15GB for model weights |

No NVIDIA GPU? You can rent one on RunPod, Lambda, or Vast.ai for $0.30–1.50/hr. An A100 40GB instance is ideal.
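The VRAM guidance above can be captured in a small decision helper. This is an illustrative sketch, not part of PersonaPlex; the function name is ours, and only the 16GB threshold and the --cpu-offload flag come from the table.

```python
from typing import Optional

def server_flags(vram_gb: Optional[float]) -> list[str]:
    """Return extra flags for `python -m moshi.server` given VRAM in GB.

    None means no usable NVIDIA GPU was detected (offline/CPU use only).
    """
    if vram_gb is None or vram_gb < 16:
        # Below 16GB, offload some layers to system RAM.
        return ["--cpu-offload"]
    return []  # 16GB+ runs fully on the GPU

print(server_flags(24))  # RTX 4090 class: no extra flags needed
print(server_flags(12))  # smaller cards need offloading
```

Pass whatever this returns alongside the server command shown in Step 5.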

Step 1: Install System Dependencies

PersonaPlex uses the Opus audio codec for real-time streaming. Install the development library:

# Ubuntu/Debian
sudo apt update && sudo apt install -y libopus-dev git

# Fedora/RHEL
sudo dnf install -y opus-devel git

Step 2: Clone the Repository

git clone https://github.com/NVIDIA/personaplex.git
cd personaplex

Step 3: Set Up Python Environment

Create an isolated environment to avoid dependency conflicts:

python -m venv venv
source venv/bin/activate
pip install --upgrade pip

Install PersonaPlex (it’s packaged as moshi):

pip install moshi/.

For Blackwell GPUs (RTX 5090, B100, etc.): You need the CUDA 13.0 PyTorch build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

For CPU offloading (if your GPU has less than 16GB VRAM):

pip install accelerate

Step 4: Get the Model Weights

  1. Go to nvidia/personaplex-7b-v1 on HuggingFace
  2. Accept the model license
  3. Create an access token at huggingface.co/settings/tokens
  4. Set your token:
export HF_TOKEN=hf_your_token_here

The model downloads automatically on first run (~15GB).
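Since the download only starts on first run, a token typo surfaces late. Here is a hedged pre-flight check you can run first; the helper name is ours (not part of moshi), and it only checks the token's shape, not whether the license was accepted.

```python
import os

def require_hf_token() -> str:
    """Fail fast if HF_TOKEN is missing or doesn't look like a HF token."""
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        raise RuntimeError(
            "HF_TOKEN is missing or malformed; create one at "
            "huggingface.co/settings/tokens and `export HF_TOKEN=hf_...`"
        )
    return token
```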

Step 5: Launch the Live Server

This is where it gets fun. One command launches a web UI with real-time voice conversation:

SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR"

The server generates temporary SSL certificates (needed for browser microphone access) and starts listening. You’ll see output like:

Access the Web UI directly at https://localhost:8998

Open that URL in your browser, allow microphone access, and start talking.

Low VRAM? Add the offload flag:

SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --cpu-offload

Step 6: Try Offline Processing

Don’t have a GPU handy for real-time? You can process pre-recorded audio files:

Basic Assistant Mode

HF_TOKEN=hf_your_token \
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

Customer Service Role

HF_TOKEN=hf_your_token \
python -m moshi.offline \
  --voice-prompt "NATM1.pt" \
  --text-prompt "$(cat assets/test/prompt_service.txt)" \
  --input-wav "assets/test/input_service.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

For CPU-only offline processing, install the CPU PyTorch build and add --cpu-offload.
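If you have several recordings, you can batch the offline invocations. This sketch builds the same command line as the examples above for each file; the file paths are hypothetical, and the loop prints commands (dry-run) rather than executing them. Swap print for subprocess.run to actually process files.

```python
import sys
from pathlib import Path

def offline_cmd(wav: Path, voice: str = "NATF2.pt", seed: int = 42424242) -> list[str]:
    """Build the `python -m moshi.offline` argv for one input recording."""
    stem = wav.with_suffix("")  # e.g. calls/monday.wav -> calls/monday
    return [
        sys.executable, "-m", "moshi.offline",
        "--voice-prompt", voice,
        "--input-wav", str(wav),
        "--seed", str(seed),
        "--output-wav", f"{stem}_out.wav",
        "--output-text", f"{stem}_out.json",
    ]

# Hypothetical recordings; replace with your own paths.
for wav in [Path("calls/monday.wav"), Path("calls/tuesday.wav")]:
    print(" ".join(offline_cmd(wav)))
```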

Understanding Voice Prompts

PersonaPlex ships with 18 pre-built voice embeddings:

| Category | Voices | Style |
| --- | --- | --- |
| Natural Female | NATF0, NATF1, NATF2, NATF3 | Conversational, warm |
| Natural Male | NATM0, NATM1, NATM2, NATM3 | Conversational, natural |
| Variety Female | VARF0–VARF4 | Diverse range of tones |
| Variety Male | VARM0–VARM4 | Diverse range of tones |

Use the NAT voices for natural-sounding conversations. The VAR voices offer more character variety. Pass them via --voice-prompt:

--voice-prompt "NATF2.pt"   # Natural female voice 2
--voice-prompt "VARM3.pt"   # Variety male voice 3
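The 18 bundled voice-prompt filenames follow a simple pattern (category prefix + index + ".pt"), so you can enumerate them programmatically, e.g. to loop offline runs over every voice. A small sketch reproducing the table above:

```python
def voice_prompts() -> list[str]:
    """Enumerate the 18 pre-built voice embedding filenames."""
    names = []
    for prefix, count in [("NATF", 4), ("NATM", 4), ("VARF", 5), ("VARM", 5)]:
        names += [f"{prefix}{i}.pt" for i in range(count)]
    return names

print(len(voice_prompts()))  # 18
```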

Writing Effective Persona Prompts

The --text-prompt flag is where PersonaPlex really differentiates itself. You define the character’s role, knowledge, and personality in plain text.

Simple Assistant

You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.

Customer Service Agent

You work for CitySan Services which is a waste management company and your name is Ayelen Lucero. Information: Verify customer name Omar Torres. Current schedule: every other week. Upcoming pickup: April 12th. Compost bin service available for $8/month add-on.

Creative Character

You enjoy having a good conversation. Have a technical discussion about fixing a reactor core on a spaceship to Mars. You are an astronaut on a Mars mission. Your name is Alex. You are already dealing with a reactor core meltdown. Several ship systems are failing, and continued instability will lead to catastrophic failure. You explain what is happening and urgently ask for help thinking through how to stabilize the reactor.

Tips for Better Prompts

  1. Include specific facts: names, prices, schedules. The model uses these in conversation.
  2. Set the emotional tone: “urgent,” “casual,” or “empathetic” changes how it speaks.
  3. Give it constraints: stating what it knows and doesn’t know prevents hallucination.
  4. Start with “You enjoy having a good conversation” for casual, open-ended chats; this phrasing was in the training data and produces the most natural results.
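Tip 1 is easy to operationalize: keep the facts as structured data and assemble the prompt string just before launch. This helper is our own illustration (the field names and template are assumptions); the model only ever sees the final plain-text string passed via --text-prompt.

```python
def service_prompt(agent: str, company: str, business: str, facts: list[str]) -> str:
    """Assemble a customer-service persona prompt from structured facts."""
    header = f"You work for {company} which is a {business} and your name is {agent}."
    return header + " Information: " + " ".join(facts)

prompt = service_prompt(
    agent="Ayelen Lucero",
    company="CitySan Services",
    business="waste management company",
    facts=[
        "Verify customer name Omar Torres.",
        "Current schedule: every other week.",
        "Upcoming pickup: April 12th.",
    ],
)
print(prompt)
```

Keeping the facts in a list makes it trivial to swap in a new customer record per call without rewriting the persona.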

Architecture: How It Works

PersonaPlex uses a dual-stream architecture based on Moshi:

┌──────────────────────────────────────────────────┐
│              PersonaPlex (7B)                    │
│                                                  │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐   │
│  │ Text     │    │ Helium    │    │ Audio    │   │
│  │ Prompt   │───▶│ LLM       │───▶│ Codec    │   │
│  │ (role)   │    │ Backbone  │    │ (Mimi)   │   │
│  └──────────┘    │           │    └────┬─────┘   │
│  ┌──────────┐    │  Dual     │         │         │
│  │ Voice    │───▶│  Stream   │    Output Audio   │
│  │ Prompt   │    │  Decoder  │         │         │
│  │ (audio)  │    └─────┬─────┘         │         │
│  └──────────┘          │               │         │
│                   ┌────┴────┐          │         │
│  Input Audio ────▶│ Encoder │          ▼         │
│  (your voice)     └─────────┘     Speaker Out    │
└──────────────────────────────────────────────────┘

Key design choices:

  • Single model — no ASR → LLM → TTS pipeline. Speech in, speech out.
  • Neural codec (Mimi) — encodes audio into tokens the LLM can process.
  • Full-duplex — separate streams for listening and speaking, processed concurrently.
  • Helium backbone — the underlying LLM from Kyutai, giving it strong language understanding.
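The full-duplex point is the one that differs most from turn-based assistants, so here is a conceptual toy, not the real Moshi/PersonaPlex internals: at every timestep the model consumes one frame of user audio and emits one frame of agent audio, so listening and speaking interleave instead of alternating in turns. toy_model is a stand-in with made-up arithmetic.

```python
def toy_model(user_frame: int, state: list[int]) -> int:
    """Stand-in for the model: fold the user frame into state, emit a frame."""
    state.append(user_frame)   # "listen": user audio updates the state
    return sum(state) % 256    # "speak": an agent frame comes out every step

def duplex_loop(user_frames: list[int]) -> list[int]:
    """One user frame in, one agent frame out, at every timestep."""
    state: list[int] = []
    agent_frames = []
    for frame in user_frames:
        agent_frames.append(toy_model(frame, state))
    return agent_frames

print(duplex_loop([10, 20, 30]))  # three frames in, three frames out
```

Because output is produced at every step rather than after end-of-turn detection, barge-ins and overlapping speech fall out of the framing for free.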

Comparison: PersonaPlex vs Alternatives

| Feature | PersonaPlex | OpenAI Voice | ElevenLabs | Moshi (base) |
| --- | --- | --- | --- | --- |
| Full-duplex | ✅ | ✅ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ✅ |
| Persona control | ✅ Text prompt | Limited (system prompt) | ❌ | ❌ |
| Voice cloning | ✅ Audio conditioning | ❌ | ✅ API only | ❌ |
| License | MIT | Proprietary | Proprietary | CC-BY |
| Parameters | 7B | Unknown | N/A | 7B |
| Cost | Free (your GPU) | Per-minute | Per-character | Free |
| Latency | Real-time | Real-time | ~1s | Real-time |

Running on Cloud GPUs

No local GPU? Here’s the fastest path, using RunPod as an example:

  1. Create a pod with the PyTorch 2.x template and an A100 40GB GPU
  2. SSH in and run the install steps above
  3. Forward port 8998: ssh -L 8998:localhost:8998 your-pod
  4. Open https://localhost:8998 in your browser

Google Colab (limited)

Colab’s free T4 (16GB) may work with --cpu-offload, but don’t expect smooth real-time performance. Better for offline processing.

Troubleshooting

“CUDA out of memory” → Add --cpu-offload to your command. This moves some layers to RAM.

Browser says “Not Secure” → Expected; the SSL certs are self-signed. Click “Advanced” → “Proceed.”

No audio output → Check that your browser has microphone permissions. Chrome works best.

Model download fails → Verify you accepted the license at huggingface.co/nvidia/personaplex-7b-v1 and that your HF_TOKEN is set correctly.

Blackwell GPU errors → Install the CUDA 13.0 PyTorch build: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

What to Build With This

PersonaPlex is MIT licensed and commercially ready. Some ideas:

  • AI receptionist — give it your business info, let it answer calls
  • Language tutor — set the persona as a patient teacher, practice conversation
  • Game NPCs — each character gets a unique voice + personality prompt
  • Customer service training — simulate difficult customer scenarios
  • Podcast co-host — set a personality and have it riff on topics in real time
  • Accessibility — voice interfaces for applications that currently require text

The key advantage over API-based solutions: zero marginal cost per conversation. Once you have the GPU, every additional minute of conversation is free.
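The zero-marginal-cost claim is easy to sanity-check with back-of-envelope arithmetic. The GPU rate below is the high end of the rental range cited earlier; the per-minute API price is a hypothetical placeholder, not a quote from any vendor.

```python
GPU_PER_HOUR = 1.50    # high end of the $0.30-1.50/hr rental range above
API_PER_MINUTE = 0.06  # hypothetical per-minute voice API price (assumption)

def breakeven_minutes_per_hour() -> float:
    """Conversation minutes per GPU-hour at which renting beats the API."""
    return GPU_PER_HOUR / API_PER_MINUTE

print(breakeven_minutes_per_hour())  # 25.0
```

Under these assumed prices, a rented GPU wins once it handles more than 25 minutes of conversation per hour, and a GPU you already own wins immediately.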


PersonaPlex is MIT licensed on GitHub. Paper: arXiv:2602.06053. Model: nvidia/personaplex-7b-v1 on HuggingFace.