How to Run NVIDIA PersonaPlex Locally: Full-Duplex Voice AI with Character Control

By Prahlad Menon · 5 min read

NVIDIA just open-sourced PersonaPlex, a 7B speech-to-speech model that does something no commercial voice API can match: it holds a consistent character while having a real-time, full-duplex conversation. You talk over it, it adapts. You give it a persona, it stays in character. You give it a voice sample, it sounds like that person.

MIT licensed. Runs on a single GPU. Here’s how to set it up.

What You’re Getting

PersonaPlex isn’t a TTS engine or a voice assistant wrapper. It’s a single model that simultaneously:

  • Listens to your speech in real-time
  • Speaks back while you’re still talking (full-duplex)
  • Maintains a persona defined by a text prompt
  • Clones a voice from an audio sample
  • Handles interruptions, barge-ins, and overlapping speech naturally

It’s built on the Moshi architecture from Kyutai and fine-tuned by NVIDIA on synthetic + real conversation data. The key insight: rather than chaining ASR → LLM → TTS (the way most voice assistants work), PersonaPlex does everything in one pass through a single 7B model. Lower latency, more natural flow.

Prerequisites

| Requirement | Details |
| --- | --- |
| GPU | NVIDIA GPU with 16GB+ VRAM (RTX 4090, A100, etc.) |
| CPU fallback | --cpu-offload flag for lower VRAM; pure CPU for offline only |
| OS | Linux (Ubuntu/Debian or Fedora/RHEL) |
| Python | 3.10+ |
| HuggingFace account | Free; needed to accept the model license |
| Disk | ~15GB for model weights |

No NVIDIA GPU? You can rent one on RunPod, Lambda, or Vast.ai for $0.30–1.50/hr. An A100 40GB instance is ideal.
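The VRAM guidance above can be captured in a small decision helper. This is an illustrative sketch, not part of PersonaPlex; the function name is ours, and only the 16GB threshold and the --cpu-offload flag come from the table.

```python
from typing import Optional

def server_flags(vram_gb: Optional[float]) -> list[str]:
    """Return extra flags for `python -m moshi.server` given VRAM in GB.

    None means no usable NVIDIA GPU was detected (offline/CPU use only).
    """
    if vram_gb is None or vram_gb < 16:
        # Below 16GB, offload some layers to system RAM.
        return ["--cpu-offload"]
    return []  # 16GB+ runs fully on the GPU

print(server_flags(24))  # RTX 4090 class: no extra flags needed
print(server_flags(12))  # smaller cards need offloading
```

Pass whatever this returns alongside the server command shown in Step 5.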

Step 1: Install System Dependencies

PersonaPlex uses the Opus audio codec for real-time streaming. Install the development library:

# Ubuntu/Debian
sudo apt update && sudo apt install -y libopus-dev git

# Fedora/RHEL
sudo dnf install -y opus-devel git

Step 2: Clone the Repository

git clone https://github.com/NVIDIA/personaplex.git
cd personaplex

Step 3: Set Up Python Environment

Create an isolated environment to avoid dependency conflicts:

python -m venv venv
source venv/bin/activate
pip install --upgrade pip

Install PersonaPlex (it’s packaged as moshi):

pip install moshi/.

For Blackwell GPUs (RTX 5090, B100, etc.): You need the CUDA 13.0 PyTorch build:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

For CPU offloading (if your GPU has less than 16GB VRAM):

pip install accelerate

Step 4: Get the Model Weights

  1. Go to nvidia/personaplex-7b-v1 on HuggingFace
  2. Accept the model license
  3. Create an access token at huggingface.co/settings/tokens
  4. Set your token:
export HF_TOKEN=hf_your_token_here

The model downloads automatically on first run (~15GB).
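Since the download only starts on first run, a token typo surfaces late. Here is a hedged pre-flight check you can run first; the helper name is ours (not part of moshi), and it only checks the token's shape, not whether the license was accepted.

```python
import os

def require_hf_token() -> str:
    """Fail fast if HF_TOKEN is missing or doesn't look like a HF token."""
    token = os.environ.get("HF_TOKEN", "")
    if not token.startswith("hf_"):
        raise RuntimeError(
            "HF_TOKEN is missing or malformed; create one at "
            "huggingface.co/settings/tokens and `export HF_TOKEN=hf_...`"
        )
    return token
```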

Step 5: Launch the Live Server

This is where it gets fun. One command launches a web UI with real-time voice conversation:

SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR"

The server generates temporary SSL certificates (needed for browser microphone access) and starts listening. You’ll see output like:

Access the Web UI directly at https://localhost:8998

Open that URL in your browser, allow microphone access, and start talking.

Low VRAM? Add the offload flag:

SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --cpu-offload

Step 6: Try Offline Processing

Don’t have a GPU handy for real-time? You can process pre-recorded audio files:

Basic Assistant Mode

HF_TOKEN=hf_your_token \
python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

Customer Service Role

HF_TOKEN=hf_your_token \
python -m moshi.offline \
  --voice-prompt "NATM1.pt" \
  --text-prompt "$(cat assets/test/prompt_service.txt)" \
  --input-wav "assets/test/input_service.wav" \
  --seed 42424242 \
  --output-wav "output.wav" \
  --output-text "output.json"

For CPU-only offline processing, install the CPU PyTorch build and add --cpu-offload.
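If you have several recordings, you can batch the offline invocations. This sketch builds the same command line as the examples above for each file; the file paths are hypothetical, and the loop prints commands (dry-run) rather than executing them. Swap print for subprocess.run to actually process files.

```python
import sys
from pathlib import Path

def offline_cmd(wav: Path, voice: str = "NATF2.pt", seed: int = 42424242) -> list[str]:
    """Build the `python -m moshi.offline` argv for one input recording."""
    stem = wav.with_suffix("")  # e.g. calls/monday.wav -> calls/monday
    return [
        sys.executable, "-m", "moshi.offline",
        "--voice-prompt", voice,
        "--input-wav", str(wav),
        "--seed", str(seed),
        "--output-wav", f"{stem}_out.wav",
        "--output-text", f"{stem}_out.json",
    ]

# Hypothetical recordings; replace with your own paths.
for wav in [Path("calls/monday.wav"), Path("calls/tuesday.wav")]:
    print(" ".join(offline_cmd(wav)))
```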

Understanding Voice Prompts

PersonaPlex ships with 18 pre-built voice embeddings:

| Category | Voices | Style |
| --- | --- | --- |
| Natural Female | NATF0, NATF1, NATF2, NATF3 | Conversational, warm |
| Natural Male | NATM0, NATM1, NATM2, NATM3 | Conversational, natural |
| Variety Female | VARF0–VARF4 | Diverse range of tones |
| Variety Male | VARM0–VARM4 | Diverse range of tones |

Use the NAT voices for natural-sounding conversations. The VAR voices offer more character variety. Pass them via --voice-prompt:

--voice-prompt "NATF2.pt"   # Natural female voice 2
--voice-prompt "VARM3.pt"   # Variety male voice 3
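The 18 bundled voice-prompt filenames follow a simple pattern (category prefix + index + ".pt"), so you can enumerate them programmatically, e.g. to loop offline runs over every voice. A small sketch reproducing the table above:

```python
def voice_prompts() -> list[str]:
    """Enumerate the 18 pre-built voice embedding filenames."""
    names = []
    for prefix, count in [("NATF", 4), ("NATM", 4), ("VARF", 5), ("VARM", 5)]:
        names += [f"{prefix}{i}.pt" for i in range(count)]
    return names

print(len(voice_prompts()))  # 18
```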

Writing Effective Persona Prompts

The --text-prompt flag is where PersonaPlex really differentiates itself. You define the character’s role, knowledge, and personality in plain text.

Simple Assistant

You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.

Customer Service Agent

You work for CitySan Services which is a waste management company and your name is Ayelen Lucero. Information: Verify customer name Omar Torres. Current schedule: every other week. Upcoming pickup: April 12th. Compost bin service available for $8/month add-on.

Creative Character

You enjoy having a good conversation. Have a technical discussion about fixing a reactor core on a spaceship to Mars. You are an astronaut on a Mars mission. Your name is Alex. You are already dealing with a reactor core meltdown. Several ship systems are failing, and continued instability will lead to catastrophic failure. You explain what is happening and urgently ask for help thinking through how to stabilize the reactor.

Tips for Better Prompts

  1. Include specific facts: names, prices, schedules. The model uses these in conversation.
  2. Set the emotional tone: “urgent,” “casual,” or “empathetic” changes how it speaks.
  3. Give it constraints: stating what it knows and doesn’t know prevents hallucination.
  4. Start with “You enjoy having a good conversation” for casual, open-ended chats; this phrasing was in the training data and produces the most natural results.
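Tip 1 is easy to operationalize: keep the facts as structured data and assemble the prompt string just before launch. This helper is our own illustration (the field names and template are assumptions); the model only ever sees the final plain-text string passed via --text-prompt.

```python
def service_prompt(agent: str, company: str, business: str, facts: list[str]) -> str:
    """Assemble a customer-service persona prompt from structured facts."""
    header = f"You work for {company} which is a {business} and your name is {agent}."
    return header + " Information: " + " ".join(facts)

prompt = service_prompt(
    agent="Ayelen Lucero",
    company="CitySan Services",
    business="waste management company",
    facts=[
        "Verify customer name Omar Torres.",
        "Current schedule: every other week.",
        "Upcoming pickup: April 12th.",
    ],
)
print(prompt)
```

Keeping the facts in a list makes it trivial to swap in a new customer record per call without rewriting the persona.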

Architecture: How It Works

PersonaPlex uses a dual-stream architecture based on Moshi:

┌──────────────────────────────────────────────────┐
│              PersonaPlex (7B)                    │
│                                                  │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐   │
│  │ Text     │    │ Helium    │    │ Audio    │   │
│  │ Prompt   │───▶│ LLM       │───▶│ Codec    │   │
│  │ (role)   │    │ Backbone  │    │ (Mimi)   │   │
│  └──────────┘    │           │    └────┬─────┘   │
│  ┌──────────┐    │  Dual     │         │         │
│  │ Voice    │───▶│  Stream   │    Output Audio   │
│  │ Prompt   │    │  Decoder  │         │         │
│  │ (audio)  │    └─────┬─────┘         │         │
│  └──────────┘          │               │         │
│                   ┌────┴────┐          │         │
│  Input Audio ────▶│ Encoder │          ▼         │
│  (your voice)     └─────────┘     Speaker Out    │
└──────────────────────────────────────────────────┘

Key design choices:

  • Single model — no ASR → LLM → TTS pipeline. Speech in, speech out.
  • Neural codec (Mimi) — encodes audio into tokens the LLM can process.
  • Full-duplex — separate streams for listening and speaking, processed concurrently.
  • Helium backbone — the underlying LLM from Kyutai, giving it strong language understanding.
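The full-duplex point is the one that differs most from turn-based assistants, so here is a conceptual toy, not the real Moshi/PersonaPlex internals: at every timestep the model consumes one frame of user audio and emits one frame of agent audio, so listening and speaking interleave instead of alternating in turns. toy_model is a stand-in with made-up arithmetic.

```python
def toy_model(user_frame: int, state: list[int]) -> int:
    """Stand-in for the model: fold the user frame into state, emit a frame."""
    state.append(user_frame)   # "listen": user audio updates the state
    return sum(state) % 256    # "speak": an agent frame comes out every step

def duplex_loop(user_frames: list[int]) -> list[int]:
    """One user frame in, one agent frame out, at every timestep."""
    state: list[int] = []
    agent_frames = []
    for frame in user_frames:
        agent_frames.append(toy_model(frame, state))
    return agent_frames

print(duplex_loop([10, 20, 30]))  # three frames in, three frames out
```

Because output is produced at every step rather than after end-of-turn detection, barge-ins and overlapping speech fall out of the framing for free.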

Comparison: PersonaPlex vs Alternatives

| Feature | PersonaPlex | OpenAI Voice | ElevenLabs | Moshi (base) |
| --- | --- | --- | --- | --- |
| Full-duplex | ✅ | ✅ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ✅ |
| Persona control | ✅ Text prompt | Limited (system prompt) | ❌ | ❌ |
| Voice cloning | ✅ Audio conditioning | ❌ | ✅ API only | ❌ |
| License | MIT | Proprietary | Proprietary | CC-BY |
| Parameters | 7B | Unknown | N/A | 7B |
| Cost | Free (your GPU) | Per-minute | Per-character | Free |
| Latency | Real-time | Real-time | ~1s | Real-time |

Running on Cloud GPUs

No local GPU? Here’s the fastest path, using RunPod as an example:

  1. Create a pod with the PyTorch 2.x template and an A100 40GB GPU
  2. SSH in and run the install steps above
  3. Forward port 8998: ssh -L 8998:localhost:8998 your-pod
  4. Open https://localhost:8998 in your browser

Google Colab (limited)

Colab’s free T4 (16GB) may work with --cpu-offload, but don’t expect smooth real-time performance. Better for offline processing.

Troubleshooting

“CUDA out of memory” → Add --cpu-offload to your command. This moves some layers to RAM.

Browser says “Not Secure” → Expected; the SSL certs are self-signed. Click “Advanced” → “Proceed.”

No audio output → Check that your browser has microphone permissions. Chrome works best.

Model download fails → Verify you accepted the license at huggingface.co/nvidia/personaplex-7b-v1 and that your HF_TOKEN is set correctly.

Blackwell GPU errors → Install the CUDA 13.0 PyTorch build: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

What to Build With This

PersonaPlex is MIT licensed and commercially ready. Some ideas:

  • AI receptionist — give it your business info, let it answer calls
  • Language tutor — set the persona as a patient teacher, practice conversation
  • Game NPCs — each character gets a unique voice + personality prompt
  • Customer service training — simulate difficult customer scenarios
  • Podcast co-host — set a personality and have it riff on topics in real time
  • Accessibility — voice interfaces for applications that currently require text

The key advantage over API-based solutions: zero marginal cost per conversation. Once you have the GPU, every additional minute of conversation is free.
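The zero-marginal-cost claim is easy to sanity-check with back-of-envelope arithmetic. The GPU rate below is the high end of the rental range cited earlier; the per-minute API price is a hypothetical placeholder, not a quote from any vendor.

```python
GPU_PER_HOUR = 1.50    # high end of the $0.30-1.50/hr rental range above
API_PER_MINUTE = 0.06  # hypothetical per-minute voice API price (assumption)

def breakeven_minutes_per_hour() -> float:
    """Conversation minutes per GPU-hour at which renting beats the API."""
    return GPU_PER_HOUR / API_PER_MINUTE

print(breakeven_minutes_per_hour())  # 25.0
```

Under these assumed prices, a rented GPU wins once it handles more than 25 minutes of conversation per hour, and a GPU you already own wins immediately.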


PersonaPlex is MIT licensed on GitHub. Paper: arXiv:2602.06053. Model: nvidia/personaplex-7b-v1 on HuggingFace.