LuxTTS: Clone Any Voice in 3 Seconds, Run It on a 4GB GPU

By Prahlad Menon

The voice cloning bar just dropped significantly.

LuxTTS clones a voice from 3 seconds of reference audio, runs at 150x realtime speed on a single GPU, and fits entirely within 1GB of VRAM. On CPU — no GPU at all — it still runs faster than realtime. It outputs at 48kHz, double the 24kHz that most TTS models produce.

The model is built on ZipVoice, the checkpoint is small, and the code is MIT licensed.

The Numbers That Matter

Metric                 | LuxTTS        | Typical TTS
-----------------------|---------------|---------------------
VRAM required          | 1GB           | 4–8GB+
Speed (GPU)            | 150x realtime | 10–30x realtime
Speed (CPU)            | >1x realtime  | Slower than realtime
Output sample rate     | 48kHz         | 24kHz
Reference audio needed | 3 seconds     | 5–30 seconds

Running faster than realtime on CPU is the surprising one. It means LuxTTS is genuinely deployable on any machine — no GPU required, no cloud API, no latency from network calls.
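"150x realtime" just means the model produces 150 seconds of audio per second of compute. A trivial sketch of how that figure is computed (the numbers below are illustrative, not benchmarks):

```python
def realtime_factor(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds

# Illustrative numbers only:
print(realtime_factor(10.0, 0.067))  # ~150x: 10s of speech in ~67ms of GPU time
print(realtime_factor(10.0, 8.0))    # 1.25x: still faster than realtime on CPU
```

Anything above 1.0 means generation finishes before the audio would have finished playing, which is the bar for streaming use.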

How It Works

LuxTTS is built on ZipVoice — a lightweight flow-matching TTS architecture designed for efficiency without sacrificing quality. The approach prioritizes:

  • Compact model size — 1GB VRAM is the entire footprint, not a minimum requirement
  • Few-step generation — 3–4 sampling steps hits the sweet spot of quality vs. speed
  • Direct waveform output at 48kHz — no upsampling from a lower-quality intermediate

Compared to approaches like VoxCPM’s continuous latent space modeling, LuxTTS trades some of the architectural sophistication for raw accessibility. The goal is voice cloning that anyone can run locally, not pushing the ceiling of naturalness on high-end hardware.
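To see why a handful of steps can be enough, here is a toy Euler integrator in plain Python. This is not LuxTTS's actual sampler — in the real model a neural network predicts the velocity field over audio latents — but it shows the mechanic: integrate from noise toward data in a few coarse steps, and if the learned paths are close to straight, few steps lose very little.

```python
def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 (noise) toward t=1 (data)
    in a few coarse Euler steps. Fewer steps = faster generation,
    usually at some cost in fidelity."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = [xi + velocity_fn(xi, t) * dt for xi in x]
    return x

# Toy velocity field whose paths are straight lines ending at 1.0;
# with straight paths, even 4 Euler steps are exact.
def toward_one(xi, t):
    return (1.0 - xi) / (1.0 - t)

print(euler_sample(toward_one, [0.0, 0.5, -1.0], num_steps=4))
# → [1.0, 1.0, 1.0]
```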

Running It Locally

Install:

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt

Basic voice clone:

from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Mac (MPS)
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

text = "Whatever you want the cloned voice to say."
prompt_audio = 'your_reference.wav'  # 3+ seconds

encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)

Key parameters:

  • num_steps — 3–4 is the efficiency sweet spot; higher improves quality but slows generation
  • t_shift — higher values can improve naturalness but may hurt word accuracy
  • rms — output volume (0.01 recommended)
  • return_smooth=True — helps with metallic artifacts if you hear them
  • ref_duration — how much of the reference audio to use; set lower to speed up inference

No HuggingFace account required for basic use. Model downloads automatically on first run.
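Since cloning needs at least ~3 seconds of reference audio, it's worth validating the file before spending compute. A small stdlib-only helper — note that check_reference is an illustrative name of mine, not part of the LuxTTS API:

```python
import wave

def check_reference(path, min_seconds=3.0):
    """Return (duration_in_seconds, long_enough) for a WAV clip."""
    with wave.open(path, 'rb') as wf:
        duration = wf.getnframes() / wf.getframerate()
    return duration, duration >= min_seconds

# duration, ok = check_reference('your_reference.wav')
# if not ok:
#     raise ValueError(f"reference too short: {duration:.1f}s, need 3s+")
```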

Try It Without Installing Anything

Hosted demos are available online — free, with no local setup required.

Where It Fits

LuxTTS isn’t trying to be the highest-quality TTS model. It’s trying to be the most accessible one that’s still genuinely good.

The 1GB VRAM ceiling means it runs on hardware that couldn’t touch most voice cloning models. The CPU performance means it’s viable in environments where GPU access isn’t guaranteed — edge devices, cheap VMs, developer laptops. The 3-second reference requirement means you don’t need clean studio audio to get a usable clone.

For voice agents, content creation pipelines, accessibility tools, or any application where you need voice cloning without cloud dependency — LuxTTS is the new default to reach for before deciding you need something heavier.

How it compares: NeuTTS and RCLI

The on-device TTS space has gotten busy. Two other options worth knowing:

NeuTTS (by Neuphonic) — LLM backbone + NeuCodec (50Hz neural audio codec, single codebook). Instant voice cloning from 3 seconds of audio. Available in GGUF Q4/Q8 quantization — runs on phones, Raspberry Pi, and CPU-only machines. Multilingual: English, Spanish, French, German. Apache 2.0, watermarked output. This is the ultra-low-resource option: if you need TTS on a Pi or without any GPU at all, NeuTTS is the pick. LuxTTS needs at least 1GB VRAM or a capable CPU; NeuTTS is designed for a step below that.

RCLI’s MetalRT TTS — part of RCLI’s full voice AI pipeline for Apple Silicon. Not a standalone TTS library, but the TTS component of an end-to-end voice assistant. MetalRT is hand-optimized for Apple’s Metal GPU and runs faster-than-realtime on M3+.

The practical decision tree:

Scenario                                      | Best pick
----------------------------------------------|---------------
GPU available (1GB+ VRAM), need voice cloning | LuxTTS
No GPU, Raspberry Pi or CPU-only              | NeuTTS
Apple Silicon Mac, want full voice pipeline   | RCLI + MetalRT
Need multilingual TTS on-device               | NeuTTS
Need fastest possible generation on CUDA      | LuxTTS
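The same decision tree, encoded as a first-match-wins rule list — pick_tts and its flags are illustrative names for this sketch, not a real API:

```python
def pick_tts(has_cuda_gpu=False, apple_silicon=False,
             need_multilingual=False, want_full_pipeline=False):
    """First-match-wins encoding of the table above."""
    if need_multilingual:
        return "NeuTTS"            # EN/ES/FR/DE on-device
    if apple_silicon and want_full_pipeline:
        return "RCLI + MetalRT"    # Metal-optimized voice stack
    if has_cuda_gpu:
        return "LuxTTS"            # 150x realtime on 1GB VRAM
    return "NeuTTS"                # CPU-only / Raspberry Pi

print(pick_tts(has_cuda_gpu=True))  # → LuxTTS
print(pick_tts())                   # → NeuTTS
```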