ViMax: Open-Source Agentic Video Generation — Director, Writer, and Producer in One
Most AI video tools generate a few seconds of footage from a text prompt. You get a clip. Maybe it looks good. But the character changes between shots, there’s no narrative structure, and stitching anything longer than 10 seconds into a coherent story means hours of manual work.
ViMax takes a fundamentally different approach. Instead of a single model generating clips, it deploys a team of AI agents — director, screenwriter, producer, and video generator — that collaborate to produce multi-shot, narratively coherent video from nothing more than a text idea.
Built by the HKU Data Science Lab, ViMax has hit 3,500+ stars on GitHub and is MIT licensed.
What ViMax Actually Does
ViMax is a multi-agent orchestration framework for video production. You provide an idea (or a script, or a novel), and the agents handle the rest:
- Screenwriter Agent — Generates a structured, multi-scene script from your concept
- Storyboard Agent — Designs shot-level storyboards using cinematography language, establishing narrative rhythm
- Character Designer — Creates and maintains consistent character reference images across all scenes
- Director Agent — Simulates multi-camera filming, managing character positioning and backgrounds
- Producer Agent — Orchestrates the pipeline, handles quality checks, ensures consistency
The result: multi-shot video with consistent characters, coherent scenes, and actual storytelling structure.
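The hand-off between these agents can be pictured as a sequential pipeline. The sketch below is illustrative only: the names (`Shot`, `Production`, `write_script`, `produce`, and so on) are hypothetical stand-ins rather than ViMax's actual API, and each stage returns canned data where the real system would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    scene: int
    description: str
    characters: list[str]


@dataclass
class Production:
    idea: str
    script: list[str] = field(default_factory=list)            # one entry per scene
    shots: list[Shot] = field(default_factory=list)
    references: dict[str, str] = field(default_factory=dict)   # character -> image id
    clips: list[str] = field(default_factory=list)


def write_script(p: Production) -> Production:
    # Screenwriter: a real agent would prompt an LLM; here the scenes are canned.
    p.script = [f"Scene {i}: {p.idea}" for i in (1, 2)]
    return p


def storyboard(p: Production) -> Production:
    # Storyboard: split each scene into shots (two per scene in this toy).
    for i, _scene in enumerate(p.script, start=1):
        p.shots.append(Shot(i, "wide establishing shot", ["cat", "dog"]))
        p.shots.append(Shot(i, "close-up reaction", ["cat"]))
    return p


def design_characters(p: Production) -> Production:
    # Character designer: one shared reference image per distinct character.
    for shot in p.shots:
        for name in shot.characters:
            p.references.setdefault(name, f"ref_{name}.png")
    return p


def direct(p: Production) -> Production:
    # Director: render each shot, conditioning on the shared references.
    for n, shot in enumerate(p.shots):
        refs = [p.references[c] for c in shot.characters]
        p.clips.append(f"clip_{n:02d}[{'+'.join(refs)}]")
    return p


def produce(idea: str) -> Production:
    # Producer: run the stages in order and check that nothing was dropped.
    p = Production(idea)
    for stage in (write_script, storyboard, design_characters, direct):
        p = stage(p)
    assert len(p.clips) == len(p.shots), "every shot needs a clip"
    return p


if __name__ == "__main__":
    result = produce("a cat and a dog meet a new cat")
    print(result.clips)
```

The point of the shared `Production` object is that later stages read what earlier stages wrote, which is how a single set of character references can reach every shot.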
Three Production Modes
Idea2Video
Feed it a concept: “If a cat and a dog are best friends, what would happen when they meet a new cat?” — and ViMax writes the script, designs the characters, storyboards every shot, and generates the video.
Script2Video
Already have a screenplay? Drop it in. ViMax handles the visual production — storyboarding, character consistency, multi-camera simulation, and rendering.
Novel2Video
The most ambitious mode. Feed ViMax an entire novel and it performs intelligent narrative compression, extracts key scenes, tracks characters across chapters, and produces episodic video content.
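To make "tracks characters across chapters" and "extracts key scenes" concrete, here is a deliberately simple sketch. It assumes chapters begin with a line starting with "Chapter" and that the character names are already known. The real system presumably relies on language models rather than string matching, and all function names here are hypothetical.

```python
import re


def split_chapters(novel: str) -> list[str]:
    # Split on heading lines that start with "Chapter"; drop empty pieces.
    parts = re.split(r"(?m)^Chapter\b.*$", novel)
    return [p.strip() for p in parts if p.strip()]


def mentions(text: str, name: str) -> bool:
    # Whole-word match so that "Ann" does not match "Anna".
    return re.search(rf"\b{re.escape(name)}\b", text) is not None


def track_characters(chapters: list[str], names: list[str]) -> dict[str, list[int]]:
    # Map each character to the 1-based chapter numbers that mention them.
    seen: dict[str, list[int]] = {}
    for number, text in enumerate(chapters, start=1):
        for name in names:
            if mentions(text, name):
                seen.setdefault(name, []).append(number)
    return seen


def key_scenes(chapters: list[str], names: list[str], keep: int) -> list[int]:
    # Rank chapters by how many tracked characters appear, keep the top few,
    # then return them in story order.
    scored = [
        (sum(mentions(text, n) for n in names), number)
        for number, text in enumerate(chapters, start=1)
    ]
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    return sorted(number for _, number in top)


if __name__ == "__main__":
    novel = (
        "Chapter 1\nMira finds a map.\n"
        "Chapter 2\nThe weather turns.\n"
        "Chapter 3\nMira and Tobin cross the river.\n"
    )
    chapters = split_chapters(novel)
    print(track_characters(chapters, ["Mira", "Tobin"]))
    print(key_scenes(chapters, ["Mira", "Tobin"], keep=2))
```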
The Architecture: Why It Works
ViMax tackles the consistency problem common to clip-based AI video tools with four techniques:
- RAG-based script engine — Analyzes long-form text and segments it into multi-scene scripts while preserving key plot points and dialogue
- Multi-camera filming simulation — Maintains consistent character positioning and backgrounds within scenes, creating an immersive viewing experience
- Intelligent reference image selection — For each new shot, the system selects appropriate reference images from previous timeline events, ensuring character and environment accuracy as videos get longer
- Automated consistency checking — Every generated image is verified against established references before being used in the final video
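The last two items, reference selection and consistency checking, can be sketched as a scoring function plus a retry loop. This is a minimal illustration under stated assumptions, not ViMax's implementation: the `Reference` fields, the scoring order (shared characters, then location, then recency), and the `generate` and `is_consistent` callables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    shot_index: int          # position on the timeline
    characters: frozenset    # who is visible in this frame
    location: str
    image_id: str


def select_references(timeline, characters, location, limit=3):
    """Pick earlier frames that best match the upcoming shot."""
    wanted = frozenset(characters)

    def score(ref):
        overlap = len(wanted & ref.characters)
        same_place = 1 if ref.location == location else 0
        # Prefer more shared characters, then same location, then recency.
        return (overlap, same_place, ref.shot_index)

    candidates = [
        r for r in timeline
        if (wanted & r.characters) or (r.location == location)
    ]
    return sorted(candidates, key=score, reverse=True)[:limit]


def generate_with_check(generate, is_consistent, references, max_attempts=3):
    """Regenerate until the checker accepts the frame or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        frame = generate(references, attempt)
        if is_consistent(frame, references):
            return frame
    raise RuntimeError(f"no consistent frame after {max_attempts} attempts")


if __name__ == "__main__":
    timeline = [
        Reference(0, frozenset({"cat"}), "garden", "a"),
        Reference(1, frozenset({"dog"}), "kitchen", "b"),
        Reference(2, frozenset({"cat", "dog"}), "garden", "c"),
    ]
    picked = select_references(timeline, ["cat", "dog"], "garden", limit=2)
    print([r.image_id for r in picked])
```

In a real pipeline `is_consistent` would likely be a vision-model comparison against the references; here it is left as a plain callable so the control flow stays visible.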
What Powers It (Not What You’d Expect)
Here’s the important nuance: ViMax is not a video model. It’s an orchestration layer that coordinates existing AI services:
| Component | Default Provider | Purpose |
|---|---|---|
| Chat/Planning | Gemini 2.5 Flash Lite (via OpenRouter) | Script generation, storyboarding, agent reasoning |
| Image Generation | Google Imagen (Nanobanana API) | Character designs, storyboard frames |
| Video Generation | Google Veo API | Final video rendering |
You can also swap in MiniMax models (M2.7 with 1M token context) as an alternative chat provider, and the architecture is flexible enough to support other backends.
Cost reality: The code is free (MIT). The API calls are not. Google Veo is rate-limited to ~10 requests/day on free tiers. For production use, you’ll need an API budget, but that is on the order of $1-5 per video in API costs, not $100/month in subscriptions to Midjourney + HeyGen + Runway.
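The free-tier cap translates directly into wall-clock time. As a rough illustration, the helper below estimates how many days a video would take at ~10 requests/day. It assumes one Veo request per shot plus an optional average number of regenerations; both are assumptions rather than documented behavior, and `days_on_free_tier` is a hypothetical name.

```python
import math


def days_on_free_tier(shots: int, retries_per_shot: float = 0.0, daily_cap: int = 10) -> int:
    """Days needed to render `shots` clips under a daily request cap."""
    if shots < 0 or retries_per_shot < 0 or daily_cap <= 0:
        raise ValueError("shots and retries must be >= 0 and daily_cap > 0")
    # Total requests = first attempts plus the expected regenerations.
    requests = math.ceil(shots * (1 + retries_per_shot))
    return math.ceil(requests / daily_cap)


if __name__ == "__main__":
    print(days_on_free_tier(24))                        # 24 requests at 10 per day
    print(days_on_free_tier(24, retries_per_shot=0.5))  # 36 requests at 10 per day
```

Under these assumptions a 24-shot video takes three days on the free tier, or four if half the shots need one regeneration.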
Quick Start
ViMax uses uv for dependency management. Setup is minimal:
    git clone https://github.com/HKUDS/ViMax.git
    cd ViMax
    uv sync
Configure your API keys in configs/idea2video.yaml:
    chat_model:
      init_args:
        model: google/gemini-2.5-flash-lite-preview-09-2025
        model_provider: openai
        api_key: <YOUR_KEY>
        base_url: https://openrouter.ai/api/v1
    image_generator:
      class_path: tools.ImageGeneratorNanobananaGoogleAPI
      init_args:
        api_key: <YOUR_KEY>
    video_generator:
      class_path: tools.VideoGeneratorVeoGoogleAPI
      init_args:
        api_key: <YOUR_KEY>
Then edit your idea in main_idea2video.py and run it. The agents take over from there.
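The `class_path` and `init_args` keys follow a common configuration pattern: a dotted path names a class, and the arguments are passed to its constructor. How ViMax resolves them internally is not shown here, so the following is only a sketch of the general pattern. `build_from_config` and `missing_keys` are hypothetical helpers, and a standard-library class stands in for `tools.VideoGeneratorVeoGoogleAPI` so the example runs without ViMax installed.

```python
import importlib
from typing import Any


def build_from_config(section: dict[str, Any]) -> Any:
    """Instantiate the class named by class_path, passing init_args as keywords."""
    module_name, _, class_name = section["class_path"].rpartition(".")
    if not module_name:
        raise ValueError("class_path must look like 'package.ClassName'")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**section.get("init_args", {}))


def missing_keys(config: dict[str, Any]) -> list[str]:
    """Names of sections whose api_key is empty or still the <YOUR_KEY> placeholder."""
    return [
        name
        for name, section in config.items()
        if section.get("init_args", {}).get("api_key", "") in ("", "<YOUR_KEY>")
    ]


if __name__ == "__main__":
    # Stand-in class from the standard library (not a ViMax class).
    demo = {
        "class_path": "fractions.Fraction",
        "init_args": {"numerator": 3, "denominator": 4},
    }
    print(build_from_config(demo))

    config = {
        "chat_model": {"init_args": {"api_key": "<YOUR_KEY>"}},
        "video_generator": {"init_args": {"api_key": "example-key"}},
    }
    print(missing_keys(config))
```

A check like `missing_keys` is worth running before a long job, since a forgotten placeholder would otherwise only surface when the first API call fails.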
Who Should Use This
Content creators tired of paying $100+/month across multiple AI video tools. ViMax consolidates the entire pipeline.
Educators who need to produce lecture-style video content at scale. Feed it your curriculum outline; get structured video lectures with consistent visual style.
Developers building video-generation features into products. MIT license means you can integrate and modify freely.
Storytellers who want to prototype visual narratives before committing to full production.
How It Compares
| Tool | What You Get | Cost | Consistency |
|---|---|---|---|
| Runway Gen-4 | Single clips, manual stitching | $15-76/mo | Per-clip only |
| HeyGen | Avatar-based talking head videos | $24-180/mo | Good for avatars |
| Midjourney + manual | Images, then separate video tools | $10-60/mo + video tool | Manual effort |
| ViMax | End-to-end multi-shot video with narrative | API costs (~$1-5/video) | Built-in consistency engine |
The key difference: of the tools compared here, ViMax is the only one that takes an idea and returns an assembled story. The others produce clips or images that you assemble yourself.
Limitations to Know
- API dependency — You need Google/OpenRouter API keys. This isn’t running locally on your GPU.
- Rate limits — Veo API caps at ~10 video requests/day on free tiers. Production use needs paid quotas.
- Quality ceiling — Output quality is bounded by Veo/Imagen. It’s good, but it’s not Hollywood.
- Early project — The codebase is young. Expect rough edges, especially with Novel2Video on very long texts.
The Bottom Line
ViMax represents a shift in how AI video generation works. Instead of throwing a bigger model at the problem, it throws better coordination at it — multiple specialized agents that plan, design, check, and produce together.
The result is one of the first open-source tools that can take a text idea and produce a multi-shot, narratively coherent video with consistent characters. No subscriptions. No hard vendor lock-in, since the model providers can be swapped. Just agents doing what agents do best: breaking a complex task into manageable pieces and solving them systematically.
Links:
- GitHub Repository: https://github.com/HKUDS/ViMax
- Demo Videos
- License: MIT