Qwen3-Coder-Next: 3B Active Params, Beats Models 20x Its Size
The claim sounds like marketing: 3 billion active parameters beating models with 60 billion. But Qwen3-Coder-Next has 1.2 million downloads on HuggingFace and the benchmarks hold up.
Qwen3-Coder-Next is an 80B Mixture of Experts model where only 3B parameters activate per token. It was built specifically for agentic coding — not chat, not reasoning in the abstract, but the actual work of reading a codebase, calling tools, writing fixes, and recovering when something breaks.
Alongside it: Qwen Code — an open-source terminal coding agent that just crossed 20,900 GitHub stars.
Why 3B Active Params Can Beat 30-60B Models
The architecture explanation starts with MoE, but that’s not the whole story.
Standard transformers activate every parameter for every token. MoE models route each token to a subset of “expert” networks — Qwen3-Coder-Next has 512 experts and activates 10 per token (plus 1 shared expert). So the 80B parameter count is capacity; the 3B is the actual compute cost per inference step.
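The routing step is easy to picture with a toy sketch. This is a generic top-k gating router in plain Python, not Qwen's actual gating code; the expert count (512) and top-k (10) match the spec, but the scores are random for illustration:

```python
import math
import random

def route_token(gate_scores, top_k=10):
    """Pick the top_k experts for one token and softmax-normalize their gate scores."""
    top = sorted(range(len(gate_scores)), key=lambda e: -gate_scores[e])[:top_k]
    m = max(gate_scores[e] for e in top)
    exps = [math.exp(gate_scores[e] - m) for e in top]
    z = sum(exps)
    return [(e, x / z) for e, x in zip(top, exps)]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(512)]  # one token's gate logits, 512 experts
chosen = route_token(scores)
print(len(chosen))  # 10 -- the other 502 experts do no compute for this token
```

Each token ends up with a weighted combination of just 10 expert outputs (plus the always-on shared expert), which is where the 3B-of-80B active figure comes from.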
But MoE alone doesn’t explain the benchmark results. The second part is the hybrid attention architecture — the same Gated DeltaNet design that Qwen3.5 introduced. Three out of every four layers use Gated DeltaNet (linear attention — scales linearly with sequence length). The fourth uses standard Gated Attention. At 256K context, this is dramatically cheaper than full quadratic attention on every layer.
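A back-of-envelope count of attention token pairs shows why the 3:1 hybrid matters at long context. This ignores constant factors and only compares scaling (48 layers, 256K tokens, per the spec sheet):

```python
# Rough attention-cost comparison at 256K context (token-pair counts, not exact FLOPs).
seq = 262_144                 # 256K tokens
layers = 48                   # 12 blocks of (3 DeltaNet + 1 full-attention) layers

full_everywhere = layers * seq * seq                          # quadratic attention in every layer
hybrid = (layers // 4) * seq * seq + (3 * layers // 4) * seq  # 1-in-4 quadratic, rest linear

print(f"hybrid / full = {hybrid / full_everywhere:.3f}")  # 0.250
```

At this sequence length the linear layers are effectively free, so the hybrid pays roughly a quarter of the all-quadratic attention cost.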
The third part is specialization. Qwen3-Coder-Next was trained specifically for:
- Long-horizon reasoning — multi-step coding tasks where the model needs to track state across many actions
- Complex tool usage — calling tools correctly, chaining them, handling errors
- Recovery from execution failures — when a tool call fails or produces unexpected output, adjusting rather than giving up
These are exactly the capabilities SWE-Bench-Pro tests — not “complete this function” but “fix this real GitHub issue,” which requires all three.
The Architecture Numbers
- Total parameters: 80B
- Active per token: 3B (10 routed experts + 1 shared)
- Experts: 512 total
- Context: 262,144 tokens (256K) natively
- Layer pattern: 12 × (3 × (DeltaNet→MoE) + 1 × (Attention→MoE)) = 48 layers
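The spec sheet's arithmetic is quick to sanity-check:

```python
# Sanity-check the layer pattern and active-parameter ratio.
blocks = 12
layers_per_block = 3 + 1            # 3 DeltaNet->MoE layers + 1 Attention->MoE layer
total_layers = blocks * layers_per_block
print(total_layers)                 # 48

active_fraction = 3 / 80            # 3B active parameters out of 80B total
print(f"{active_fraction * 100:.2f}% of weights touched per token")  # 3.75%
```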
No thinking mode — the model generates code directly without <think> blocks. For agentic coding this is usually right: you want fast tool calls, not deliberation before every file read.
Qwen Code CLI
The model is open-weight, but the more immediately useful thing is the CLI.
# Linux/macOS
bash -c "$(curl -fsSL https://qwen-code-assets.oss-cn-hangzhou.aliyuncs.com/installation/install-qwen.sh)"
# macOS (Homebrew)
brew install qwen-code
# npm
npm install -g @qwen-code/qwen-code@latest
Then just run qwen. First launch asks you to authenticate.
Qwen OAuth (free): Sign into qwen.ai in a browser, get 1,000 requests/day at no cost. This is the quickest path to try it — no API key, no billing setup.
API key: Connect to any compatible provider. The settings file (~/.qwen/settings.json) lets you point it at Anthropic, OpenAI, Gemini, or any OpenAI-compatible endpoint — including local vLLM or SGLang serving Qwen3-Coder-Next:
{
  "modelProviders": {
    "openai": [{
      "id": "qwen3-coder-next-local",
      "name": "qwen3-coder-next-local",
      "baseUrl": "http://localhost:30000/v1",
      "description": "Local Qwen3-Coder-Next via SGLang"
    }]
  },
  "model": { "name": "qwen3-coder-next-local" }
}
The CLI has Skills (equivalent to Claude Code’s tool system) and SubAgents for parallel task execution. IDE integrations ship for VS Code, Zed, and JetBrains.
Running Locally
For self-hosted inference, two options:
SGLang (recommended for throughput):
python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --port 30000 \
  --tp-size 2 \
  --tool-call-parser qwen3_coder
vLLM:
vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
For local desktop use: Ollama and LM Studio have quantized versions. The 3B active parameter count means inference is fast even on quantized builds — much closer to running a 3B model than an 80B one.
Where It Fits
The “Claude Code killer” framing is clickbait but not wrong in one specific sense: Qwen Code + Qwen3-Coder-Next is now the most capable fully open-source coding agent stack. Apache 2.0 model, open-source CLI, free hosted tier, local inference supported.
That doesn’t make it better than Claude Code for every workflow — Claude’s model quality and Anthropic’s safety work are real advantages. But for developers who want a locally-runnable, auditable, no-API-cost coding agent, there’s now a serious option.
The 20,900 GitHub stars suggest the community has already noticed.
→ Qwen3-Coder-Next on HuggingFace
→ Qwen Code CLI on GitHub
Related: Qwen3.5 — the Gated DeltaNet + MoE architecture explained · Context rot in AI coding agents — and how to fix it · Understand-Anything — knowledge graphs for large codebases