Qwen3-Coder-Next: 3B Active Params, Beats Models 20x Its Size
The claim sounds like marketing: 3 billion active parameters beating models with 60 billion. But Qwen3-Coder-Next has 1.2 million downloads on HuggingFace and the benchmarks hold up.
Qwen3-Coder-Next is an 80B Mixture of Experts model where only 3B parameters activate per token. It was built specifically for agentic coding — not chat, not reasoning in the abstract, but the actual work of reading a codebase, calling tools, writing fixes, and recovering when something breaks.
Alongside it: Qwen Code — an open-source terminal coding agent that just crossed 20,900 GitHub stars.
Why 3B Active Params Can Beat 30-60B Models
The architecture explanation starts with MoE, but that’s not the whole story.
Standard transformers activate every parameter for every token. MoE models route each token to a subset of “expert” networks — Qwen3-Coder-Next has 512 experts and activates 10 per token (plus 1 shared expert). So the 80B parameter count is capacity; the 3B is the actual compute cost per inference step.
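The routing step is easy to picture with a toy sketch. This is a generic top-k gating router in plain Python, not Qwen's actual gating code; the expert count (512) and top-k (10) match the spec, but the scores are random for illustration:

```python
import math
import random

def route_token(gate_scores, top_k=10):
    """Pick the top_k experts for one token and softmax-normalize their gate scores."""
    top = sorted(range(len(gate_scores)), key=lambda e: -gate_scores[e])[:top_k]
    m = max(gate_scores[e] for e in top)
    exps = [math.exp(gate_scores[e] - m) for e in top]
    z = sum(exps)
    return [(e, x / z) for e, x in zip(top, exps)]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(512)]  # one token's gate logits, 512 experts
chosen = route_token(scores)
print(len(chosen))  # 10 -- the other 502 experts do no compute for this token
```

Each token ends up with a weighted combination of just 10 expert outputs (plus the always-on shared expert), which is where the 3B-of-80B active figure comes from.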
But MoE alone doesn’t explain the benchmark results. The second part is the hybrid attention architecture — the same Gated DeltaNet design that Qwen3.5 introduced. Three out of every four layers use Gated DeltaNet (linear attention — scales linearly with sequence length). The fourth uses standard Gated Attention. At 256K context, this is dramatically cheaper than full quadratic attention on every layer.
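A back-of-envelope count of attention token pairs shows why the 3:1 hybrid matters at long context. This ignores constant factors and only compares scaling (48 layers, 256K tokens, per the spec sheet):

```python
# Rough attention-cost comparison at 256K context (token-pair counts, not exact FLOPs).
seq = 262_144                 # 256K tokens
layers = 48                   # 12 blocks of (3 DeltaNet + 1 full-attention) layers

full_everywhere = layers * seq * seq                          # quadratic attention in every layer
hybrid = (layers // 4) * seq * seq + (3 * layers // 4) * seq  # 1-in-4 quadratic, rest linear

print(f"hybrid / full = {hybrid / full_everywhere:.3f}")  # 0.250
```

At this sequence length the linear layers are effectively free, so the hybrid pays roughly a quarter of the all-quadratic attention cost.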
The third part is specialization. Qwen3-Coder-Next was trained specifically for:
- Long-horizon reasoning — multi-step coding tasks where the model needs to track state across many actions
- Complex tool usage — calling tools correctly, chaining them, handling errors
- Recovery from execution failures — when a tool call fails or produces unexpected output, adjusting rather than giving up
These are exactly the capabilities SWE-Bench-Pro tests — not “complete this function” but “fix this real GitHub issue,” which requires all three.
The Architecture Numbers
- Total parameters: 80B
- Active per token: 3B (10 routed experts + 1 shared)
- Experts: 512 total
- Context: 262,144 tokens (256K) natively
- Layer pattern: 12 × (3 × (DeltaNet→MoE) + 1 × (Attention→MoE)) = 48 layers
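The spec sheet's arithmetic is quick to sanity-check:

```python
# Sanity-check the layer pattern and active-parameter ratio.
blocks = 12
layers_per_block = 3 + 1            # 3 DeltaNet->MoE layers + 1 Attention->MoE layer
total_layers = blocks * layers_per_block
print(total_layers)                 # 48

active_fraction = 3 / 80            # 3B active parameters out of 80B total
print(f"{active_fraction * 100:.2f}% of weights touched per token")  # 3.75%
```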
No thinking mode — the model generates code directly without <think> blocks. For agentic coding this is usually right: you want fast tool calls, not deliberation before every file read.
Qwen Code CLI
The model is open-weight, but the more immediately useful thing is the CLI.
# Linux/macOS
bash -c "$(curl -fsSL https://qwen-code-assets.oss-cn-hangzhou.aliyuncs.com/installation/install-qwen.sh)"
# macOS (Homebrew)
brew install qwen-code
# npm
npm install -g @qwen-code/qwen-code@latest
Then just run qwen. First launch asks you to authenticate.
Qwen OAuth (free): Sign into qwen.ai in a browser, get 1,000 requests/day at no cost. This is the quickest path to try it — no API key, no billing setup.
API key: Connect to any compatible provider. The settings file (~/.qwen/settings.json) lets you point it at Anthropic, OpenAI, Gemini, or any OpenAI-compatible endpoint — including local vLLM or SGLang serving Qwen3-Coder-Next:
{
  "modelProviders": {
    "openai": [{
      "id": "qwen3-coder-next-local",
      "name": "qwen3-coder-next-local",
      "baseUrl": "http://localhost:30000/v1",
      "description": "Local Qwen3-Coder-Next via SGLang"
    }]
  },
  "model": { "name": "qwen3-coder-next-local" }
}
The CLI has Skills (equivalent to Claude Code’s tool system) and SubAgents for parallel task execution. IDE integrations ship for VS Code, Zed, and JetBrains.
Running Locally
For self-hosted inference, two options:
SGLang (recommended for throughput):
python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --port 30000 \
  --tp-size 2 \
  --tool-call-parser qwen3_coder
vLLM:
vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
For local desktop use: Ollama and LM Studio have quantized versions. The 3B active parameter count means inference is fast even on quantized builds — much closer to running a 3B model than an 80B one.
Where It Fits
The “Claude Code killer” framing is clickbait but not wrong in one specific sense: Qwen Code + Qwen3-Coder-Next is now the most capable fully open-source coding agent stack. Apache 2.0 model, open-source CLI, free hosted tier, local inference supported.
That doesn’t make it better than Claude Code for every workflow — Claude’s model quality and Anthropic’s safety work are real advantages. But for developers who want a locally-runnable, auditable, no-API-cost coding agent, there’s now a serious option.
The 20,900 GitHub stars suggest the community has already noticed.
→ Qwen3-Coder-Next on HuggingFace
→ Qwen Code CLI on GitHub
Related: Qwen3.5 — the Gated DeltaNet + MoE architecture explained · Context rot in AI coding agents — and how to fix it · Understand-Anything — knowledge graphs for large codebases