Mamba-3: The First SSM Built for the Inference Age
For the past two years, the dominant design philosophy in state space models has been: make training faster. Mamba-2 took this to its logical conclusion — simplifying the underlying recurrence to deliver 2–8x faster training than Mamba-1. It worked. Most architectures moved to Mamba-2.
But the AI landscape has quietly shifted. The bottleneck isn’t training anymore — it’s inference.
Post-training methods like RLVR generate massive rollouts. Agentic workflows hammer inference endpoints around the clock. The question researchers at CMU, Princeton, Cartesia AI, and Together AI started asking wasn’t how do we train faster? It was:
What would an SSM look like if inference efficiency was the primary goal from day one?
The answer is Mamba-3.
SSMs vs. Transformers: The Core Tradeoff
Before getting into the upgrades, it helps to understand where SSMs sit in the model landscape.
Transformers (GPT, Llama, Claude) store all past context in a KV cache that grows linearly with sequence length. That’s what makes them great at exact retrieval — they can look up anything in context. But it’s also why long-context inference gets expensive fast: more tokens = more memory = more latency.
State Space Models like Mamba compress past context into a fixed-size state and process each new token in O(1). The state doesn’t grow. That makes long-context generation dramatically cheaper — but the compression is lossy. Exact lookups (needle-in-a-haystack) are harder.
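The fixed-state idea can be seen in a few lines. The following is a toy linear-SSM decode loop, not the Mamba-3 kernels: the point is only that the state `h` has the same shape after 1,000 tokens as after one, so each step costs constant time and memory.

```python
import numpy as np

# Toy linear-SSM decode loop (illustrative only, not the Mamba-3 kernels).
# The state h has a fixed shape no matter how many tokens have been seen,
# so every step costs the same; contrast with a KV cache that grows by one
# entry per token.
d_model, d_state = 8, 16
rng = np.random.default_rng(0)

A = 0.9 * np.eye(d_state)                 # state transition (toy: decayed identity)
B = 0.1 * rng.standard_normal((d_state, d_model))
C = 0.1 * rng.standard_normal((d_model, d_state))

h = np.zeros(d_state)                     # fixed-size state: never grows
for t in range(1000):                     # process 1,000 tokens...
    x_t = rng.standard_normal(d_model)
    h = A @ h + B @ x_t                   # lossy-compress the token into the state
    y_t = C @ h                           # output read from the compressed state

assert h.shape == (d_state,)              # ...state is still the same size
```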
Diffusion LLMs like Mercury 2 and LLaDA 2.1 take a different approach entirely, generating through parallel denoising instead of sequential decoding, which is fast but changes how variable-length outputs are produced.
JEPA-style models like what Yann LeCun is building at AMI Labs skip token prediction altogether in favor of learning latent world representations. A different game entirely.
Mamba-3 is squarely in the SSM camp — but it’s the most inference-optimized version of that idea built so far.
The Problem Mamba-2 Left Behind
Mamba-2 simplified the recurrence so aggressively to win on training benchmarks that decoding became memory-bound: the GPU spends most of its time moving state in and out of memory rather than computing, leaving tensor cores idle and throughput on the table.
Mamba-3 attacks this with three levers, all rooted in classical control theory rather than the linear attention / test-time training interpretations used by most modern alternatives.
Three Core Upgrades
1. More Expressive Recurrence
Mamba-2 reduced the state transition matrix to a scalar times the identity. Fast to train, but too simple: during decode, the GPU moves memory more than it computes.
Mamba-3 replaces this with an exponential-trapezoidal discretization scheme — a general recurrence derived from classical SSM literature. More expressive dynamics mean the fixed state does more work per token, giving the compute units something to actually chew on.
Side effect: this recurrence implicitly applies a convolution-like operation on the input, eliminating the need for the short causal convolution that has been bolted onto every Mamba variant since Mamba-1. Simpler architecture, same or better performance.
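A scalar sketch shows where that side effect comes from. This is not the paper's exact exponential-trapezoidal scheme (which has data-dependent parameters); it is the plain trapezoidal (bilinear) rule applied to `dh/dt = a*h + b*x`, which already mixes adjacent inputs `x[t-1]` and `x[t]` in every update, the convolution-like effect that makes the extra short causal convolution redundant.

```python
import numpy as np

# Scalar sketch: forward-Euler vs. trapezoidal discretization of the ODE
#   dh/dt = a*h + b*x
# Trapezoidal update (derived by averaging the derivative at t-1 and t):
#   h[t]*(1 - dt*a/2) = h[t-1]*(1 + dt*a/2) + (dt/2)*b*(x[t-1] + x[t])
# Note the (x[t-1] + x[t]) term: the rule implicitly convolves the input
# with a short causal filter. (Illustrative only, not Mamba-3's scheme.)
a, b, dt = -1.0, 1.0, 0.1
x = np.sin(np.linspace(0, 4, 64))          # toy input signal

h_euler, h_trap = 0.0, 0.0
for t in range(1, len(x)):
    # Forward Euler: uses only the current input.
    h_euler = (1 + dt * a) * h_euler + dt * b * x[t]
    # Trapezoidal: averages the input at t-1 and t.
    denom = 1 - dt * a / 2
    h_trap = ((1 + dt * a / 2) / denom) * h_trap \
             + (dt / (2 * denom)) * b * (x[t - 1] + x[t])
```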
2. Complex-Valued State Tracking
Mamba-3 models a complex-valued SSM system. Complex numbers can represent rotations and oscillations that real numbers cannot, expanding the model’s ability to track long-range dependencies within the fixed state.
In practice, this is implemented via data-dependent RoPE embeddings — reusing existing transformer infrastructure rather than rebuilding complex arithmetic kernels from scratch.
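The equivalence being exploited is elementary: rotating a 2-D pair of real numbers is the same as multiplying a complex number by a unit phase `e^{i*theta}`. If the angle is made data-dependent per token (as RoPE machinery already supports), a real-valued implementation can emulate a complex-valued state. A minimal check of that identity, with made-up angles:

```python
import numpy as np

# Identity behind the trick: a 2-D rotation of (h0, h1) equals multiplying
# the complex number h0 + i*h1 by e^{i*theta}. Data-dependent RoPE varies
# theta per token, letting real arithmetic emulate a complex-valued SSM
# state. (Illustrative only; angles below are made up.)
def rotate_pair(h, theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * h[0] - s * h[1], s * h[0] + c * h[1]])

h_real = np.array([1.0, 0.0])      # real 2-D state pair
h_cplx = 1.0 + 0.0j                # same state as one complex number
for theta in [0.3, -0.1, 0.7]:     # pretend these are token-dependent angles
    h_real = rotate_pair(h_real, theta)
    h_cplx *= np.exp(1j * theta)

assert np.allclose(h_real, [h_cplx.real, h_cplx.imag])
```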
3. MIMO: Parallel SSMs, Accuracy for Free
Standard SSMs are SISO — single-input, single-output. Mamba-3’s MIMO variant runs multiple SSMs in parallel within each layer.
The result: +1 percentage point accuracy at 1B scale, better retrieval, essentially no decode latency penalty. The key insight: decoding is memory-bound, not compute-bound. The extra parallel streams use GPU cores that were sitting cold anyway. Training time goes up — you can’t hide those FLOPs during the forward pass — but inference cost stays flat.
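In shapes, the difference looks like this. A hypothetical sketch (dimensions and `r=4` chosen for illustration): MIMO updates `r` states from the same token in one batched operation, so the extra arithmetic rides along with the single memory pass over the state, which is why memory-bound decode latency barely moves.

```python
import numpy as np

# Shape-level sketch of SISO vs. MIMO state input-projections (hypothetical
# dimensions; r=4 parallel streams as in the MIMO variant described above).
d_model, d_state, r = 8, 16, 4
rng = np.random.default_rng(1)
x_t = rng.standard_normal(d_model)

# SISO: one state update per layer.
B = rng.standard_normal((d_state, d_model))
h = B @ x_t                                   # shape (d_state,)

# MIMO: r state updates from the same token, batched into one operation.
# The extra FLOPs land on compute units that decode leaves idle, so the
# memory-bound decode latency is essentially unchanged.
B_mimo = rng.standard_normal((r, d_state, d_model))
H = np.einsum('rsd,d->rs', B_mimo, x_t)       # shape (r, d_state)

assert h.shape == (d_state,)
assert H.shape == (r, d_state)
```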
The Numbers
At 1.5B scale on a single H100-SXM, prefill+decode latency (seconds):
| Model | n=512 | n=1024 | n=4096 | n=16,384 |
|---|---|---|---|---|
| Llama-3.2-1B (vLLM) | 4.45 | 9.60 | 58.64 | 976.50 |
| Gated DeltaNet | 4.56 | 9.11 | 36.41 | 145.87 |
| Mamba-2 | 4.66 | 9.32 | 37.22 | 149.02 |
| Mamba-3 SISO | 4.39 | 8.78 | 35.11 | 140.61 |
| Mamba-3 MIMO | 4.74 | 9.48 | 37.85 | 151.81 |
Mamba-3 SISO is the fastest model at every sequence length — including Llama on its best-optimized serving stack. At 16K tokens, the Transformer takes 976 seconds for the same batch that Mamba-3 handles in 140.
How to Use It
The Mamba-3 kernels are fully open-source, released by Together AI. Here’s how to get started:
If you’re replacing Mamba-2 in an existing pipeline: Use the SISO variant. It matches Mamba-2 exactly in architecture shapes (model dimensions, state size) and is a direct drop-in — just faster.
If you want better accuracy with similar decode speed: Use MIMO (r=4). Training will take longer, but inference latency is comparable to Mamba-2 while accuracy beats it by >1 point.
If you’re building hybrid models: The team predicts SSM layers interleaved with occasional self-attention will dominate going forward. Mamba-3 is the best SSM layer for that hybrid — expressive enough to handle context compression well, fast enough to justify using it over pure attention.
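The interleaving pattern can be sketched abstractly. Everything here is hypothetical, not the released API: the layer names are placeholder strings and the 1-in-6 attention ratio is an arbitrary example, chosen only to show the shape of an SSM-heavy hybrid stack.

```python
# Hypothetical hybrid-stack sketch. "mamba3" / "attention" are placeholder
# labels, not real module names, and the 1-in-6 ratio is an arbitrary
# example: mostly cheap fixed-state SSM layers, with occasional
# self-attention layers for exact retrieval over the full context.
n_layers, attn_every = 24, 6

layers = [
    "attention" if (i + 1) % attn_every == 0 else "mamba3"
    for i in range(n_layers)
]

assert layers.count("attention") == n_layers // attn_every   # 4 of 24
assert layers.count("mamba3") == n_layers - n_layers // attn_every
```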
Kernel stack for custom implementations:
- Triton — SISO prefill, controlled tiling and kernel fusion, Hopper TMA support
- TileLang — MIMO prefill, explicit memory hierarchy control
- CuTe DSL — decode kernels, clean implementation, very fast
For a broader look at how these architecture choices compare across the model zoo, Sebastian Raschka’s LLM Architecture Gallery is the best single reference.
The Honest Tradeoff
Mamba-3 still can’t match Transformers on exact retrieval. Fixed state means compression, and compression is lossy. The team is explicit: hybrid architectures — SSM layers interleaved with self-attention — will be the dominant pattern. SSM handles context compression; attention handles the exact lookups.
This connects to a broader divergence in how researchers think about AI memory and representation. LeCun’s JEPA work argues that next-token prediction itself is the wrong objective for world modeling. Mamba-3 doesn’t weigh in on that debate — it’s purely about doing the current paradigm as efficiently as possible. But the systems-level argument holds regardless of the architecture: inference is now the bottleneck, and every design choice should reflect that.
Mamba-3 does.
Full paper and open-source kernels via Together AI’s blog. Research collaboration between CMU, Princeton University, Cartesia AI, and Together AI.