Mamba-3: The First SSM Built for the Inference Age
For the past two years, the dominant design philosophy in state space models has been: make training faster. Mamba-2 took this to its logical conclusion — simplifying the underlying recurrence to deliver 2–8x faster training than Mamba-1. It worked. Most architectures moved to Mamba-2.
But the AI landscape has quietly shifted. The bottleneck isn’t training anymore — it’s inference.
Post-training methods like RLVR generate massive rollouts. Agentic workflows hammer inference endpoints around the clock. The question researchers at CMU, Princeton, Cartesia AI, and Together AI started asking wasn’t how do we train faster? It was:
What would an SSM look like if inference efficiency was the primary goal from day one?
The answer is Mamba-3.
SSMs vs. Transformers: The Core Tradeoff
Before getting into the upgrades, it helps to understand where SSMs sit in the model landscape.
Transformers (GPT, Llama, Claude) store all past context in a KV cache that grows linearly with sequence length. That’s what makes them great at exact retrieval — they can look up anything in context. But it’s also why long-context inference gets expensive fast: more tokens = more memory = more latency.
State Space Models like Mamba compress past context into a fixed-size state and process each new token in O(1). The state doesn’t grow. That makes long-context generation dramatically cheaper — but the compression is lossy. Exact lookups (needle-in-a-haystack) are harder.
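The fixed-state idea can be seen in a few lines. The following is a toy linear-SSM decode loop, not the Mamba-3 kernels: the point is only that the state `h` has the same shape after 1,000 tokens as after one, so each step costs constant time and memory.

```python
import numpy as np

# Toy linear-SSM decode loop (illustrative only, not the Mamba-3 kernels).
# The state h has a fixed shape no matter how many tokens have been seen,
# so every step costs the same; contrast with a KV cache that grows by one
# entry per token.
d_model, d_state = 8, 16
rng = np.random.default_rng(0)

A = 0.9 * np.eye(d_state)                 # state transition (toy: decayed identity)
B = 0.1 * rng.standard_normal((d_state, d_model))
C = 0.1 * rng.standard_normal((d_model, d_state))

h = np.zeros(d_state)                     # fixed-size state: never grows
for t in range(1000):                     # process 1,000 tokens...
    x_t = rng.standard_normal(d_model)
    h = A @ h + B @ x_t                   # lossy-compress the token into the state
    y_t = C @ h                           # output read from the compressed state

assert h.shape == (d_state,)              # ...state is still the same size
```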
Diffusion LLMs like Mercury 2 and LLaDA 2.1 take a different approach entirely, generating through parallel denoising instead of sequential decoding, which is fast but changes how variable-length outputs are produced.
JEPA-style models like what Yann LeCun is building at AMI Labs skip token prediction altogether in favor of learning latent world representations. A different game entirely.
Mamba-3 is squarely in the SSM camp — but it’s the most inference-optimized version of that idea built so far.
The Problem Mamba-2 Left Behind
Mamba-2 simplified the recurrence so aggressively to win on training benchmarks that decoding became memory-bound: the GPU spends most of its time moving state in and out of memory rather than computing, leaving tensor cores idle and throughput on the table.
Mamba-3 attacks this with three levers, all rooted in classical control theory rather than the linear attention / test-time training interpretations used by most modern alternatives.
Three Core Upgrades
1. More Expressive Recurrence
Mamba-2 reduced the state transition matrix to a scalar times the identity. Fast to train, but too simple: during decode, the GPU moves memory more than it computes.
Mamba-3 replaces this with an exponential-trapezoidal discretization scheme — a general recurrence derived from classical SSM literature. More expressive dynamics mean the fixed state does more work per token, giving the compute units something to actually chew on.
Side effect: this recurrence implicitly applies a convolution-like operation on the input, eliminating the need for the short causal convolution that has been bolted onto every Mamba variant since Mamba-1. Simpler architecture, same or better performance.
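A scalar sketch shows where that side effect comes from. This is not the paper's exact exponential-trapezoidal scheme (which has data-dependent parameters); it is the plain trapezoidal (bilinear) rule applied to `dh/dt = a*h + b*x`, which already mixes adjacent inputs `x[t-1]` and `x[t]` in every update, the convolution-like effect that makes the extra short causal convolution redundant.

```python
import numpy as np

# Scalar sketch: forward-Euler vs. trapezoidal discretization of the ODE
#   dh/dt = a*h + b*x
# Trapezoidal update (derived by averaging the derivative at t-1 and t):
#   h[t]*(1 - dt*a/2) = h[t-1]*(1 + dt*a/2) + (dt/2)*b*(x[t-1] + x[t])
# Note the (x[t-1] + x[t]) term: the rule implicitly convolves the input
# with a short causal filter. (Illustrative only, not Mamba-3's scheme.)
a, b, dt = -1.0, 1.0, 0.1
x = np.sin(np.linspace(0, 4, 64))          # toy input signal

h_euler, h_trap = 0.0, 0.0
for t in range(1, len(x)):
    # Forward Euler: uses only the current input.
    h_euler = (1 + dt * a) * h_euler + dt * b * x[t]
    # Trapezoidal: averages the input at t-1 and t.
    denom = 1 - dt * a / 2
    h_trap = ((1 + dt * a / 2) / denom) * h_trap \
             + (dt / (2 * denom)) * b * (x[t - 1] + x[t])
```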
2. Complex-Valued State Tracking
Mamba-3 models a complex-valued SSM system. Complex numbers can represent rotations and oscillations that real numbers cannot, expanding the model’s ability to track long-range dependencies within the fixed state.
In practice, this is implemented via data-dependent RoPE embeddings — reusing existing transformer infrastructure rather than rebuilding complex arithmetic kernels from scratch.
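The equivalence being exploited is elementary: rotating a 2-D pair of real numbers is the same as multiplying a complex number by a unit phase `e^{i*theta}`. If the angle is made data-dependent per token (as RoPE machinery already supports), a real-valued implementation can emulate a complex-valued state. A minimal check of that identity, with made-up angles:

```python
import numpy as np

# Identity behind the trick: a 2-D rotation of (h0, h1) equals multiplying
# the complex number h0 + i*h1 by e^{i*theta}. Data-dependent RoPE varies
# theta per token, letting real arithmetic emulate a complex-valued SSM
# state. (Illustrative only; angles below are made up.)
def rotate_pair(h, theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * h[0] - s * h[1], s * h[0] + c * h[1]])

h_real = np.array([1.0, 0.0])      # real 2-D state pair
h_cplx = 1.0 + 0.0j                # same state as one complex number
for theta in [0.3, -0.1, 0.7]:     # pretend these are token-dependent angles
    h_real = rotate_pair(h_real, theta)
    h_cplx *= np.exp(1j * theta)

assert np.allclose(h_real, [h_cplx.real, h_cplx.imag])
```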
3. MIMO: Parallel SSMs, Accuracy for Free
Standard SSMs are SISO — single-input, single-output. Mamba-3’s MIMO variant runs multiple SSMs in parallel within each layer.
The result: +1 percentage point accuracy at 1B scale, better retrieval, essentially no decode latency penalty. The key insight: decoding is memory-bound, not compute-bound. The extra parallel streams use GPU cores that were sitting cold anyway. Training time goes up — you can’t hide those FLOPs during the forward pass — but inference cost stays flat.
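In shapes, the difference looks like this. A hypothetical sketch (dimensions and `r=4` chosen for illustration): MIMO updates `r` states from the same token in one batched operation, so the extra arithmetic rides along with the single memory pass over the state, which is why memory-bound decode latency barely moves.

```python
import numpy as np

# Shape-level sketch of SISO vs. MIMO state input-projections (hypothetical
# dimensions; r=4 parallel streams as in the MIMO variant described above).
d_model, d_state, r = 8, 16, 4
rng = np.random.default_rng(1)
x_t = rng.standard_normal(d_model)

# SISO: one state update per layer.
B = rng.standard_normal((d_state, d_model))
h = B @ x_t                                   # shape (d_state,)

# MIMO: r state updates from the same token, batched into one operation.
# The extra FLOPs land on compute units that decode leaves idle, so the
# memory-bound decode latency is essentially unchanged.
B_mimo = rng.standard_normal((r, d_state, d_model))
H = np.einsum('rsd,d->rs', B_mimo, x_t)       # shape (r, d_state)

assert h.shape == (d_state,)
assert H.shape == (r, d_state)
```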
The Numbers
At 1.5B scale on a single H100-SXM, prefill+decode latency (seconds):
| Model | n=512 | n=1024 | n=4096 | n=16,384 |
|---|---|---|---|---|
| Llama-3.2-1B (vLLM) | 4.45 | 9.60 | 58.64 | 976.50 |
| Gated DeltaNet | 4.56 | 9.11 | 36.41 | 145.87 |
| Mamba-2 | 4.66 | 9.32 | 37.22 | 149.02 |
| Mamba-3 SISO | 4.39 | 8.78 | 35.11 | 140.61 |
| Mamba-3 MIMO | 4.74 | 9.48 | 37.85 | 151.81 |
Mamba-3 SISO is the fastest model at every sequence length — including Llama on its best-optimized serving stack. At 16K tokens, the Transformer takes 976 seconds for the same batch that Mamba-3 handles in 140.
How to Use It
The Mamba-3 kernels are fully open-source, released by Together AI. Here’s how to get started:
If you’re replacing Mamba-2 in an existing pipeline: Use the SISO variant. It matches Mamba-2 exactly in architecture shapes (model dimensions, state size) and is a direct drop-in — just faster.
If you want better accuracy with similar decode speed: Use MIMO (r=4). Training will take longer, but inference latency is comparable to Mamba-2 while accuracy beats it by >1 point.
If you’re building hybrid models: The team predicts SSM layers interleaved with occasional self-attention will dominate going forward. Mamba-3 is the best SSM layer for that hybrid — expressive enough to handle context compression well, fast enough to justify using it over pure attention.
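The interleaving pattern can be sketched abstractly. Everything here is hypothetical, not the released API: the layer names are placeholder strings and the 1-in-6 attention ratio is an arbitrary example, chosen only to show the shape of an SSM-heavy hybrid stack.

```python
# Hypothetical hybrid-stack sketch. "mamba3" / "attention" are placeholder
# labels, not real module names, and the 1-in-6 ratio is an arbitrary
# example: mostly cheap fixed-state SSM layers, with occasional
# self-attention layers for exact retrieval over the full context.
n_layers, attn_every = 24, 6

layers = [
    "attention" if (i + 1) % attn_every == 0 else "mamba3"
    for i in range(n_layers)
]

assert layers.count("attention") == n_layers // attn_every   # 4 of 24
assert layers.count("mamba3") == n_layers - n_layers // attn_every
```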
Kernel stack for custom implementations:
- Triton — SISO prefill, controlled tiling and kernel fusion, Hopper TMA support
- TileLang — MIMO prefill, explicit memory hierarchy control
- CuTe DSL — decode kernels, clean implementation, very fast
For a broader look at how these architecture choices compare across the model zoo, Sebastian Raschka’s LLM Architecture Gallery is the best single reference.
The Honest Tradeoff
Mamba-3 still can’t match Transformers on exact retrieval. Fixed state means compression, and compression is lossy. The team is explicit: hybrid architectures — SSM layers interleaved with self-attention — will be the dominant pattern. SSM handles context compression; attention handles the exact lookups.
This connects to a broader divergence in how researchers think about AI memory and representation. LeCun’s JEPA work argues that next-token prediction itself is the wrong objective for world modeling. Mamba-3 doesn’t weigh in on that debate — it’s purely about doing the current paradigm as efficiently as possible. But the systems-level argument holds regardless of the architecture: inference is now the bottleneck, and every design choice should reflect that.
Mamba-3 does.
Full paper and open-source kernels via Together AI’s blog. Research collaboration between CMU, Princeton University, Cartesia AI, and Together AI.