Together AI's new state space model delivers faster decode speeds than Transformers while staying open-source. What builders need to know about the inference efficiency shift.

Reduce inference latency and memory costs for streaming/real-time applications while maintaining full control over your model and deployment stack.
Signal analysis
We track inference model releases through a single lens: what changes the cost-performance tradeoff for builders. Mamba-3 moves that needle. Together AI released an open-source state space model (SSM) that outperforms Mamba-2 while delivering meaningfully faster decode speeds than Transformer-based models. This isn't incremental. The architectural shift from attention mechanisms to SSMs addresses a concrete problem - decode latency and per-token memory requirements that compound in production.
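To make the per-token memory claim concrete, here is a minimal back-of-envelope sketch - not the Mamba-3 implementation, and the layer/head/state shapes are illustrative assumptions, not published Mamba-3 dimensions. It contrasts a Transformer's KV cache, which grows with every generated token, against an SSM's fixed-size recurrent state.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Transformer decode: keys and values are cached for every past token,
    # so per-request memory grows linearly with sequence length.
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, d_state, bytes_per_elem=2):
    # SSM decode: a fixed-size recurrent state per layer, independent of
    # how many tokens have already been generated.
    return n_layers * d_model * d_state * bytes_per_elem

# Hypothetical 7B-class shapes, for illustration only.
for seq_len in (1_024, 8_192, 65_536):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, d_model=4_096, d_state=16)
    print(f"{seq_len:>6} tokens: KV {kv / 2**20:9.1f} MiB | SSM {ssm / 2**20:5.1f} MiB")
```

Under these assumed shapes the KV cache is already hundreds of MiB per request at modest context lengths and keeps growing, while the SSM state stays constant - which is the mechanism behind the decode-speed and memory claims above.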
The model is available open-source from day one, which removes licensing friction. You can run it on your infrastructure, fine-tune it without vendor approval, or integrate it into products where closed-source APIs create bottlenecks. The decode speed advantage matters most for real-time applications - chat interfaces, streaming completions, or any system where latency directly shapes user experience.
Mamba-3 sits in an interesting position: it's not attempting to beat GPT-4 on capabilities. It's optimized specifically for the inference workload that dominates builder costs - the decode phase, where models generate tokens one at a time. That focus shapes how useful it is for your stack.
The decode speed advantage makes Mamba-3 valuable for specific workloads, not all workloads. If you're building chat applications, streaming completions, or systems where sub-100ms latency impacts retention, this model should be in your evaluation matrix. The open-source status means you can benchmark it against your current setup without procurement delays.
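If you do benchmark it against your current setup, the two numbers that matter for streaming workloads are time-to-first-token and inter-token latency. A minimal measurement harness - the `fake_stream` stub below is a hypothetical stand-in for whatever streaming endpoint you actually call:

```python
import time

def measure_decode(stream):
    # stream: any iterator yielding tokens. Returns (time-to-first-token,
    # mean inter-token latency), both in seconds.
    start = time.perf_counter()
    stamps = []
    for _ in stream:
        stamps.append(time.perf_counter())
    ttft = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

def fake_stream(n=50, delay=0.001):
    # Stub standing in for a real model's streaming response.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, itl = measure_decode(fake_stream())
print(f"TTFT {ttft * 1e3:.1f} ms, inter-token {itl * 1e3:.2f} ms")
```

Run the same harness against both your incumbent model and Mamba-3 under identical prompts and batch conditions; the comparison is only meaningful if the serving stack is held constant.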
For latency-sensitive applications at scale, the memory efficiency argument compounds. Lower per-token memory means you can serve more concurrent requests on the same hardware. On commodity GPUs or in CPU-constrained environments, that translates directly to cost reduction. Builders deploying at volume should test this specifically - the theoretical improvement becomes either marginal or transformative depending on your batch size and hardware profile.
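The concurrency math is simple enough to sketch. Assuming (hypothetically) 40 GiB of accelerator memory left for per-request state after weights, and the illustrative per-request figures from above - roughly 512 MiB of KV cache at long context versus a few MiB of SSM state:

```python
def max_concurrent(budget_gib, per_request_mib):
    # How many decode streams fit when each holds per_request_mib of state.
    return int(budget_gib * 1024 // per_request_mib)

# Assumed numbers for illustration, not measured figures.
print(max_concurrent(40, 512))  # KV-cache-bound serving
print(max_concurrent(40, 4))    # SSM-state-bound serving
```

The gap between those two numbers is why the memory argument "compounds": per-request state, not compute, is often what caps batch size on a given card.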
The tradeoff: Mamba-3 is not a generalist replacement for larger models. If you need cross-domain capability, reasoning depth, or specialized knowledge, you're likely still reaching for larger models in your pipeline. The win here is using the right tool for the right phase - Mamba-3 for high-velocity decode, larger models for initial token generation or complex reasoning.
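That "right tool for the right phase" pattern is just a routing rule. A toy sketch, with entirely hypothetical task fields and model names, of what the dispatch layer might look like:

```python
def pick_model(task):
    # Hypothetical routing rule: latency-sensitive streaming with low
    # reasoning complexity goes to the SSM; everything else goes to a
    # larger generalist model.
    if task.get("streaming") and task.get("complexity", "low") == "low":
        return "mamba-3"
    return "large-generalist"

print(pick_model({"streaming": True, "complexity": "low"}))
print(pick_model({"streaming": True, "complexity": "high"}))
```

In practice the routing signal (streaming vs. batch, reasoning depth) would come from your request metadata or a lightweight classifier, not hand-set flags.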
Mamba-3 represents a broader trend - state space models moving from research curiosity to production infrastructure. The Mamba family (starting with Mamba-1) has been incrementally proving that SSM architectures can compete with Transformers on capability while winning decisively on efficiency. Each release from Together AI and other teams narrows the capability gap while the efficiency advantage grows.
This signals two things builders should track: first, the Transformer dominance in inference is not inevitable. Alternative architectures can deliver better tradeoffs for specific use cases. Second, the open-source infrastructure for SSMs is maturing. You can now run, fine-tune, and optimize these models without depending on a single vendor or API.
The competitive dynamics matter. If SSMs prove superior for inference economics, pressure increases on model labs to release inference-optimized variants. Closed-source vendors may offer smaller, specialized models to compete. Builders benefit from this convergence - more choice, faster optimization cycles, and the ability to control your inference stack.
More updates in the same lane.
Inngest's latest update introduces Durable Endpoints streaming support, improving long-running workflow management for developers.
Cloudflare MCP now offers visualized workflows through step diagrams, enhancing understanding and usability for developers.
Cloudflare MCP's new client-side security tools enhance detection capabilities, reducing false positives significantly while safeguarding against zero-day exploits.