Jamba
AI21 Labs · March 2024
Why It Matters
First production-grade hybrid Mamba-Transformer model. It showed that interleaving linear-time Mamba layers with Transformer attention layers could match the quality of pure Transformer models at much lower compute cost.
Description
Jamba is the first production-grade model to combine two different neural network architectures: Mamba (a selective state space model that processes sequences in linear time, so its cost grows in proportion to text length rather than quadratically) and the standard Transformer (whose attention mechanism excels at capturing relationships between distant parts of a text). This hybrid approach, combined with a Mixture-of-Experts design (where only a fraction of the model's parameters activate for each input), delivered strong performance at significantly lower computational cost than pure Transformer models.
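To make the two ideas in this description concrete, here is a minimal sketch in plain Python. It is not AI21's implementation: the function names, the 8-layer block, the position of the attention layer, the 16 experts, and the top-2 routing are all assumptions chosen for illustration. The first function lays out which layers use Mamba and which use attention; the second shows how a Mixture-of-Experts router activates only a few experts per input.

```python
import math
import random


def build_layer_schedule(num_layers, attention_every, moe_every):
    """Return one (mixer, feed_forward) pair per layer.

    Hypothetical layout: every `attention_every`-th layer uses attention and
    the rest use Mamba; every `moe_every`-th layer uses a Mixture-of-Experts
    feed-forward block and the rest use a dense one.
    """
    schedule = []
    for i in range(num_layers):
        mixer = "attention" if i % attention_every == attention_every - 1 else "mamba"
        feed_forward = "moe" if i % moe_every == moe_every - 1 else "dense"
        schedule.append((mixer, feed_forward))
    return schedule


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Only the selected experts would run for this input, which is why just a
    fraction of the model's parameters is active at a time.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


# Illustrative values only (assumptions, not taken from this page).
schedule = build_layer_schedule(num_layers=8, attention_every=8, moe_every=2)

rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(16)]  # stand-in router scores for 16 experts
routes = top_k_route(logits, k=2)

for index, (mixer, feed_forward) in enumerate(schedule):
    print(index, mixer, feed_forward)
print(routes)
```

With these assumed settings, the block contains seven Mamba layers and one attention layer, and each input is processed by two of the sixteen experts in every MoE layer.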
Related Research (1)
Introduced selective state space models that process sequences in linear time (vs. quadratic for Transformers), with a data-dependent selection mechan…