Jamba

AI21 Labs · March 2024

Active · Open Weight · Hybrid Mamba-Transformer · Text
Parameters: 52B (12B active)
Context Window: 256K tokens

Why It Matters

First production-grade hybrid Mamba-Transformer model. It showed that combining linear-time Mamba layers with Transformer attention could match pure Transformer quality at far lower compute cost.

Description

Jamba was the first production-grade model to combine two different neural network architectures: Mamba, which processes sequences in linear time and so scales efficiently to very long texts, and the standard Transformer, which excels at capturing relationships between distant parts of a text. The hybrid is paired with a Mixture-of-Experts (MoE) design in which only a fraction of the model's parameters activate for each input (12B of 52B here). Together, these delivered strong performance at significantly lower computational cost than pure Transformer models.
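
To make the "linear time" versus attention contrast concrete, here is a minimal sketch that counts idealized operations only. It assumes full self-attention touches every token pair (n²) and a linear-time scan performs one state update per token (n). It ignores constants, hidden sizes, and hardware effects, and the function names are illustrative, not taken from any Jamba code. It also assumes 256K means 262,144 tokens.

```python
def attention_pair_count(seq_len: int) -> int:
    """Token-pair interactions in full self-attention: grows as n^2."""
    return seq_len * seq_len


def scan_step_count(seq_len: int) -> int:
    """Sequential state updates in a linear-time scan: grows as n."""
    return seq_len


def growth_when_doubling(cost_fn, seq_len: int) -> float:
    """How much the idealized cost grows when the sequence length doubles."""
    return cost_fn(2 * seq_len) / cost_fn(seq_len)


if __name__ == "__main__":
    for n in (4_096, 65_536, 262_144):  # 262,144 = 256K, assuming K = 1024
        ratio = attention_pair_count(n) // scan_step_count(n)
        print(f"n={n:>7}: attention/scan idealized cost ratio = {ratio}")
    print("doubling n, attention:", growth_when_doubling(attention_pair_count, 4_096))
    print("doubling n, scan:     ", growth_when_doubling(scan_step_count, 4_096))
```

Under these assumptions, doubling the input quadruples the attention term but only doubles the scan term, which is why replacing most attention layers matters more as the context grows.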

Key Innovations

MoE: Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
Long Context: Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.
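
The MoE idea above (only some parameters run per input) can be sketched as top-k routing: a router scores every expert, and only the k best-scoring experts are evaluated. This is a generic illustration in plain Python, not Jamba's actual router. The expert functions, the scores, and k=2 are all hypothetical values chosen for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k=2):
    """Weighted sum over the selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 2.0]  # made-up router scores for one token

if __name__ == "__main__":
    print(route_top_k(scores, k=2))
    print(moe_forward(1.0, experts, scores, k=2))
```

With 4 experts and k=2, half of the expert parameters are skipped for this token. For the figures listed on this page, 12B active out of 52B total is roughly 23% of the parameters per input.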

Family Tree

Built On

Lineage

Jurassic-2 → Jamba

Successors (1)

Related Research (1)

Mamba (Architecture)
2023 · Carnegie Mellon University / Princeton

Introduced selective state space models that process sequences in linear time (vs. quadratic for Transformers), with a data-dependent selection mechan…
