Jamba

AI21 Labs · March 2024

Active · Open Weight · Hybrid Mamba-Transformer · Text

Parameters: 52B (12B active)

Context Window: 256K tokens

Why It Matters

First production-grade hybrid Mamba-Transformer model. It showed that combining linear-time Mamba layers with Transformer attention could match pure Transformer quality at far lower compute cost.

Description

Jamba is the first production-grade model to combine two different neural network architectures: Mamba (which processes sequences in linear time, meaning it scales efficiently to very long texts) and the standard Transformer (which excels at capturing relationships between distant parts of a text). This hybrid approach, combined with a Mixture-of-Experts design (where only a fraction of the model's parameters activate for each input), delivered strong performance at significantly lower computational cost than pure Transformer models.
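To make the linear-versus-quadratic distinction concrete, here is a minimal sketch that compares how the work grows with sequence length. The cost functions are abstract unit counts (one unit per token pair for attention, one per token for a linear-time scan), not measured FLOPs for Jamba, and the function names and the 4,000-token baseline are assumptions chosen for illustration.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise token interactions in full self-attention: grows as n^2."""
    return seq_len * seq_len


def linear_scan_cost(seq_len: int) -> int:
    """One state update per token in a linear-time scan: grows as n."""
    return seq_len


def growth_factor(cost_fn, short_len: int, long_len: int) -> float:
    """How many times more work the longer sequence needs under cost_fn."""
    return cost_fn(long_len) / cost_fn(short_len)


if __name__ == "__main__":
    # 256_000 mirrors the listed 256K context window; 4_000 is an arbitrary baseline.
    short_len, long_len = 4_000, 256_000
    print("attention growth:", growth_factor(attention_cost, short_len, long_len))
    print("linear scan growth:", growth_factor(linear_scan_cost, short_len, long_len))
```

Under these simplified counts, a 64x longer input costs 64x more for the linear-time layers but 4096x more for full attention, which is why a hybrid that keeps only some attention layers can handle long inputs more cheaply.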

Key Innovations

MoE

Architecture where only a fraction of the model's parameters is active for each input, allowing massive scale with lower compute.
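The idea of activating only some parameters per input can be sketched with a toy top-k router. This is an illustration of the general Mixture-of-Experts technique, not Jamba's actual implementation: the experts here are plain Python functions, and the router scores, the expert count, and k are hypothetical values.

```python
import math


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


def active_fraction(active_params_b, total_params_b):
    """Share of parameters used per input, given counts in billions."""
    return active_params_b / total_params_b


if __name__ == "__main__":
    # Four toy experts, each scaling its input by a different factor (hypothetical).
    experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
    scores = [0.0, 2.0, 1.0, -1.0]  # hypothetical router output for one token
    print(route_top_k(scores, k=2))
    print(moe_forward(1.0, experts, scores, k=2))
    # Using the figures listed above: 12B active out of 52B total.
    print(f"{active_fraction(12, 52):.1%}")
```

With the parameter counts listed for Jamba, roughly 23% of the weights take part in processing any single input.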

Long Context

Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.

Family Tree

Related Research

Mamba · Architecture

2023 · Carnegie Mellon University / Princeton

Introduced selective state space models that process sequences in linear time (vs. quadratic for Transformers), with a data-dependent selection mechan…

External Links

Research Paper

More from AI21 Labs

Jurassic-2 · 2023-03 · 178B

Jamba 1.5 · 2024-08 · 398B (94B active)
