LLaMA 4
Meta · April 2025
Why It Matters
First LLaMA release built on a Mixture-of-Experts architecture. Its Scout variant offers a 10-million-token context window, the largest of any open model at its release, while remaining efficient enough to run on a single server node.
Description
Meta's first Mixture-of-Experts (MoE) LLaMA. An MoE model contains multiple specialized sub-networks ('experts') and activates only a few of them for each input token, so the compute spent per token is much lower than the total parameter count suggests. The Scout variant uses 16 experts with 17B active parameters (109B total) and supports a 10-million-token context window, enough to process dozens of books at once. It natively handles text, images, and video.
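The "activate only a few experts per input" idea can be illustrated with a toy top-k router. This is a minimal sketch, not Meta's implementation: the source does not say how many experts fire per token or how the router is trained, so `top_k`, `route_token`, `moe_layer`, and the toy experts below are all hypothetical names and values chosen for illustration. Only the expert count (16) and the 17B/109B parameter figures come from the description above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=1):
    """Pick the top_k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs. top_k is an assumption;
    the source only says that "a few" experts are activated.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_logits, top_k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)  # every other expert is skipped
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# 16 toy "experts": expert i simply scales the input vector by i.
NUM_EXPERTS = 16
experts = [lambda x, s=i: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Hypothetical router scores for one token; expert 3 scores highest.
logits = [0.0] * NUM_EXPERTS
logits[3] = 5.0
result = moe_layer([1.0, 2.0], experts, logits, top_k=1)

# Share of parameters used per token, from the figures quoted above.
active_fraction = 17 / 109  # roughly 0.16
```

With `top_k=1`, only expert 3 runs for this token, which is why an MoE model's per-token cost tracks its active parameters (about 16% of the total for Scout) rather than its full size.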
Notable Milestones
- 10M-token context window, the largest of any open model
- First open MoE model from Meta
- Native multimodal: text, image, and video understanding
Related Research (2)
Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
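To make the KV-cache saving from grouped-query attention concrete, here is a back-of-the-envelope sketch. The layer count, head counts, head dimension, and sequence length are hypothetical values picked for illustration and are not LLaMA 4's actual configuration. The formula assumes a standard decoder that caches one key and one value vector per KV head, per layer, per token.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size for one sequence.

    The leading factor of 2 counts the key tensor and the value tensor.
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape (NOT taken from the source).
N_LAYERS = 48
N_QUERY_HEADS = 40
HEAD_DIM = 128
SEQ_LEN = 1_000_000

# Multi-head attention: one KV head per query head.
mha = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
# Grouped-query attention: query heads share 8 KV heads (assumed group count).
gqa = kv_cache_bytes(SEQ_LEN, N_LAYERS, 8, HEAD_DIM)
# Multi-query attention: all query heads share a single KV head.
mqa = kv_cache_bytes(SEQ_LEN, N_LAYERS, 1, HEAD_DIM)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Under these assumed numbers, cache size scales linearly with the number of KV heads, so sharing 8 KV heads across 40 query heads cuts the cache to one fifth of the multi-head figure. That saving grows in absolute terms as the context window gets longer.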