LLaMA 4

Meta · April 2025

Active · Open Weight · Mixture of Experts · Multimodal · API Available
Parameters: 17B active, 109B total (Scout) / 17B active, 400B total (Maverick)
Context Window: 10M tokens (Scout)
Variants: Scout, Maverick

Why It Matters

The first LLaMA release to adopt a Mixture-of-Experts architecture. Scout offers a 10-million-token context window, the largest of any open model at its release, while remaining efficient enough to run on a single server node.

Description

Meta's first Mixture-of-Experts (MoE) LLaMA. An MoE architecture uses multiple specialized sub-networks ('experts') and activates only a few for each input, so each token costs far less compute than it would in a dense model of the same total size. Scout uses 16 expert networks with 17B active parameters (109B total) and supports a 10-million-token context window, enough to process dozens of books at once. It natively handles text, images, and video.
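
The routing idea described above can be illustrated with a minimal sketch. This is a generic top-k gating scheme written in plain Python, not Meta's actual implementation: the toy experts, the router logits, and the function names `route_top_k` and `moe_forward` are all hypothetical. Only the expert count (16) and the parameter figures (17B active, 109B total) come from the description.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize the gate weights over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# 16 toy experts (hypothetical stand-ins for feed-forward sub-networks).
experts = [lambda x, scale=i + 1: scale * x for i in range(16)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(16)]  # made-up router scores

chosen = route_top_k(logits, k=1)
y = moe_forward(2.0, experts, logits, k=1)

# Fraction of Scout's parameters used per token, from the figures above.
active_fraction = 17 / 109
print(chosen, y, round(active_fraction, 3))
```

With `k=1` only one of the 16 toy experts runs for the input, which is why the active parameter count can be a small fraction (about 16% for Scout) of the total.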

Notable Milestones

  • 10M token context window — largest of any open model
  • First open MoE model from Meta
  • Native multimodal: text, image, and video understanding

Benchmark Scores

GPQA (graduate-level science QA): 69.8%

Key Innovations

Multimodal: Processing multiple types of input (text, images, audio, video) in a single model.
MoE: Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
Open Weight: Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.
Long Context: Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.

Family Tree

Built On

Related Research (2)

RoPE · Architecture
2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
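
As a rough illustration of the rotation idea in this summary, the sketch below applies a rotary embedding to a small vector in plain Python. It assumes the interleaved pairing of dimensions and a base of 10000; real implementations differ in pairing layout and scaling, and the function name `rope` and the sample vectors are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair j = i / 2 rotates with frequency base ** (-2j / d) = base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]   # made-up query vector
k = [0.2, -1.0, 0.7, 0.4]   # made-up key vector

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
print(score_near, score_far)
```

The two scores agree up to floating-point error, because the attention score between rotated vectors depends only on the distance between positions, not on where the pair sits in the sequence.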

GQA
2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
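
To make the KV cache saving concrete, here is a back-of-the-envelope calculation in Python. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for illustration, not LLaMA 4's actual configuration, and `kv_cache_bytes` is an illustrative helper rather than a library function.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: keys plus values, for every layer and KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape with 32 query heads, storing values in 16-bit floats.
cfg = dict(seq_len=100_000, n_layers=32, head_dim=128)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # grouped-query: 4 query heads share a KV head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # multi-query: all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, which is the memory trade-off that grouped-query attention sits in the middle of.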

External Links