LLaMA 4

Meta · April 2025

● Active · Open Weight · Mixture of Experts · Multimodal · API Available

Parameters: 17B active / 109B total (Scout); 17B active / 400B total (Maverick)

Context Window: 10M tokens (Scout)

Variants: Scout, Maverick
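The gap between "active" and "total" parameters is what makes MoE cheap to run. The sketch below uses Scout's 17B active / 109B total and Meta's published 400B total for Maverick. The FLOPs estimate assumes the common rule of thumb of roughly 2 FLOPs (one multiply, one add) per active parameter per token, which ignores attention cost over long contexts.

```python
# Parameter counts in billions (Scout: 17B/109B; Maverick: 17B/400B).
VARIANTS = {
    "Scout": {"active_b": 17, "total_b": 109},
    "Maverick": {"active_b": 17, "total_b": 400},
}

def active_fraction(active_b: float, total_b: float) -> float:
    """Share of all weights that take part in one token's forward pass."""
    return active_b / total_b

def approx_gflops_per_token(active_b: float) -> float:
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter.

    Billions of parameters map directly to GFLOPs. Attention over the
    context window is ignored, so this underestimates long-context cost.
    """
    return 2 * active_b

for name, v in VARIANTS.items():
    frac = active_fraction(v["active_b"], v["total_b"])
    gflops = approx_gflops_per_token(v["active_b"])
    print(f"{name}: {frac:.1%} of weights active, ~{gflops:.0f} GFLOPs/token")
```

Both variants pay roughly the same per-token compute; Maverick's extra total parameters mostly cost memory, not FLOPs.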

Why It Matters

The first LLaMA model to adopt a Mixture-of-Experts architecture. Scout offers a 10-million-token context window, the largest of any open model at its release, while remaining efficient enough to run on a single server node.

Description

Meta's first Mixture-of-Experts (MoE) LLaMA model. MoE is an architecture that contains multiple specialized sub-networks ("experts") and activates only a few of them for each input token, so the model is much cheaper to run than a dense model of the same total size. Scout uses 16 experts, with 17B active parameters out of 109B total, and supports a 10-million-token context window, enough to process dozens of books at once. The models natively handle text, images, and video.
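The "dozens of books" claim can be sanity-checked with rough arithmetic. Both constants below are assumptions, not Meta's figures: about 0.75 English words per token (a common rule of thumb that varies by tokenizer) and about 90,000 words for a full-length book.

```python
CONTEXT_TOKENS = 10_000_000  # Scout's advertised context window
WORDS_PER_TOKEN = 0.75       # assumption: rough English rule of thumb
WORDS_PER_BOOK = 90_000      # assumption: a typical full-length novel

def tokens_per_book(words: int = WORDS_PER_BOOK, wpt: float = WORDS_PER_TOKEN) -> int:
    """Approximate token count for one book."""
    return round(words / wpt)

def books_in_context(context: int = CONTEXT_TOKENS) -> int:
    """Whole books that fit, ignoring instructions and output tokens."""
    return context // tokens_per_book()

print(tokens_per_book())   # → 120000
print(books_in_context())  # → 83
```

Under these assumptions roughly 80 books fit, consistent with "dozens"; a tokenizer that is less efficient on a given language would lower the count.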

Notable Milestones

▸10M-token context window, the largest of any open model at release
▸First open MoE model from Meta
▸Native multimodal: text, image, and video understanding

Benchmark Scores

GPQA (graduate-level science QA): 69.8% (Maverick)

Key Innovations

Multimodal

Processing multiple types of input (text, images, audio, video) in a single model.

MoE

An architecture in which only a fraction of the model's parameters is active for each input token, allowing massive scale at a lower compute cost.
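A toy illustration of how "only a fraction of the parameters is active": a router scores every expert, and only the top-k experts run for each token. This is a generic top-k routing sketch, not Meta's actual router; the dimensions, the linear "experts" (real experts are feed-forward MLPs), and the class name `TopKMoE` are all hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class TopKMoE:
    """Hypothetical toy MoE layer: a linear router picks k experts per token."""

    def __init__(self, dim, num_experts, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.router = random_matrix(num_experts, dim, rng)
        # Simplification: each expert is one linear map instead of an MLP.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        self.calls = [0] * num_experts  # how often each expert actually ran

    def forward(self, x):
        logits = matvec(self.router, x)  # one routing score per expert
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.k]
        gates = softmax([logits[i] for i in chosen])  # weights over chosen experts only
        out = [0.0] * len(x)
        for gate, i in zip(gates, chosen):
            self.calls[i] += 1  # unchosen experts do no compute for this token
            y = matvec(self.experts[i], x)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out, chosen

# Usage: 16 experts (Scout's count) with top-1 routing; other values illustrative.
moe = TopKMoE(dim=8, num_experts=16, k=1)
rng = random.Random(42)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(10)]
for t in tokens:
    out, chosen = moe.forward(t)
print("expert calls:", moe.calls)
```

With k=1, each token touches one expert's weights, so per-token compute stays flat as more experts (and total parameters) are added.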

Open Weight

Model weights are publicly released, but the training data and code may not be. This enables fine-tuning but not full reproduction.

Long Context

The ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.

Related Research (2)

RoPE · Architecture

2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
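A minimal sketch of rotary position embeddings, assuming the RoFormer formulation: consecutive dimension pairs are rotated by an angle proportional to the token position, with frequencies `base ** (-2i/d)` and `base = 10000`. Some implementations pair the first and second halves of the vector instead of adjacent dimensions; this sketch uses adjacent pairs. The key property is that the dot product of a rotated query and key depends only on their relative offset.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2j], x[2j+1]) by angle pos * base**(-2j/d)."""
    d = len(x)
    if d % 2:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(0, d, 2):  # i = 2j
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        # 2D rotation matrix [[c, -s], [s, c]] applied to the pair
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1]
k = [1.1, 0.4, -0.9, 0.2, 0.6, -1.5]
# Same offset (3) at very different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 2)), dot(rope(q, 103), rope(k, 100)))
```

Because rotations preserve length, RoPE changes only the angle between query and key, which is why attention scores encode relative rather than absolute position.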

Grouped-Query Attention · Architecture

2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
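The KV-cache saving from grouped-query attention can be estimated with simple arithmetic: the cache stores one key and one value vector per KV head, per layer, per token. The configuration below (48 layers, 40 query heads, 8 KV heads, head dimension 128, 2-byte bf16 values) is hypothetical and chosen for illustration, not taken from Meta's published configs; the 10M-token length matches Scout's context window.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes to cache K and V: 2 tensors x layers x KV heads x head_dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config, for illustration only.
CFG = dict(layers=48, head_dim=128, seq_len=10_000_000)

mha = kv_cache_bytes(kv_heads=40, **CFG)  # multi-head: one K/V head per query head
gqa = kv_cache_bytes(kv_heads=8, **CFG)   # grouped: 5 query heads share each K/V head
mqa = kv_cache_bytes(kv_heads=1, **CFG)   # multi-query: all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e12:.2f} TB")
```

GQA sits between the two extremes: it cuts the cache by the ratio of query heads to KV heads (5x here) while keeping more key/value capacity than MQA.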