LLaMA 3.3

Meta · December 2024

Active · Open Weight · Decoder-only · Text · API Available

Parameters: 70B

Context Window: 128K tokens

Description

A highly optimized 70B model that matches the much larger LLaMA 3.1 405B on many benchmarks. The gain is attributed to distillation, a technique in which a smaller model is trained to mimic the outputs of a larger, more capable one. It supports multiple languages and costs significantly less to run than the 405B model.

Notable Milestones

▸ Matches 405B-level performance at a fraction of the compute cost
▸ Strong multilingual text generation

Benchmark Scores

MMLU (Massive Multitask Language Understanding, 57 subjects): 86.0%

HumanEval (code generation, pass@1, Python problems): 88.4%

MATH (competition-level problems): 77.0%
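
The pass@1 metric listed for HumanEval can be illustrated with the unbiased pass@k estimator commonly used with that benchmark. This is a minimal sketch: the page does not say how the 88.4% figure was computed, and the sample counts below are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n sampled solutions, c of which passed."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results for three problems: (samples drawn, samples that passed).
example = [(10, 10), (10, 3), (10, 0)]
print(f"pass@1 = {benchmark_score(example, k=1):.3f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.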

Key Innovations

Open Weight

Model weights are publicly released, but the training data and code may not be. This enables fine-tuning but not full reproduction.

Distillation

Training a smaller 'student' model to mimic a larger 'teacher' model, preserving much of the teacher's capability at lower cost.
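
One common way to make a student mimic a teacher is to minimize the KL divergence between their temperature-softened output distributions. The sketch below shows that classic soft-label loss in plain Python. It is an illustration only: this page does not describe the actual training recipe, and the function names, temperature value, and logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher: target distribution
    q = softmax(student_logits, temperature)  # student: distribution being trained
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical next-token logits over a 4-token vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]
print(f"loss = {distillation_loss(student, teacher):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is often combined with the ordinary cross-entropy loss on ground-truth tokens.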

Related Research (2)

RoPE · Architecture

2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
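
The rotation idea can be sketched in a few lines: each adjacent pair of vector components is rotated by an angle proportional to the token position. This is a minimal illustration that assumes the adjacent-pair layout and a base of 10000. Real implementations differ in how they pair dimensions, and the vectors here are arbitrary.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each (even, odd) component pair of vec by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# The attention score depends only on the offset between positions (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

Because a rotation preserves vector length and the score depends only on the relative offset, the same query-key pair scores identically wherever it appears in the sequence.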

Grouped-Query Attention · Architecture

2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
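
The KV cache saving can be made concrete with a little arithmetic. The sketch below maps query heads to shared key/value heads and compares cache sizes. The layer count, head counts, head dimension, and sequence length are illustrative assumptions, not figures taken from this page.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared key/value head used by a given query head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 covers keys and values; bytes_per_value=2 assumes 16-bit storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed configuration for illustration only.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 128, 8192

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)            # 8 shared KV heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)            # a single shared KV head

for name, size in [("multi-head", mha), ("grouped-query", gqa), ("multi-query", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
```

Under these assumptions, the cache shrinks in direct proportion to the number of key/value heads, which is why grouped-query attention sits between the other two variants.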