LLaMA 3.3

Meta · December 2024

Active · Open Weight · Decoder only · Text · API Available
Parameters: 70B
Context Window: 128K tokens

Description

A highly optimized 70B model that matches the much larger LLaMA 3.1 405B on many benchmarks. This was achieved through distillation, a technique in which a smaller model is trained to mimic the outputs of a larger, more capable one. The model supports multiple languages and costs significantly less to run than the 405B model.

Notable Milestones

  • Matches LLaMA 3.1 405B-level performance on many benchmarks at a fraction of the compute cost
  • Strong multilingual text generation

Benchmark Scores

MMLU (Massive Multitask Language Understanding, 57 subjects): 86.0%
HumanEval (code generation pass@1, Python problems): 88.4%
MATH (competition-level problems): 77.0%
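
The pass@1 metric listed for HumanEval can be made concrete with the unbiased pass@k estimator that is commonly used with that benchmark. The sketch below is illustrative only: the page does not say how many samples per problem were drawn for the 88.4% figure, and the function name `pass_at_k` is a name chosen here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples for the problem
    c: number of those samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of samples that pass, so a benchmark-level pass@1 is the average per-problem success rate.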

Key Innovations

Open Weight: Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.
Distillation: Training a smaller 'student' model to mimic a larger 'teacher' model, preserving capability at lower cost.
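
To illustrate the student/teacher idea, here is a minimal sketch of the classic soft-label distillation loss: the KL divergence between temperature-softened teacher and student output distributions. This is an assumption for illustration; the page does not state which distillation variant was used for LLaMA 3.3, and the function names and the default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one token.

    A higher temperature flattens both distributions so the student also
    learns from the teacher's relative ranking of unlikely tokens.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    # Conventional T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is often mixed with the ordinary next-token cross-entropy loss.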

Related Research (2)

RoPE · Architecture
2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
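
The rotation idea can be sketched in a few lines. The example below rotates consecutive pairs of vector components by position-dependent angles, using the base of 10000 from the original RoPE formulation. It is a simplified illustration, not the implementation used in any particular model; some implementations pair components differently, and the function name `rope` is chosen here.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to one query or key vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by an angle of
    position * base ** (-i / d), so early pairs rotate quickly and later
    pairs rotate slowly.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        # Standard 2D rotation of the pair (x, y).
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Because every pair is rotated, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n. For example, `dot(rope(q, 5), rope(k, 3))` equals `dot(rope(q, 12), rope(k, 10))` up to floating-point error.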

Grouped-Query Attention (GQA)
2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
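
The KV cache saving is simple arithmetic: the cache stores one key and one value vector per layer, per KV head, per token, so sharing KV heads across groups of query heads shrinks it proportionally. The sketch below uses illustrative numbers (80 layers, 64 query heads, 8 KV heads, head dimension 128, 16-bit values, 128 × 1024 tokens). These are assumptions for the example and are not stated on this page.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Size of the key/value cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


# Hypothetical configuration, chosen only to illustrate the scaling.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 80, 64, 8, 128
SEQ_LEN = 128 * 1024  # assumes "128K" means 131072 tokens

GIB = 2 ** 30
# Multi-head attention: one KV head per query head.
mha_gib = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN) / GIB
# Grouped-query attention: each KV head is shared by a group of query heads.
gqa_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN) / GIB

print(f"MHA: {mha_gib} GiB, GQA: {gqa_gib} GiB")  # MHA: 320.0 GiB, GQA: 40.0 GiB
```

Under these assumed numbers the cache for a single full-length sequence drops by a factor of 8, matching the ratio of query heads to KV heads.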
