LLaMA 3.3
Meta · December 2024
Active · Open Weight · Decoder-only · Text · API Available
Parameters: 70B
Context Window: 128K tokens
Description
A highly optimized 70B model that matches the much larger LLaMA 3.1 405B on many benchmarks. This is achieved through distillation, a technique in which a smaller model is trained to mimic the outputs of a larger, more capable one. The model supports multiple languages and costs significantly less to run than the 405B model.
Notable Milestones
- Matches 405B-level performance at a fraction of the compute cost
- Strong multilingual text generation
Benchmark Scores
MMLU: 86.0% (Massive Multitask Language Understanding, 57 subjects)
HumanEval: 88.4% (code generation pass@1, Python problems)
MATH: 77.0% (competition-level problems)
Key Innovations
Open Weight
Model weights are publicly released, but the training data and code may not be. This enables fine-tuning but not full reproduction.
Distillation
Training a smaller 'student' model to mimic a larger 'teacher' model, preserving capability at lower cost.
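The student/teacher idea can be made concrete with a small sketch. The code below assumes the classic logit-matching form of distillation, in which the student is penalized by the KL divergence between temperature-softened teacher and student distributions. The page does not say which distillation variant was used for LLaMA 3.3, so the function names, the temperature value, and the example logits are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: logit-matching distillation; other variants exist.
    """
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the signal a training loop would minimize.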
Related Research (2)
RoPE · Architecture
2021 · Zhuiyi Technology
Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
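As a rough illustration of encoding position via rotation, the sketch below rotates each adjacent pair of vector dimensions by an angle proportional to the token position. It assumes the adjacent-pair convention and a frequency base of 10000; real implementations may pair dimensions differently, and the helper names and example vectors are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by position * base^(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation of the pair by angle theta.
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3 positions) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

Because rotations compose, the query-key dot product depends only on the relative offset between the two positions, so the two printed scores agree up to floating-point error.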
Grouped-Query Attention · Architecture
2023 · Google Research
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
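The KV cache saving can be estimated with simple arithmetic. The sketch below compares cache size for multi-head, grouped-query, and multi-query attention, and shows how query heads map onto shared key/value heads. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for round numbers; they are not LLaMA 3.3's actual configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of cached keys and values for one sequence.

    The leading factor 2 counts keys and values; bytes_per_value=2
    assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head used by a given query head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical model shape, for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)              # 8 shared KV heads
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)              # a single KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

With these assumed numbers, grouping 32 query heads onto 8 KV heads cuts the cache to a quarter of the multi-head size, while multi-query attention shrinks it further at the cost of sharing one KV head across all queries.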