DeepSeek V3
DeepSeek · December 2024
Why It Matters
Trained for a reported $5.5 million, DeepSeek V3 showed that frontier-level performance does not require billions of dollars in compute. It shook the AI industry's assumption that only big tech could compete.
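The headline cost is a GPU-hours-times-rental-rate estimate rather than a measured bill. The sketch below reproduces that arithmetic using the figures given in the DeepSeek-V3 technical report (about 2.788 million H800 GPU hours at an assumed rental price of $2 per GPU hour). The function and constant names are illustrative, and the estimate covers only the final training run, not earlier research or experiments.

```python
# Figures as stated in the DeepSeek-V3 technical report. The $2/hour rate is
# the report's own rental-price assumption, not an audited cost.
H800_GPU_HOURS = 2_788_000        # pre-training + context extension + post-training
RENTAL_USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Estimate training cost as GPU hours multiplied by an hourly rental rate."""
    return gpu_hours * usd_per_gpu_hour


cost = training_cost_usd(H800_GPU_HOURS, RENTAL_USD_PER_GPU_HOUR)
print(f"${cost / 1e6:.3f}M")  # $5.576M
```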
Description
A 671-billion-parameter model (37B active per token) that performed on par with GPT-4 and Claude 3.5 Sonnet while costing about $5.5 million to train, a fraction of what competitors spent. It used FP8 mixed-precision training (a technique that stores and multiplies most values as 8-bit floating-point numbers, cutting memory use and speeding up computation with little loss in quality) and multi-token prediction (training the model to predict several upcoming tokens at once rather than only the next one) to achieve frontier results on a budget.
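To give a feel for what "lower-precision numbers" means here, the following is a minimal sketch that simulates 8-bit floating-point rounding in ordinary Python floats. It assumes the common E4M3 layout (3 mantissa bits, largest value 448) and a block of 128 values sharing one scale factor. Both choices, and all function names, are assumptions for illustration. This is not DeepSeek's training code, and real FP8 training runs on dedicated GPU kernels.

```python
import math

E4M3_MAX = 448.0     # largest finite value in the FP8 E4M3 format
E4M3_MIN_EXP = -6    # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3    # E4M3 keeps 3 explicit mantissa bits


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest point on the E4M3 grid (simulated in float64)."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))       # saturate instead of overflowing
    if x == 0.0:
        return 0.0
    # Values below the smallest normal share the spacing of the lowest binade.
    exp = max(math.floor(math.log2(abs(x))), E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)        # gap between neighbouring values
    return round(x / step) * step


def quantize_block(block: list[float]) -> tuple[list[float], float]:
    """Scale a block so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in block)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in block], scale


def dequantize_block(q: list[float], scale: float) -> list[float]:
    """Map rounded values back to the original range."""
    return [v * scale for v in q]


# Hypothetical block of 128 activations.
block = [0.5 * math.sin(i) for i in range(128)]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

Giving each small block its own scale factor keeps a single outlier from forcing every other value in the tensor onto a coarse grid, which is one reason fine-grained scaling is a common companion to FP8 arithmetic.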
Notable Milestones
- Matched GPT-4-level performance at a fraction of the training cost
- Triggered significant stock-market reactions among AI chip companies
- Demonstrated FP8 training at scale for the first time
Related Research (2)
Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…
Demonstrated that pure RL training (without supervised fine-tuning on reasoning traces) can produce chain-of-thought reasoning, achieving performance …
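The key-value cache compression mentioned in the first related-research item above can be made concrete with a rough size comparison. The sketch below counts how many numbers must be cached per token per layer when storing full keys and values for every attention head versus storing one shared low-rank latent vector. The dimensions (128 heads of size 128, a 512-wide latent) and the function names are hypothetical values chosen for illustration, and the count ignores any extra per-token state a real implementation may keep.

```python
def full_kv_cache_elements(n_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer with separate keys and values per head."""
    return 2 * n_heads * head_dim


def latent_kv_cache_elements(latent_dim: int) -> int:
    """Elements cached per token per layer when only a shared latent is stored."""
    return latent_dim


# Hypothetical dimensions, for illustration only.
full = full_kv_cache_elements(n_heads=128, head_dim=128)
latent = latent_kv_cache_elements(latent_dim=512)
print(full, latent, f"{full / latent:.1f}x smaller")  # 32768 512 64.0x smaller
```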