Megatron-Turing NLG
NVIDIA / Microsoft · October 2021
● activeCloseddecoder onlytext
Parameters530B
Why It Matters
First 530B parameter model, proving massive scale was viable and laying groundwork for NVIDIA's AI model ambitions.
Description
A joint project between NVIDIA and Microsoft that produced the largest dense transformer model at the time of its release — 530 billion parameters. Dense means every parameter is used for every computation, unlike later mixture-of-experts models that only activate a subset. It demonstrated that massive scale was achievable with the right hardware and software infrastructure.
Key Innovations
Autoregressive
AutoregressiveGenerates text one token at a time, each prediction based on all previous tokens. The foundation of modern language models.
Scaling Laws
Scaling LawsMathematical relationships showing how model performance improves predictably with more data, compute, and parameters.
Family Tree
Successors (1)
Related Research (1)
Megatron-LMScaling
2019 · NVIDIA
Pioneered efficient model parallelism techniques enabling training of multi-billion parameter Transformers across GPUs.