RoBERTa

Meta · July 2019

activeOpen Sourceencoder onlytext
Context Window512 tokens

Why It Matters

Proved that BERT's training recipe mattered as much as its architecture, establishing best practices for pre-training that influenced all subsequent encoder models.

Description

Meta's 'Robustly Optimized BERT' approach demonstrated that BERT was significantly undertrained and that careful tuning of hyperparameters and training data size could substantially improve performance. Became the go-to baseline for NLP research.

Key Innovations

Masked LM
Masked LMTraining by randomly hiding words and having the model predict them — BERT's key innovation for understanding context.
Scaling Laws
Scaling LawsMathematical relationships showing how model performance improves predictably with more data, compute, and parameters.

Family Tree

Built On

Lineage

BERTRoBERTa

External Links