RoBERTa
Meta · July 2019
● activeOpen Sourceencoder onlytext
Context Window512 tokens
Why It Matters
Proved that BERT's training recipe mattered as much as its architecture, establishing best practices for pre-training that influenced all subsequent encoder models.
Description
Meta's 'Robustly Optimized BERT' approach demonstrated that BERT was significantly undertrained and that careful tuning of hyperparameters and training data size could substantially improve performance. Became the go-to baseline for NLP research.
Key Innovations
Masked LM
Masked LMTraining by randomly hiding words and having the model predict them — BERT's key innovation for understanding context.
Scaling Laws
Scaling LawsMathematical relationships showing how model performance improves predictably with more data, compute, and parameters.