DistilBERT
Hugging Face · October 2019
● active · Open Source · encoder only · text
Parameters: 66M
Description
Hugging Face's distilled version of BERT that retains 97% of BERT's language understanding capability while being 40% smaller and 60% faster. One of the first successful applications of knowledge distillation to large pre-trained language models.
Key Innovations
Distillation
Training a smaller 'student' model to mimic a larger 'teacher' model, preserving capability at lower cost.
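The student/teacher idea can be illustrated with a soft-target loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. The following is a minimal sketch in plain Python, not DistilBERT's actual training code; the function names, the temperature of 2.0, and the example logits are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature value is an illustrative assumption, not a documented setting.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]

print(distillation_loss(close_student, teacher))  # small: student agrees with teacher
print(distillation_loss(far_student, teacher))    # larger: student disagrees
```

A student whose logits match the teacher's gets a loss of zero, and the loss grows as the two distributions drift apart. In practice a term like this is usually combined with other training objectives rather than used alone.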
Masked LM
Training by randomly hiding words and having the model predict them from the surrounding text. This is BERT's key innovation for understanding context.
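The masking step can be sketched as follows. This is a simplified illustration in plain Python, assuming whitespace tokens and a 15% masking rate; the `mask_tokens` helper and its parameters are hypothetical, and the sketch always substitutes `[MASK]`, whereas BERT's published recipe also sometimes keeps the original token or swaps in a random one.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and record what the model should predict.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so only hidden positions are scored.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored only on the positions where `labels` is not `None`, which forces it to use context on both sides of each hidden word.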