DistilBERT

Hugging Face · October 2019

Active · Open Source · Encoder-only · Text

Parameters: 66M

Description

Hugging Face's distilled version of BERT, which retains 97% of BERT's language understanding capability while being 40% smaller and 60% faster. One of the first successful applications of knowledge distillation to large pretrained language models.

Key Innovations

Distillation

Training a smaller 'student' model to mimic a larger 'teacher' model, preserving capability at lower cost.
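
To make the student/teacher idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It is an illustration, not DistilBERT's actual training code: the function names, the example logits, and the temperature of 2.0 are all assumptions made for this example, and the full DistilBERT objective combines a term like this with other losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's.

    The temperature (an assumed value here) flattens both distributions so the
    student also learns from the teacher's lower-probability classes.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)
```

The loss is lowest when the student's softened distribution matches the teacher's, so minimizing it pushes the student toward the teacher's full output distribution rather than only its top prediction.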

Masked LM

Training by randomly hiding words and having the model predict them: BERT's key innovation for understanding context.
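
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: `mask_tokens`, `MASK_TOKEN`, and the `seed` parameter are hypothetical names introduced here, the 15% default rate is BERT's published masking rate rather than something stated on this page, and every selected token is replaced with `[MASK]`, whereas BERT's actual recipe sometimes keeps the token or swaps in a random one.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string standing in for the hidden word

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a masked-LM training example.

    labels[i] holds the original token where position i was masked,
    and None where the model is not asked to predict anything.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the loss
    return masked, labels

sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored only on the positions where `labels` is not `None`, which forces it to use context on both sides of each hidden word.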

Family Tree

Built On

BERT

Lineage

BERT → DistilBERT

External Links

Research Paper
