Falcon 40B
TII (UAE) · May 2023
legacy · Open Source · decoder only · text
Parameters: 40B
Context Window: 2K tokens
Why It Matters
The first model from outside the US and China to top the Hugging Face Open LLM Leaderboard, suggesting that high-quality training data (RefinedWeb) can matter more than sheer model size.
Description
A 40-billion-parameter model from the UAE's Technology Innovation Institute, trained on 1 trillion tokens drawn largely from RefinedWeb, a large web-text dataset that was automatically filtered and deduplicated for quality. Its weights are available under the permissive Apache 2.0 license. It topped the Hugging Face Open LLM Leaderboard upon release, making it the highest-ranked open model on that leaderboard at the time.
Notable Milestones
- Topped the Hugging Face Open LLM Leaderboard on release
- Pioneered the RefinedWeb dataset approach to web data curation
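
To make the idea of automatic web-data curation concrete, here is a minimal toy sketch of two common ingredients: a heuristic quality filter and exact deduplication. It is not TII's actual RefinedWeb pipeline; the function names, thresholds, and heuristics are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def looks_like_quality_text(doc: str, min_words: int = 20,
                            max_symbol_ratio: float = 0.1) -> bool:
    """Hypothetical heuristic: keep documents that are long enough
    and consist mostly of letters, digits, and whitespace."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio


def normalize(doc: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", doc).strip().lower()


def curate(docs):
    """Apply the quality filter, then drop exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_like_quality_text(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Calling `curate` on a list of raw page texts returns the subset that passes the filter, with the first copy of each duplicate kept. Real pipelines typically add further stages, such as language identification and fuzzy deduplication, which this sketch omits.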
Key Innovations
Open Weight
Model weights are publicly released, but training data and code may not be. This enables fine-tuning but not full reproduction.