Mistral 7B
Mistral AI · September 2023
Why It Matters
Proved that a small, well-engineered model from a European startup could beat much larger competitors, establishing Mistral AI as a major force in open-source AI.
Description
An efficient 7-billion-parameter model that outperformed the much larger LLaMA 2 13B. It uses sliding window attention, a technique in which each token attends only to a fixed-size window of preceding tokens rather than the entire sequence, which reduces attention memory and compute on long inputs while maintaining quality. Released under the permissive Apache 2.0 license.
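As a rough illustration of the sliding window idea described above, the following sketch builds a causal attention mask in which each position can see only itself and the few positions before it. The function names and the tiny sizes are hypothetical choices for readability; this is not Mistral's implementation, only a minimal picture of which query/key pairs a window keeps.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Two conditions are combined:
      - causal:   j <= i          (no looking at future tokens)
      - windowed: i - j < window  (only the most recent `window` tokens)
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len: int, window: int) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))


if __name__ == "__main__":
    # Illustrative sizes only (assumed for the demo, not model settings).
    for row in sliding_window_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
    print("windowed pairs:", attended_pairs(6, 3))
    print("full causal pairs:", attended_pairs(6, 6))
```

With a fixed window, the number of allowed pairs grows roughly linearly with sequence length instead of quadratically, which is where the memory saving comes from.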
Notable Milestones
- Outperformed LLaMA 2 13B despite being nearly half the size
- One of the most widely fine-tuned base models in the open-source community
- Apache 2.0 license enabled unrestricted commercial use
Related Research (5)
Challenged Kaplan's scaling laws by showing that training data should be scaled in equal proportion to parameter count. The 70B Chinchilla outperformed the 280B Gopher.
Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
Applied sliding window attention in a 7B model and demonstrated that it could outperform LLaMA 2 13B on all evaluated benchmarks.
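To make the grouped-query attention entry above more concrete, here is a back-of-the-envelope sketch of how the number of key/value heads affects KV cache size. The helper name and all configuration numbers (32 layers, head dimension 128, 4096 cached tokens, 16-bit values) are illustrative assumptions rather than figures taken from this page.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # assumes 16-bit storage
) -> int:
    """Estimate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Assumed, illustrative configuration.
    common = dict(n_layers=32, head_dim=128, seq_len=4096)

    multi_head = kv_cache_bytes(n_kv_heads=32, **common)  # one KV head per query head
    grouped = kv_cache_bytes(n_kv_heads=8, **common)      # 4 query heads share each KV head
    multi_query = kv_cache_bytes(n_kv_heads=1, **common)  # all query heads share one KV head

    for name, size in [("multi-head", multi_head),
                       ("grouped-query", grouped),
                       ("multi-query", multi_query)]:
        print(f"{name:>14}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is why grouped-query attention sits between the multi-head and multi-query extremes.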