Mistral 7B

Mistral AI · September 2023

activeOpen Sourcedecoder onlytext
Parameters7B
Context Window32K tokens

Why It Matters

Proved that a small, well-engineered model from a European startup could beat much larger competitors, establishing Mistral AI as a major force in open-source AI.

Description

A remarkably efficient 7 billion parameter model that outperformed the much larger LLaMA 2 13B. Uses sliding window attention — a technique that limits each word to only attending to nearby words rather than the entire text, dramatically reducing memory usage while maintaining quality. Released under the permissive Apache 2.0 license.

Notable Milestones

  • Outperformed LLaMA 2 13B despite being nearly half the size
  • One of the most fine-tuned base models in the open-source community
  • Apache 2.0 license enabled unrestricted commercial use

Key Innovations

Open Weight
Open WeightModel weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.

Related Research (5)

ChinchillaScaling
2022 · DeepMind

Challenged Kaplan's scaling laws by showing data should scale equally to parameters. 70B Chinchilla outperformed 280B Gopher.

RoPEArchitecture
2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…

2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…

SwiGLUArchitecture
2020 · Google

Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.

Mistral 7BScaling
2023 · Mistral AI

Introduced sliding window attention and demonstrated that a 7B model could outperform LLaMA 2 13B on all benchmarks.