Mixtral 8x7B

Mistral AI · December 2023

active · Open Source · mixture of experts · text
Parameters: 46.7B total (12.9B active)
Context Window: 32K tokens

Why It Matters

Showed that a sparse Mixture-of-Experts architecture could match GPT-3.5 quality while activating only 12.9B of its 46.7B parameters per token, making GPT-3.5-class models practical to self-host, including on high-end consumer hardware when quantized.

Description

An open-weight Mixture-of-Experts (MoE) model in which every layer contains 8 feed-forward sub-networks ('experts'), for 46.7B total parameters. A router activates only 2 experts per token at each layer, so about 12.9B parameters take part in processing any given token. Per-token compute is therefore comparable to a 13B dense model, although all 46.7B parameters must still be held in memory. Quality reaches GPT-3.5 level, which shows that sparse architecture can substitute for raw dense compute.
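The 46.7B and 12.9B figures can be reproduced with a back-of-the-envelope count. The sketch below assumes the hyperparameters published for Mixtral 8x7B (model dimension 4096, 32 layers, 32 query heads, 8 key/value heads, head dimension 128, expert inner dimension 14336, vocabulary 32000), none of which appear in this card. The function name `param_count` is illustrative only.

```python
# Hyperparameters assumed from the published Mixtral 8x7B configuration
# (they are not stated in this card).
DIM = 4096        # model (hidden) dimension
N_LAYERS = 32
N_HEADS = 32      # query heads
N_KV_HEADS = 8    # key/value heads (grouped-query attention)
HEAD_DIM = 128
FFN_DIM = 14336   # inner dimension of each expert
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2         # experts activated per token, per layer


def param_count(experts_counted: int) -> int:
    """Count parameters when `experts_counted` experts per layer are included."""
    # Attention: Wq and Wo are DIM x (heads * head_dim); Wk and Wv use the KV heads.
    attn = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    expert = 3 * DIM * FFN_DIM   # SwiGLU expert: gate, up and down projections
    router = DIM * N_EXPERTS     # one routing score per expert
    norms = 2 * DIM              # two normalization weight vectors per layer
    per_layer = attn + experts_counted * expert + router + norms
    embeddings = 2 * VOCAB * DIM  # input embedding plus untied output head
    final_norm = DIM
    return N_LAYERS * per_layer + embeddings + final_norm


total = param_count(N_EXPERTS)   # every expert is stored in memory
active = param_count(TOP_K)      # only 2 experts per layer touch a given token
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Attention, embeddings and routers are shared by every token, so only the expert term shrinks when 2 of 8 experts are used. That is why the active count is well above one quarter of the total.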

Notable Milestones

  • Matched or exceeded GPT-3.5 Turbo on most standard benchmarks as an open-weight model
  • Demonstrated MoE efficiency: the per-token compute of a 13B dense model with the capacity of a 47B model
  • Widely adopted for self-hosted enterprise deployments

Benchmark Scores

MMLU (Massive Multitask Language Understanding, 57 subjects): 70.6%
HumanEval (code generation pass@1, Python problems): 40.2%

Key Innovations

MoE: Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
Open Weight: Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.
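To make the MoE entry above concrete, here is a minimal sketch of top-2 routing for a single token. It assumes the gating scheme described for Mixtral (a softmax taken over the two highest router logits); the function `top2_moe`, the toy router weights and the toy experts are all hypothetical and stand in for learned components.

```python
import math


def top2_moe(x, router_weights, experts):
    """Route one token vector through its 2 highest-scoring experts."""
    # Router logits: one score per expert (dot product of x with a router row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the two largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two gates sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    gates = [e / sum(exps) for e in exps]
    # Only the chosen experts are evaluated; the others cost no compute.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, gates


# Toy setup (hypothetical): 4 experts, each scaling its input by a different factor.
toy_experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(4)]
toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
out, chosen, gates = top2_moe([1.0, 2.0], toy_router, toy_experts)
print(chosen, gates, out)
```

In a real model the experts are feed-forward networks and routing happens independently at every layer, so a token can visit different experts as it moves through the network.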

Family Tree

Built On

Lineage

Mistral 7B → Mixtral 8x7B

Related Research (7)

Sparse MoE · Scaling
2017 · Google

Introduced sparsely-gated Mixture-of-Experts layers for scaling model capacity without proportional compute increase.

2021 · Google

Simplified MoE routing to scale to trillions of parameters efficiently. Influenced Mixtral and other later MoE models.

GShard · Architecture
2020 · Google

Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …

RoPE · Architecture
2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…

2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…

SwiGLU · Architecture
2020 · Google

Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
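As a rough illustration of the SwiGLU entry above, the sketch below implements a gated feed-forward block in plain Python: one projection is passed through Swish (with beta = 1, also known as SiLU) and multiplied element-wise with a second projection before the down projection. The weight names and toy values are hypothetical.

```python
import math


def silu(z):
    """Swish with beta = 1 (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


# Toy weights (hypothetical): model dimension 2, hidden dimension 3.
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = swiglu_ffn([1.0, -1.0], w_gate, w_up, w_down)
print(y)
```

The block uses three weight matrices instead of the two in a classic feed-forward layer.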

Mistral 7B · Scaling
2023 · Mistral AI

Applied sliding window attention to a 7B model and showed that it could outperform Llama 2 13B on all benchmarks its authors evaluated.
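The sliding window attention mentioned in the Mistral 7B entry can be pictured as a causal mask with a limited look-back. The sketch below is illustrative only: the function name is hypothetical, and it assumes a window that counts the current token plus the preceding `window - 1` tokens, a boundary convention that may differ between implementations.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and limited to the most recent `window` positions.
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed key position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Because each row has at most `window` allowed positions, attention cost per token stays constant as the sequence grows, while stacked layers still let information travel further than one window.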