Mixtral 8x7B

Mistral AI · December 2023

Active · Open Source · Mixture of Experts · Text

Parameters: 46.7B total (12.9B active)

Context Window: 32K tokens

Why It Matters

Showed that a Mixture-of-Experts architecture could match GPT-3.5 quality at a fraction of the per-token compute cost, bringing that level of quality within reach of self-hosted and high-end consumer hardware.

Description

An open-source Mixture-of-Experts (MoE) model in which each layer contains 8 feed-forward sub-networks ('experts'), for 46.7B total parameters, while a router activates only 2 experts (12.9B parameters) for each token. Inference therefore needs roughly the per-token compute of a 13B dense model while delivering GPT-3.5-level quality, although all 46.7B parameters must still be held in memory. The result showed that clever architecture can substitute for raw size.
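The two parameter counts quoted above are enough to back out a rough split between shared and per-expert weights. The sketch below assumes (this is not stated in the source) that all eight experts are the same size and that every non-expert parameter is active for every token; the variable names are illustrative.

```python
# Figures quoted in the description, in billions of parameters.
TOTAL_PARAMS = 46.7
ACTIVE_PARAMS = 12.9
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2

# Assumed model: total  = shared + 8 * per_expert
#                active = shared + 2 * per_expert
per_expert = (TOTAL_PARAMS - ACTIVE_PARAMS) / (NUM_EXPERTS - ACTIVE_EXPERTS)
shared = ACTIVE_PARAMS - ACTIVE_EXPERTS * per_expert
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"per expert (summed over layers): {per_expert:.2f}B")
print(f"shared (attention, embeddings, ...): {shared:.2f}B")
print(f"active fraction per token: {active_fraction:.1%}")
```

Under these assumptions each expert accounts for about 5.6B parameters and about 1.6B are shared, which is why the total is well below 8 x 7B = 56B.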

Notable Milestones

▸ Matched GPT-3.5 Turbo quality as a fully open model
▸ Demonstrated MoE efficiency: 13B-speed with 47B-quality
▸ Widely adopted for self-hosted enterprise deployments

Benchmark Scores

MMLU: Massive Multitask Language Understanding, 57 subjects

70.6%

HumanEval: code generation pass@1, Python problems

40.2%

Key Innovations

MoE

Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
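A minimal sketch of sparse routing may make this concrete. It uses one common formulation: pick the k highest router scores, apply a softmax over just those scores, and sum the chosen experts' outputs with those weights. The toy experts and the router scores below are hypothetical; in a real model the router is a learned linear layer and each expert is a feed-forward block.

```python
import math

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy stand-ins for 8 experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
# Hypothetical router scores for a single token.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits)
output = moe_layer([1.0, -2.0], logits, experts)
print(routes)   # experts 1 and 4 tie, so each gets weight 0.5
print(output)
```

Only two of the eight toy experts are evaluated for this token, which is where the compute saving comes from; the remaining six still have to exist in memory.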

Open Weight

Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.

Family Tree

Built On

Mistral 7B

Lineage

Mistral 7B → Mixtral 8x7B

Successors (2)

Mistral Large 2 · Nous Hermes 2

Related Research (7)

Sparse MoE · Scaling

2017 · Google

Introduced sparsely-gated Mixture-of-Experts layers for scaling model capacity without proportional compute increase.

Switch Transformers · Scaling

2021 · Google

Simplified MoE routing to a single expert per token, scaling efficiently to over a trillion parameters. Influenced Mixtral and other later MoE language models.

GShard · Architecture

2020 · Google

Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …

RoPE · Architecture

2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
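The rotation idea can be sketched in a few lines of plain Python. The version below pairs adjacent dimensions and uses a frequency base of 10000; both are assumptions made for illustration, and the function names are hypothetical. The point it demonstrates is that the attention score between a rotated query and a rotated key depends only on their relative offset.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 7), rope(k, 4))
score_far = dot(rope(q, 107), rope(k, 104))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is rotated rather than shifted, vector norms are preserved and no learned position table is needed.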

Grouped-Query Attention · Architecture

2023 · Google Research

Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
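The memory effect is easy to estimate with back-of-the-envelope arithmetic. The sketch below uses an illustrative configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128, 16-bit values, 32K-token sequence); these numbers are assumptions chosen for the example, not figures from this page, and the helper names are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key cache plus the value cache for one sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head shared by this query head's group."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Illustrative configuration (assumed, not taken from the source).
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 8, 128, 32_768

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)   # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)  # 4 query heads share each KV head
mapping = [kv_head_for(h, Q_HEADS, KV_HEADS) for h in range(Q_HEADS)]

print(f"multi-head cache:    {mha / 2**30:.0f} GiB")
print(f"grouped-query cache: {gqa / 2**30:.0f} GiB")
print(mapping[:8])
```

With these assumed numbers the cache shrinks from 16 GiB to 4 GiB per sequence, a factor equal to the group size. Multi-query attention is the limiting case with a single KV head.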

SwiGLU · Architecture

2020 · Google

Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
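As a rough illustration, a SwiGLU feed-forward block multiplies a Swish-activated "gate" projection elementwise with a second linear projection before projecting back down. The sketch below omits biases and uses tiny hand-written weights; the names `w_gate`, `w_up`, and `w_down` are hypothetical labels, not identifiers from any particular library.

```python
import math

def silu(x):
    """Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """FFN(x) = W_down (silu(W_gate x) * (W_up x)), biases omitted."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)

# Tiny hypothetical weights: model dimension 2, hidden dimension 3.
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

y = swiglu_ffn([1.0, -1.0], w_gate, w_up, w_down)
print(y)
```

Compared with a plain two-matrix FFN, this variant needs a third weight matrix, so implementations typically shrink the hidden dimension to keep the parameter count comparable.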

Mistral 7B · Scaling

2023 · Mistral AI

Used sliding window attention together with grouped-query attention and demonstrated that a 7B model could outperform LLaMA 2 13B on all evaluated benchmarks.
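Sliding window attention can be pictured as a banded causal mask. The sketch below builds one under an assumed convention (each token attends to itself and at most `window - 1` earlier tokens); the exact boundary convention varies between implementations, and the function name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Convention assumed here: causal, and each token sees itself plus at most
    (window - 1) earlier tokens.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost per token is bounded by the window size rather than the full sequence length. Information from older tokens can still propagate indirectly through stacked layers.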

External Links

Research Paper · Announcement

More from Mistral AI

Mistral 7B · 2023-09 · 7B

Mistral Large 2 · 2024-07 · 123B

Mistral Small 4 · 2026-03 · n/a

Mistral Medium 3.5 · 2026-03 · n/a

Codestral · 2024-05 · 22B

Pixtral Large · 2024-11 · 124B
