Gemma 3
Google DeepMind · March 2025
● active · Open Source · decoder only · multimodal
Parameters: 1B / 4B / 12B / 27B
Context Window: 128K tokens
Variants: 1B, 4B, 12B, 27B
Why It Matters
Brought multimodal capabilities to the open-weight world at accessible sizes. Proved that small, downloadable models could understand images and handle very long documents.
Description
The first multimodal Gemma release: it understands images as well as text. It features a 128K-token context window (roughly 96,000 words) and supports over 140 languages. Available in 1B, 4B, 12B, and 27B sizes, all designed to run efficiently on a single GPU or TPU. The 1B model is the exception on two counts: it is text-only and has a 32K-token context window. Shipped alongside ShieldGemma 2 for content safety filtering.
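The "roughly 96,000 words" figure follows from a common rule of thumb of about 0.75 English words per token. That ratio is an assumption rather than a property of the Gemma tokenizer, and it varies by language and text type. A minimal sketch of the arithmetic, with `approx_words` as a hypothetical helper name:

```python
def approx_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Estimate a word count from a token budget.

    The 0.75 default is an assumed average for English prose,
    not a measured value for any specific tokenizer.
    """
    return round(tokens * words_per_token)


if __name__ == "__main__":
    context_tokens = 128_000  # "128K" read as 128,000 tokens
    print(approx_words(context_tokens))
```

Text in other languages, or code, typically yields fewer words per token, so the real capacity of the window depends on the input.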
Notable Milestones
- Supports 140+ languages for multilingual applications
- ShieldGemma 2 safety classifier included for responsible deployment
Benchmark Scores
MATH (competition-level problems): 69.0%
GPQA (graduate-level science QA): 42.4%
Key Innovations
Multimodal
Processing multiple types of input (text, images, audio, video) in a single model.
Open Weight
Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.
Long Context
Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.
Related Research (1)
Grouped-Query Attention · Architecture
2023 · Google Research
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, reducing KV cache memory while maintaining quality…
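To see why sharing key/value heads matters at long context lengths, the KV cache size can be estimated as 2 × layers × KV heads × head dimension × sequence length × bytes per value. The sketch below compares multi-head, grouped-query, and multi-query attention for a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 16-bit values). These numbers are illustrative assumptions, not Gemma 3's actual architecture.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 16-bit cache entries (assumed)
) -> int:
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing keys and values separately.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    layers, query_heads, head_dim, seq_len = 32, 32, 128, 128_000

    schemes = {
        "multi-head (32 KV heads)": query_heads,
        "grouped-query (8 KV heads)": 8,
        "multi-query (1 KV head)": 1,
    }
    for name, kv_heads in schemes.items():
        gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
        print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, so going from 32 to 8 KV heads cuts the cache to a quarter of its multi-head size.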