LLaMA

Meta · February 2023

Legacy · Open Weight · Decoder-only · Text

Parameters: 7B - 65B

Context Window: 2K tokens

Variants: 7B, 13B, 33B, 65B

Why It Matters

Democratized AI research by making the weights of capable models available to researchers. Spawned an entire ecosystem of open model development, including Alpaca, Vicuna, and hundreds of community fine-tunes.

Description

Meta's first openly released large language model, available in sizes from 7 billion to 65 billion parameters. Despite being far smaller, LLaMA-13B outperformed the 175B-parameter GPT-3 on most benchmarks by training on many more tokens of publicly available data, showing that the amount and quality of training data can matter more than sheer parameter count.

Notable Milestones

▸ Sparked the open-source LLM movement
▸ Basis for Stanford Alpaca and UC Berkeley Vicuna
▸ Proved smaller well-trained models can beat larger ones

Key Innovations

Open Weight

Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.

Scaling Laws

Mathematical relationships showing how model performance improves predictably with more data, compute, and parameters.

Family Tree

Successors (5)

LLaMA 2, WizardLM, Airoboros, Alpaca, Vicuna

Related Research (6)

Transformer · Transformer

2017 · Google Brain

Introduced the Transformer architecture using self-attention mechanisms, replacing RNNs entirely. Enabled parallel training and superior long-range de…
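
As a rough illustration of the self-attention mechanism mentioned above, the following is a minimal sketch of scaled dot-product attention in plain Python. It assumes the query, key, and value vectors are already given; the learned projection matrices, multiple heads, and causal masking used in real Transformers are left out, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each output is a weighted average of the value vectors, with weights
    given by softmax(q . k / sqrt(d_k)) across all keys.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because every query attends to every key independently, all positions can be computed in parallel, which is the property that let the architecture drop recurrence.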

Scaling Laws (Kaplan) · Scaling

2020 · OpenAI

Found that model performance follows power laws in compute, parameters, and data. Provided the mathematical framework for scaling decisions.
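
To make the power-law idea concrete, here is a small sketch of a loss curve of the form L(N) = (N_c / N)^alpha for parameter count N. The default constants are illustrative placeholders of roughly the order of magnitude discussed in the scaling-law literature, not fitted values, and the function names are hypothetical.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Loss predicted by a pure power law in parameter count.

    n_c and alpha are illustrative placeholder constants (assumption).
    """
    return (n_c / n_params) ** alpha

def loss_ratio(scale_factor, alpha=0.076):
    """Multiplicative change in loss when parameters grow by scale_factor."""
    return scale_factor ** (-alpha)
```

Under a power law, each doubling of parameters multiplies the predicted loss by the same factor, `loss_ratio(2)`, regardless of the starting size. That constant ratio is what makes the relationship usable for planning larger training runs.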

Chinchilla · Scaling

2022 · DeepMind

Challenged Kaplan's scaling laws by showing that training data should scale in equal proportion with parameter count. The 70B Chinchilla outperformed the 280B Gopher.
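
The equal-scaling result can be sketched with two commonly quoted approximations, both of which are assumptions here rather than figures from this page: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a compute-optimal ratio of about 20 tokens per parameter.

```python
import math

TOKENS_PER_PARAM = 20.0        # assumed rule-of-thumb ratio
FLOPS_PER_PARAM_TOKEN = 6.0    # assumed approximation: C ~ 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense Transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops_budget):
    """Split a FLOPs budget into (parameters, tokens) at the assumed ratio.

    With D = r * N and C = 6 * N * D, solving for N gives
    N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what scaling the two equally means in practice. Feeding in the budget of a 70B model trained on 20 tokens per parameter recovers roughly 70B parameters and 1.4T tokens.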

LLaMA · Scaling

2023 · Meta AI

Showed that smaller models trained on significantly more data (following Chinchilla scaling laws) could match or exceed the performance of much larger…

RoPE · Architecture

2021 · Zhuiyi Technology

Introduced rotary position embeddings that encode position via rotation matrices, enabling better length generalization. Used by virtually every moder…
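
A minimal sketch of the rotation idea, in plain Python: each adjacent pair of vector components is rotated by an angle proportional to the token position. The base of 10000 and the adjacent-pair layout are assumptions for illustration (implementations differ in how they pair dimensions), and the function names are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector.

    The pair (vec[i], vec[i + 1]) is rotated by position * base**(-i / d),
    so lower dimensions rotate faster than higher ones.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Since rotations preserve lengths and compose by adding angles, the dot product of a rotated query and a rotated key depends only on the difference between their positions. For example, `dot(rope(q, 5), rope(k, 3))` equals `dot(rope(q, 12), rope(k, 10))` up to floating-point error.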

SwiGLU · Architecture

2020 · Google

Showed that SwiGLU activation (Swish + Gated Linear Unit) significantly improves Transformer FFN quality with minimal compute overhead.
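
As a sketch of how the two pieces combine, the feed-forward block below gates one linear projection of the input with the Swish of another, then projects back down. It is written in plain Python with list-of-rows matrices; the weight names `w_gate`, `w_up`, and `w_down` are hypothetical, and bias terms are omitted as an assumption.

```python
import math

def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block with a SwiGLU activation.

    hidden = swish(w_gate @ x) * (w_up @ x), elementwise
    output = w_down @ hidden
    """
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Compared with a standard two-matrix FFN, the gate adds a third weight matrix, so implementations typically shrink the hidden width to keep the parameter count comparable.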

External Links

Research Paper · Announcement

More from Meta LLaMA

LLaMA 2 · 2023-07 · 7B - 70B

LLaMA 3 · 2024-04 · 8B / 70B

LLaMA 3.1 · 2024-07 · 8B / 70B / 405B

LLaMA 3.2 · 2024-09 · 1B / 3B / 11B / 90B

LLaMA 3.3 · 2024-12 · 70B

LLaMA 4 · 2025-04 · 17B active (Scout) / larger (Maverick)

MusicGen · 2023-06 · 3.3B

CodeLlama · 2023-08 · 7B - 70B
