Tülu 3

Allen Institute for AI · November 2024

● activeOpen Sourcedecoder onlytext

Parameters8B - 70B

Context Window128K tokens

Variants8B, 70B

Why It Matters

Allen AI's instruction-tuned model that proved transparent post-training (DPO + PPO) could match proprietary RLHF quality.

Description

Allen AI's instruction-tuned model family, available in 8B and 70B parameter sizes with a 128K token context window. Fine-tuned using transparent post-training techniques including DPO (Direct Preference Optimization — a method that teaches the model human preferences without needing a separate reward model) and PPO (Proximal Policy Optimization — a reinforcement learning method that gradually improves the model's responses based on human feedback).

Key Innovations

Instruction Tuning

Instruction TuningFine-tuning a model on instruction-response pairs so it follows user commands more reliably.

RLHF

RLHFReinforcement Learning from Human Feedback — training models to align with human preferences by having humans rank outputs.

Family Tree

Built On

OLMo 2

Lineage

OLMo→OLMo 2→Tülu 3

Related Research (1)

DPOAlignment

2023 · Stanford University

Showed that preference learning could be formulated as a simple classification problem on pairs of outputs, eliminating the need for a separate reward…

More from Allen AI

OLMo2024-02 · 7B

OLMo 22024-11 · 7B - 13B

Molmo2024-09 · 7B - 72B

PreviousOLMo 2