Tülu 3
Allen Institute for AI · November 2024
Why It Matters
Allen AI's instruction-tuned model family, which showed that a fully transparent post-training recipe (DPO plus PPO) could match the quality of proprietary RLHF pipelines.
Description
A family of instruction-tuned models available in 8B and 70B parameter sizes with a 128K-token context window. The models are fine-tuned with openly documented post-training techniques, including DPO (Direct Preference Optimization, a method that teaches the model human preferences without needing a separate reward model) and PPO (Proximal Policy Optimization, a reinforcement learning method that improves the model's responses in small, bounded steps using a reward signal such as human feedback).
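To make the two objectives named above concrete, here is a minimal sketch of the per-example DPO loss and the PPO clipped objective in plain Python. It is an illustration of the general techniques, not the Tülu 3 training code: the function names, the scalar log-probability inputs, and the defaults beta=0.1 and clip_eps=0.2 are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference
    model. beta=0.1 is an illustrative value, not a Tülu 3 setting.
    """
    # Log-ratio of policy to reference acts as an implicit reward,
    # so no separate reward model is needed.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one action (to be maximized).

    clip_eps=0.2 is an illustrative value, not a Tülu 3 setting.
    """
    # Probability ratio between the updated policy and the old policy.
    ratio = math.exp(new_logp - old_logp)
    # Clipping the ratio keeps each policy update small.
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy favors the chosen response
    # more strongly than the reference does, so the loss drops below log(2).
    print(dpo_loss(-10.0, -14.0, -11.0, -12.0))
    # A large ratio with positive advantage is capped at 1 + clip_eps.
    print(ppo_clipped_objective(-1.0, -2.0, advantage=1.0))
```

When the policy and reference assign identical log-probabilities, the DPO margin is zero and the loss equals log(2), the value of a binary classifier that cannot yet tell the two responses apart. The loss falls as the policy prefers the chosen response more strongly than the reference does.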
Key Innovations
Related Research
Showed that preference learning could be formulated as a simple classification problem on pairs of outputs, eliminating the need for a separate reward…