RT-2

Google DeepMind · July 2023

Active · Closed · Multimodal

Why It Matters

Among the first models to demonstrate that large vision-language models can directly control robots. It transfers web-scale knowledge into physical actions and generalizes to objects and instructions that do not appear in its robot training data.

Description

Robotics Transformer 2 (RT-2) is Google DeepMind's vision-language-action model, bridging language understanding and physical robot actions. It represents robot actions as text tokens, so the output of the vision-language model can be decoded directly into motor commands. This lets robots follow natural language instructions and reason about their physical environment.
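As a rough illustration of how a model's text output can become a motor command, the sketch below decodes a string of integer action tokens into continuous values. The function name `detokenize_action`, the seven action dimensions, their value ranges, and the 256-bin discretization are assumptions made for this example, not RT-2's actual interface or configuration.

```python
NUM_BINS = 256  # assumed number of uniform bins per action dimension

# Hypothetical action dimensions with illustrative (low, high) ranges.
# Real ranges depend on the robot and on how the training data was normalized.
ACTION_RANGES = {
    "dx": (-0.1, 0.1),        # end-effector translation, metres
    "dy": (-0.1, 0.1),
    "dz": (-0.1, 0.1),
    "droll": (-0.5, 0.5),     # end-effector rotation, radians
    "dpitch": (-0.5, 0.5),
    "dyaw": (-0.5, 0.5),
    "gripper": (0.0, 1.0),    # 0 = closed, 1 = open
}


def detokenize_action(text: str) -> dict[str, float]:
    """Map a space-separated string of bin indices to continuous action values."""
    tokens = text.split()
    if len(tokens) != len(ACTION_RANGES):
        raise ValueError(
            f"expected {len(ACTION_RANGES)} tokens, got {len(tokens)}"
        )

    action = {}
    for (name, (low, high)), token in zip(ACTION_RANGES.items(), tokens):
        bin_index = int(token)
        if not 0 <= bin_index < NUM_BINS:
            raise ValueError(f"bin index {bin_index} out of range for {name}")
        # Linearly rescale the bin index from [0, NUM_BINS - 1] to [low, high].
        action[name] = low + (high - low) * bin_index / (NUM_BINS - 1)
    return action


if __name__ == "__main__":
    # Example: a model output string containing seven action tokens.
    print(detokenize_action("128 91 241 5 101 127 255"))
```

A controller would call a function like this once per control step on the model's generated text and pass the resulting values to the robot's low-level interface.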

Key Innovations

Multimodal: processing multiple types of input (text, images, audio, video) in a single model.

robotics

vision-language-action

Family Tree

Built On

PaLI-X, PaLM-E

Lineage

PaLM → PaLM-E → RT-2

External Links

Research Paper

More from Robotics / Embodied

PaLM-E · 2023-03 · 562B
