RT-2

Google DeepMind · July 2023

activeClosedmultimodal

Why It Matters

First model to demonstrate that large vision-language models could directly control robots, translating web-scale knowledge into physical actions without task-specific training.

Description

Google DeepMind's Robotics Transformer 2 that bridges language understanding and physical robot actions. Converts vision-language model outputs directly into robot motor commands, enabling robots to follow natural language instructions and reason about their physical environment.

Key Innovations

Multimodal
MultimodalProcessing multiple types of input (text, images, audio, video) in a single model.
robotics
vision-language-action

Family Tree

Built On

Lineage

PaLMPaLM 2RT-2

External Links

More from Robotics / Embodied