RT-2
Google DeepMind · July 2023
● Active · Closed · Multimodal
Why It Matters
Among the first models to demonstrate that a large vision-language model can directly control a robot, transferring web-scale knowledge to physical actions on objects and instructions that were absent from its robot training data.
Description
Robotics Transformer 2 (RT-2) is Google DeepMind's vision-language-action model, which bridges language understanding and physical robot actions. It converts vision-language model outputs directly into robot motor commands, enabling robots to follow natural language instructions and reason about their physical environment.
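To make the "model outputs into motor commands" step concrete, here is a minimal sketch of one way text output could be mapped to continuous command values. It is an illustration under stated assumptions, not RT-2's actual code: the function name detokenize_action, the 256-bin discretization, the [-1, 1] range, and the space-separated integer format are all assumptions introduced for this example.

```python
from typing import List

NUM_BINS = 256  # assumed number of discretization bins per action dimension


def detokenize_action(token_text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map a space-separated string of integer bin indices to continuous values.

    Hypothetical helper: the token format and value range are assumptions.
    """
    values = []
    for tok in token_text.split():
        idx = int(tok)
        if not 0 <= idx < NUM_BINS:
            raise ValueError(f"bin index out of range: {idx}")
        # Uniformly rescale the bin index into the interval [low, high].
        values.append(low + (high - low) * idx / (NUM_BINS - 1))
    return values


# Example: three action dimensions emitted by the model as plain text.
command = detokenize_action("0 255 51")
```

In this sketch each integer names a bin, and the bin index is rescaled linearly into a continuous range that a robot controller could consume. A real system would also need per-dimension ranges and safety limits.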
Key Innovations
Multimodal: Processing multiple types of input (text, images, audio, video) in a single model.
robotics
vision-language-action