In context: Sci-fi stories often depict robots capable of reliable interactions with humans, and Google is working to bring this futuristic dream a bit closer to reality. Mountain View engineers have developed a new AI model that helps robots understand and execute human-safe actions.

Google describes Robotics Transformer 2, or RT-2 for short, as a vision-language-action (VLA) model. The new AI model was trained on text and images collected from the web, allowing it to generate "robotic actions." In contrast, generative AI-based chatbots are designed to produce text snippets that develop ideas and concepts.
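To make the idea of a model "generating robotic actions" more concrete, here is a minimal Python sketch of one way a robot command could be written out as a short string of integers, much like text, and then decoded back into numbers. The article does not describe RT-2's actual output format, so everything here is an assumption for illustration: the 256-bin resolution, the -1 to 1 value range, and the function names `action_to_tokens` and `tokens_to_action` are all hypothetical.

```python
from typing import List

# Assumed resolution: each action dimension is mapped to one of 256 integer bins.
NUM_BINS = 256


def action_to_tokens(action: List[float], low: float = -1.0, high: float = 1.0) -> str:
    """Encode continuous action values as a space-separated string of bin indices."""
    tokens = []
    for value in action:
        # Clamp to the assumed valid range before discretizing.
        clipped = min(max(value, low), high)
        # Map [low, high] linearly onto the integer range [0, NUM_BINS - 1].
        bin_index = int((clipped - low) / (high - low) * (NUM_BINS - 1) + 0.5)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def tokens_to_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Decode a string of bin indices back into approximate continuous values."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-value command, e.g. a small arm displacement.
    command = [0.3, -0.7, 1.0]
    encoded = action_to_tokens(command)
    print(encoded)
    print(tokens_to_action(encoded))
```

The decoded values only approximate the originals, because each one is rounded to the nearest bin. The appeal of a scheme like this is that a model built to emit text can emit motion commands through the same output channel.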

Google's DeepMind team developed RT-2 to transfer web knowledge to robotic control. Unlike chatbots, robots require real-world grounding to be helpful to humans. Google acknowledges that achieving this has always been a herculean effort, as robots must handle complex, abstract tasks in highly variable and unknown environments.

Training models like RT-2 is a significantly more complex undertaking than training large language models (LLMs) for chatbots. According to Google, a robot's knowledge must extend beyond simply knowing about an apple. It needs to recognize an apple within a context, differentiate it from a red ball, understand how to pick it up, and handle various related tasks.

Historically, training practical "real-life" robots has demanded billions of data points regarding the physical world. However, RT-2 introduces a new, more efficient approach. Building on RT-1's ability to generalize information across systems, RT-2 is a single model capable of "complex reasoning" with only a small amount of robot training data. This lighter approach marks a notable advance in robot training methods.

Google claims that RT-2 can transfer knowledge from a vast corpus of web data and handle complex situations and human requests, such as disposing of a "piece of trash." The AI comprehends the concept of "trash" and knows how to dispose of it, even without explicit programming for that specific action. This ability showcases the model's capacity for learning and performing tasks beyond its initial training.

Google engineers conducted over 6,000 "robotic trials" of the RT-2 model. In tasks based on the data used for training, the model performed on par with the previous-generation RT-1 model. However, RT-2 performed significantly better in novel, unknown scenarios, nearly doubling RT-1's 32 percent completion rate to 62 percent. This improved adaptability to unfamiliar situations is a substantial step forward for the technology.
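For readers who want to check the size of that jump, the short Python sketch below works through the arithmetic using only the two completion rates reported above. The variable names are illustrative and not taken from Google's material.

```python
# Completion rates on novel scenarios, in percent, as reported in the article.
rt1_rate = 32
rt2_rate = 62

# Absolute gain in percentage points.
point_gain = rt2_rate - rt1_rate

# Relative improvement: how many times higher RT-2's rate is than RT-1's.
ratio = rt2_rate / rt1_rate

print(f"Gain: {point_gain} percentage points")
print(f"RT-2 completes {ratio:.2f}x as many novel tasks as RT-1")
```

The gain is 30 percentage points, and the ratio comes out just under 2, so the improvement is slightly less than a full doubling.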

According to Google, RT-2 exemplifies how advancements in generative AI and LLM technology are rapidly influencing robotics, offering great potential for more practical and versatile general-purpose robots. While acknowledging that there is still much work to be done, the DeepMind team is optimistic about the path ahead.