One of the big challenges in robotics is the effort required to train machine learning models for each robot, task, and environment. Now, a new project by Google DeepMind and 33 other research institutions aims to address this challenge by creating a general-purpose AI system that can work with different types of physical robots and perform many tasks.
“What we have observed is that robots are great specialists, but poor generalists,” Pannag Sanketi, Senior Staff Software Engineer at Google Robotics, told VentureBeat. “Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch.”
To overcome this and make it far easier and faster to train and deploy robots, the new project, dubbed Open X-Embodiment, introduces two key components: a dataset containing data on multiple robot types and a family of models capable of transferring skills across a wide range of tasks. The researchers put the models to the test in robotics labs and on different types of robots, achieving better results than the commonly used methods for training robots.
Combining robotics data
Typically, every distinct type of robot, with its unique set of sensors and actuators, requires a specialized software model, much like how the brain and nervous system of each living organism have evolved to become attuned to that organism’s body and environment.
The Open X-Embodiment project was born out of the intuition that combining data from diverse robots and tasks could create a generalized model superior to specialized models, applicable to all kinds of robots. This concept was partly inspired by large language models (LLMs), which, when trained on large, general datasets, can match or even outperform smaller models trained on narrow, task-specific datasets. Surprisingly, the researchers found that the same principle applies to robotics.
To create the Open X-Embodiment dataset, the research team collected data from 22 robot embodiments at 20 institutions from various countries. The dataset includes examples of more than 500 skills and 150,000 tasks across over 1 million episodes (an episode is a sequence of actions that a robot takes each time it tries to accomplish a task).
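To make the idea of an episode concrete, here is a minimal Python sketch of how pooled data from several robots might be represented. The class names, field names, and robot labels are assumptions made up for illustration; they are not the actual schema of the Open X-Embodiment dataset.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One timestep: what the robot sensed and what it did (hypothetical fields)."""
    observation: dict
    action: list


@dataclass
class Episode:
    """One attempt at a task by one robot embodiment."""
    embodiment: str   # robot type; the labels used below are illustrative
    task: str         # natural-language task description
    steps: list = field(default_factory=list)


def summarize(episodes):
    """Count episodes per embodiment and distinct tasks in a pooled dataset."""
    per_embodiment = Counter(ep.embodiment for ep in episodes)
    tasks = {ep.task for ep in episodes}
    return per_embodiment, len(tasks)


# Toy pooled dataset: two made-up robots contributing episodes to one collection.
pooled = [
    Episode("arm_a", "pick up the apple", [Step({"image": None}, [0.1, 0.0, -0.2])]),
    Episode("arm_a", "open the drawer", [Step({"image": None}, [0.0, 0.3, 0.0])]),
    Episode("arm_b", "pick up the apple", [Step({"image": None}, [0.2, 0.1, 0.0])]),
]

per_embodiment, n_tasks = summarize(pooled)
print(dict(per_embodiment), n_tasks)
```

The point of the sketch is that episodes from different robots share one container type, so a single model can in principle be trained on all of them at once.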
The accompanying models are based on the transformer, the deep learning architecture also used in large language models. RT-1-X is built on top of Robotics Transformer 1 (RT-1), a multi-task model for real-world robotics at scale. RT-2-X is built on RT-1’s successor RT-2, a vision-language-action (VLA) model that has learned from both robotics and web data and can respond to natural language commands.
The researchers tested RT-1-X on various tasks in five different research labs on five commonly used robots. Compared to specialized models developed for each robot, RT-1-X had a 50% higher success rate at tasks such as picking and moving objects and opening doors. Unlike specialized models, which are suited to a specific visual setting, the model was also able to generalize its skills to different environments. This suggests that a model trained on a diverse set of examples can outperform specialist models in most tasks. According to the paper, the model can be applied to a wide range of robots, from robot arms to quadrupeds.
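A figure like "50% higher success rate" is a relative improvement, which is easy to confuse with a gain in percentage points. The short Python sketch below shows the arithmetic. The trial outcomes are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of trials that succeeded; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def relative_improvement(baseline_rate, new_rate):
    """Relative change of new_rate over baseline_rate (0.5 means 50% higher)."""
    return (new_rate - baseline_rate) / baseline_rate


# Made-up trial outcomes for illustration only, not results from the paper.
specialist_trials = [True] * 4 + [False] * 6   # 4 of 10 trials succeed
generalist_trials = [True] * 6 + [False] * 4   # 6 of 10 trials succeed

baseline = success_rate(specialist_trials)
new = success_rate(generalist_trials)
gain = relative_improvement(baseline, new)
print(f"{baseline:.0%} -> {new:.0%}, relative improvement {gain:.0%}")
```

In this toy case, going from 4 successes in 10 trials to 6 in 10 is a 50% relative improvement but only a 20-point gain in absolute success rate.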
“For anyone who has done robotics research you’ll know how remarkable this is: such models ‘never’ work on the first try, but this one did,” writes Sergey Levine, associate professor at UC Berkeley and co-author of the paper.
RT-2-X was three times more successful than RT-2 on emergent skills, novel tasks that were not included in the training dataset. In particular, RT-2-X showed better performance on tasks that require spatial understanding, such as telling the difference between moving an apple near a cloth and placing it on the cloth.
“Our results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks,” the researchers write in a blog post that announces Open X and RT-X.
Taking future steps for robotics research
Looking ahead, the scientists are considering research directions that could combine these advances with insights from RoboCat, a self-improving model developed by DeepMind. RoboCat learns to perform a variety of tasks across different robotic arms and then automatically generates new training data to improve its performance.
Another potential direction, according to Sanketi, could be to further investigate how different dataset mixtures might affect cross-embodiment generalization and how the improved generalization materializes.
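One way to picture a "dataset mixture" is as a set of sampling weights over the per-robot datasets. The following Python sketch compares two such mixtures. It is an illustration under assumed names and sizes, not the procedure the researchers used; `sample_mixture`, the robot labels, and the dataset sizes are all hypothetical.

```python
import random
from collections import Counter


def sample_mixture(datasets, weights, n, seed=0):
    """Draw n episodes, choosing the source dataset by mixture weight first."""
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append((source, rng.choice(datasets[source])))
    return batch


# Hypothetical per-robot datasets of very different sizes.
datasets = {
    "arm_a": [f"arm_a_episode_{i}" for i in range(1000)],
    "arm_b": [f"arm_b_episode_{i}" for i in range(50)],
}

# Two candidate mixtures: proportional to dataset size vs. evenly balanced.
total = sum(len(eps) for eps in datasets.values())
proportional = {name: len(eps) / total for name, eps in datasets.items()}
balanced = {name: 1 / len(datasets) for name in datasets}

for label, weights in [("proportional", proportional), ("balanced", balanced)]:
    batch = sample_mixture(datasets, weights, n=1000)
    print(label, Counter(source for source, _ in batch))
```

With proportional weights the larger dataset dominates each batch, while balanced weights show the model the smaller robot's data far more often. How such choices affect generalization across embodiments is the kind of question Sanketi describes.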
The team has open-sourced the Open X-Embodiment dataset and a small version of the RT-1-X model, but not the RT-2-X model.
“We believe these tools will transform the way robots are trained and accelerate this field of research,” Sanketi said. “We hope that open sourcing the data and providing safe but limited models will reduce barriers and accelerate research. The future of robotics relies on enabling robots to learn from each other, and most importantly, allowing researchers to learn from one another.”
Author: Ben Dickson
Source: VentureBeat
Reviewed By: Editorial Team