Although we sometimes call chatbots like Gemini and ChatGPT "robots", generative AI is playing an increasingly important role in real physical robots. Following the release of Gemini Robotics earlier this year, Google DeepMind has now launched a new on-device vision-language-action (VLA) model to control robots. Unlike previous versions, this one does not rely on cloud components, allowing a robot to operate completely autonomously.
Carolina Parada, head of robotics at Google DeepMind, said this approach to AI robotics can make robots more reliable in complex environments. This is also the first version of Google's robotics model that developers can fine-tune for specific uses.
Robotics is a unique challenge for AI because robots not only exist in the physical world, but also change their environment. Whether asking a robot to move blocks or tie shoelaces, it is difficult to predict every situation a robot may encounter. Traditional methods of training robot actions through reinforcement learning are very slow, but generative AI allows for greater generalization.
"It leverages Gemini's multimodal world understanding to do entirely new tasks," Carolina Parada explains. "This allows Gemini to not only generate text, write poetry, summarize articles, but also write code, generate images, and generate robot actions."
A general-purpose robot, no cloud support required
The previous version of Gemini Robotics (which remains the "best" version of Google's robotics models) runs as a hybrid system, with a small model on the robot and a larger model running in the cloud. You may have seen chatbots "think" for a few seconds while generating output, but robots need to react quickly. If you tell a robot to pick up and move an object, you don't want it to pause while it generates each step. The local model allows for fast reactions, while the server-based model helps with complex reasoning tasks. Google DeepMind has now released the local model as a standalone VLA, and it is surprisingly powerful.
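As a rough illustration of the difference, the sketch below contrasts the two control loops. All names here are hypothetical stand-ins rather than the actual Gemini Robotics API: in the hybrid setup, a small local policy keeps the arm responsive while a larger cloud model handles slower multi-step planning; the on-device release collapses both roles into the local model.

    # Hypothetical sketch of hybrid vs. on-device control loops (Python).
    # LocalVLA and CloudPlanner are illustrative stand-ins, not real Gemini Robotics classes.
    import time

    class LocalVLA:
        """Small on-robot model: fast enough to run every control tick."""
        def act(self, image, instruction):
            # Returns a low-level action (e.g., an end-effector delta) almost instantly.
            return {"delta_pose": [0.0, 0.0, 0.01], "gripper": "close"}

    class CloudPlanner:
        """Larger remote model: slower, used for multi-step reasoning."""
        def plan(self, image, instruction):
            time.sleep(1.0)  # a network round trip plus generation can take seconds
            return ["locate the bread", "pick up a slice", "place it on the plate"]

    def hybrid_step(image, instruction, planner, policy, cached_plan=None):
        # Replan only occasionally; keep the reactive loop local so the arm never stalls.
        if cached_plan is None:
            cached_plan = planner.plan(image, instruction)
        return policy.act(image, cached_plan[0]), cached_plan

    def on_device_step(image, instruction, policy):
        # The standalone VLA handles both understanding and action generation locally.
        return policy.act(image, instruction)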
The new Gemini Robotics on-device model is only slightly less accurate than the hybrid version. According to Parada, it can handle many tasks right out of the box. "As we interact with the robots, we find that they are surprisingly good at understanding new situations," Parada told Ars.
By releasing the model with a full SDK, the team hopes that developers will give Gemini-powered robots new tasks and expose them to new environments, which may reveal actions that the model's standard tuning can't handle. Using the SDK, robotics researchers were able to adapt the VLA to new tasks with just 50 to 100 demonstrations.
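To make that adaptation workflow concrete, here is a minimal behavior-cloning-style fine-tuning loop. It is only a sketch under assumptions: the Demonstration structure and the model's update method are invented for illustration and do not reflect the actual Gemini Robotics SDK.

    # Hypothetical fine-tuning loop: adapt a pretrained VLA to a new task
    # from a few dozen demonstrations. Names are illustrative, not the real SDK.
    from dataclasses import dataclass

    @dataclass
    class Demonstration:
        instruction: str   # e.g., "fold the shirt"
        frames: list       # camera observations recorded during the demo
        actions: list      # the operator's commands at each step

    def finetune(model, demos, epochs=10):
        """Supervised adaptation: predict the demonstrated action from each observation."""
        # Roughly the 50 to 100 demonstrations reported for adapting to a new task.
        for _ in range(epochs):
            for demo in demos:
                for frame, action in zip(demo.frames, demo.actions):
                    model.update(frame, demo.instruction, target_action=action)
        return model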
In AI robotics, "demonstrations" mean something different than in other areas of AI research. Parada explained that a demonstration typically involves teleoperating a robot, manually guiding it through a task, and then using that data to tune the model so it can handle the task autonomously. While synthetic data is an element of Google's training, it is not a substitute for real data. "We still find that for the most complex, elaborate behaviors, we need real data," Parada said. "But there's a lot that can be done with simulation."
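In practice, a demonstration is just a recorded trajectory with a provenance tag. The snippet below, again with invented names, shows one way a team might check that the training mix for a complex behavior still includes real teleoperated data alongside simulated rollouts.

    # Hypothetical check on a training mix: demos is a list of (trajectory, source)
    # pairs, where source is either "teleop" or "simulation".
    def validate_mix(demos):
        real = sum(1 for _, source in demos if source == "teleop")
        simulated = len(demos) - real
        if real == 0:
            raise ValueError("complex behaviors still need real teleoperated data")
        return {"teleop": real, "simulation": simulated}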
However, these highly complex behaviors may be beyond the capabilities of an on-device VLA. It should be able to handle tasks like tying shoes (a traditionally difficult task for AI robots) or folding a shirt without any problems. But if you want a robot to make you a sandwich, it might need a more powerful model to do the multi-step reasoning necessary to put the bread in the right place.
The team sees the Gemini Robotics on-device version as being well suited for environments where cloud connectivity is spotty or nonexistent. Processing the robot’s visual data locally is also better for privacy, such as in medical settings.
Building safe robots
Whether it's a chatbot that delivers dangerous information or a Terminator-like robot, the safety of AI systems is always a concern. We've all seen generative AI chatbots and image generators produce false information in their output, and the generative system behind Gemini Robotics is no exception; the model doesn't get it right every time, and giving it a physical body with a cold metal gripper makes those mistakes far more consequential.
To ensure that the robot behaves safely, Gemini Robotics takes a multi-layered approach. "With the full Gemini Robotics, you're connected to a model that can reason about what is a safe behavior," Parada said. "That then talks to the VLA, which actually generates the options, and the VLA calls the low-level controller, which typically has safety-critical components, like how much force can be applied or how fast the arm can move."
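Sketched in code, that layering might look something like this; the component names and interfaces are assumptions made for illustration, since the actual stack is not published in this form.

    # Hypothetical sketch of the three-layer safety flow described above.
    # The reasoner, the VLA, and the controller are illustrative stand-ins.
    def safe_step(request, image, reasoner, vla, controller):
        # Layer 1: a high-level model reasons about whether the behavior is safe at all.
        if not reasoner.is_safe(request, image):
            return controller.hold_position()  # refuse and keep the arm still

        # Layer 2: the VLA generates a candidate action for the approved task.
        action = vla.act(image, request)

        # Layer 3: the low-level controller enforces safety-critical limits
        # (maximum force, maximum arm speed) before anything actually moves.
        return controller.execute(action)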
Importantly, the new on-device model is just a VLA, so developers will need to build in safety mechanisms themselves. However, Google recommends that they mirror what the Gemini team did. Developers in the early testing program are advised to connect their systems to the standard Gemini Live API, which has a safety layer, and to implement low-level controllers that perform critical safety checks.
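For that last layer, a minimal example of the kind of safety-critical check a developer might implement is shown below; the limits and the action format are made up for illustration.

    # Minimal illustrative low-level safety check: clamp commanded force and
    # joint velocities before forwarding a VLA action to the hardware.
    MAX_FORCE_N = 20.0     # hypothetical gripper force limit, in newtons
    MAX_JOINT_VEL = 0.5    # hypothetical joint velocity limit, in rad/s

    def clamp(value, limit):
        return max(-limit, min(limit, value))

    def apply_action(action, hardware):
        safe_force = clamp(action["gripper_force"], MAX_FORCE_N)
        safe_velocities = [clamp(v, MAX_JOINT_VEL) for v in action["joint_velocities"]]
        hardware.send(force=safe_force, joint_velocities=safe_velocities)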
Anyone interested in testing the on-device version of Gemini Robotics should apply to join Google's trusted tester program. Parada said there have been many breakthroughs in robotics over the past three years, and this is just the beginning: the current release of Gemini Robotics is still based on Gemini 2.0. Parada pointed out that the Gemini Robotics team typically runs one version behind mainline Gemini development, and Gemini 2.5 is considered a huge improvement in chatbot capability. Maybe the same will be true for robots.