Google releases on-device VLA model: is the "Android" of the robot world coming?

Jun 26, 2025

On June 25, Google DeepMind officially released Gemini Robotics On-Device, the first Vision-Language-Action (VLA) model that can be deployed entirely locally on robots.



This marks a key turning point for embodied AI, from reliance on cloud computing power toward autonomous local operation, and it opens a new window of possibility for industrial deployment.


Fast learning from a small number of demonstrations, with generalization across robot morphologies


The deployment of embodied intelligence has long faced two major challenges: first, heavy reliance on cloud computing resources, which limits a robot's ability to work independently where the network is unstable or absent; second, models are large and hard to run efficiently on a robot's limited onboard compute.


According to the official introduction, Gemini Robotics On-Device runs locally on compute-constrained robot hardware while demonstrating strong versatility and task generalization. Because the model does not depend on a network connection, it has significant advantages for latency-sensitive applications.



More importantly, the model shows a high level of versatility and stability in actual operation. In the demonstration video released by Google DeepMind, the robot completed tasks such as "putting a Rubik's cube into a bag" and "unzipping a bag" without a network connection, spanning perception, semantic understanding, spatial reasoning, and high-precision execution.



DeepMind researchers said the model inherits the versatility and flexibility of Gemini Robotics, can handle a variety of complex bimanual tasks out of the box, and can learn new skills from only 50-100 demonstrations. A robotics engineer told reporters that most robots today need hundreds of demonstrations to learn a task, which means Google's new model greatly expands the range of applications and the flexibility of deployment.


It is worth noting that although the model was originally trained for a specific robot, it generalizes to other robot forms, such as dual-arm robots and humanoid robots, greatly expanding its application potential. The demonstration video shows that on a dual-arm Franka robot, the model can follow general instructions, handle previously unseen objects and scenes, complete dexterous tasks such as folding clothes, and perform industrial belt-assembly tasks that demand precision and dexterity.


In addition, Google has opened up fine-tuning of a VLA model for the first time, meaning engineers and robot companies can adapt the model on their own data to optimize its performance for specific tasks, scenarios, or hardware platforms, further improving its efficiency and practical value. Google also launched the Gemini Robotics SDK to help developers evaluate and quickly adapt the model. Taken together, these moves suggest Google hopes to provide an open, general-purpose, developer-friendly platform for robotics, much as the Android system did for the smartphone industry.
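The article does not detail the SDK's interface, but conceptually the few-shot adaptation described above amounts to behavior cloning: fine-tuning a pretrained policy so its predicted actions match a small set of recorded demonstrations. The sketch below is purely illustrative, written in plain PyTorch with a toy stand-in policy and fake tensors; none of the names are the actual Gemini Robotics SDK API.

```python
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    """Toy stand-in for a pretrained VLA policy: fuses image and
    instruction features and regresses a continuous action."""
    def __init__(self, img_dim=512, txt_dim=256, act_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),  # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, img_feat, txt_feat):
        return self.head(torch.cat([img_feat, txt_feat], dim=-1))

policy = TinyVLAPolicy()  # in practice, pretrained weights would be loaded here

# ~50-100 demonstrations; random tensors stand in for real
# (image, instruction, action) triples so the sketch runs as-is.
demos = [
    {"img": torch.randn(512), "txt": torch.randn(256), "act": torch.randn(7)}
    for _ in range(64)
]

opt = torch.optim.AdamW(policy.parameters(), lr=1e-5)
for epoch in range(10):
    for d in demos:
        pred = policy(d["img"], d["txt"])
        loss = nn.functional.mse_loss(pred, d["act"])  # imitate the demo action
        opt.zero_grad()
        loss.backward()
        opt.step()
```

A real system would use large image and language encoders and a far bigger policy; the point is only that adapting to a new task reduces to a small supervised loop over a few dozen demonstrations.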


Embodied intelligence is entering the "on-device era"


"This marks that robots can finally enter the real environment. An expert in the field of embodied intelligence told Blue Whale Technology reporters, "In the past, due to bandwidth and computing power, many robot AIs could only be demonstrated. Google's progress this time means that the universal model can truly run on hardware terminals, and in the future it will be able to perform complex operations without relying on an Internet connection. ”


Embodied intelligence has been regarded as the bridge from AGI to the real world, and a locally deployable VLA model is a key link in building that bridge. The aforementioned expert told the Blue Whale Technology reporter that local VLA models will make robots better suited to sensitive scenarios such as homes, healthcare, and education, addressing core challenges such as data privacy, real-time response, safety, and stability.


Over the past few years, "on-device deployment" of large language models has become one of the important trends. From initial reliance on large-scale cloud computing resources to running locally on edge devices such as phones and tablets, model compression, inference acceleration, and hardware co-design have made steady progress.


The same evolutionary path is gradually playing out in embodied intelligence. As the core architecture of embodied intelligence, a VLA (vision-language-action) model essentially lets a robot understand a task from multimodal input and act on it. Previously, such models typically had to rely on powerful cloud resources for inference and decision-making, and were constrained by network bandwidth, power consumption, and real-time bottlenecks, making it difficult to operate efficiently in complex real-world environments.
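Concretely, the control loop that on-device deployment enables is short: at each step the policy consumes the latest camera frame plus the standing instruction and emits the next low-level action, with no network round trip. A schematic sketch follows, in which every interface (policy, camera, robot) is a hypothetical placeholder rather than any real API:

```python
import time

def control_loop(policy, camera, robot, instruction, hz=10):
    """Schematic on-device VLA loop: perceive, infer, act, all locally.
    `policy`, `camera`, and `robot` are hypothetical duck-typed objects."""
    period = 1.0 / hz
    while not robot.task_done():
        frame = camera.read()                     # local perception
        action = policy.act(frame, instruction)   # local inference, no cloud call
        robot.apply(action)                       # e.g. joint or gripper command
        time.sleep(period)                        # hold a fixed control rate
```

Removing the cloud round trip from this inner loop is exactly what makes the latency-sensitive and offline scenarios described above feasible.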


The release of Gemini Robotics On-Device means that embodied intelligence is entering an "on-device era" like that of language models. The model not only runs stably under limited compute but also shows good versatility and transferability, supporting rapid learning and adaptation across tasks and robot forms. The release may also trigger a chain reaction in the industry: as AI compute and model architectures continue to evolve, "edge intelligence" is moving beyond the traditional Internet of Things (IoT) toward a more advanced stage represented by embodied intelligence.


Local VLA models will become the next battleground. "Currently, differences in body structure, degrees of freedom, and sensor configuration across robot types make a unified software architecture difficult to achieve," said an investor focused on robotics. "Once hardware standards converge, much as common components such as USB interfaces, keyboards, and screens formed de facto specifications in the smartphone ecosystem, it will greatly accelerate the standardization of algorithms and the realization of local deployment." He believes the "robot Android ecosystem" Google is building signals that a more standardized, developer-friendly, and widely accessible embodied intelligence is on its way.


However, the challenges of actual deployment should not be underestimated. The diversity and complexity of robot hardware remain prominent issues: the wide variety of hardware on the market means that even a powerful general model must be carefully adapted and tuned for each specific platform. Moreover, truly landing in massive, diverse real-world scenarios may carry extremely high data collection and annotation costs, especially in industrial or specialized service settings that require professional operating knowledge and equipment.


More importantly, robots must remain robust in extremely complex, dynamic, and unpredictable real-world environments. Lighting changes, object occlusion, unstructured clutter, and subtle differences in human-robot interaction all put a model's real-time perception and decision-making to a severe test. Ensuring that robots maintain a high level of stability and safety across real scenarios is a hard problem that the development of embodied intelligence must keep working to overcome.
