Vision-Language-Action Models (VLAs): Understanding and Application in the Modern World

Vision-language models (VLMs) are AI models that combine vision and language modalities, allowing them to process both images and natural language. Researchers are now extending VLMs with an action layer: the resulting models can process visual and textual information and generate sequences of decisions for real-world scenarios. This fusion of vision, language, and action is emerging as a potentially useful AI paradigm for a wide range of applications. Vision-Language-Action Models (VLAs) are designed to perceive visual data, interpret it using linguistic context, and then generate a corresponding action or response. In essence, VLAs emulate human-like cognition, where sight, comprehension, and action intertwine.

At its core, a VLA marries computer vision with natural language processing. The vision component enables the machine to “see”, that is, to interpret visual data. This is complemented by the language component, which processes that visual information in linguistic terms, enabling the machine to “understand” or describe what it sees. Finally, the action component produces a response, whether that is a decision, a movement, or another specific output.
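To make these three components concrete, here is a minimal sketch in PyTorch of how vision and language features might be fused into an action output. The encoders, dimensions, and action representation are illustrative assumptions, not the architecture of any particular published VLA:

```python
import torch
import torch.nn as nn

class VisionLanguageActionModel(nn.Module):
    """Minimal sketch of a VLA: fuse image and text features, emit an action."""

    def __init__(self, vision_dim=512, text_dim=512, hidden_dim=512, action_dim=7):
        super().__init__()
        # Stand-ins for pretrained encoders (e.g. a vision transformer and a
        # text transformer); here they are simple linear projections.
        self.vision_encoder = nn.Linear(vision_dim, hidden_dim)
        self.text_encoder = nn.Linear(text_dim, hidden_dim)
        # Fusion + action head: map the joint representation to an action vector
        # (e.g. steering/throttle for driving, or end-effector deltas for a robot arm).
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, image_features, text_features):
        v = self.vision_encoder(image_features)   # "see"
        t = self.text_encoder(text_features)      # "understand"
        fused = self.fusion(torch.cat([v, t], dim=-1))
        return self.action_head(fused)            # "act"

# Usage: one observation embedding and one instruction embedding -> one action vector.
model = VisionLanguageActionModel()
action = model(torch.randn(1, 512), torch.randn(1, 512))
print(action.shape)  # torch.Size([1, 7])
```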

Wayve recently introduced LINGO-1, an open-loop driving commentator. Some key quotes from their announcement:

The use of natural language in training robots is still in its infancy, particularly in autonomous driving. Incorporating language along with vision and action may have an enormous impact as a new modality to enhance how we interpret, explain and train our foundation driving models. By foundation driving models, we mean models that can perform several driving tasks, including perception (perceiving the world around them), causal and counterfactual reasoning (making sense of what they see), and planning (determining the appropriate sequence of actions). We can use language to explain the causal factors in the driving scene, which may enable faster training and generalisation to new environments.

We can also use language to probe models with questions about the driving scene to more intuitively understand what it comprehends. This capability can provide insights that could help us improve our driving models’ reasoning and decision-making capabilities. Equally exciting, VLAMs open up the possibility of interacting with driving models through dialogue, where users can ask autonomous vehicles what they are doing and why. This could significantly impact the public’s perception of this technology, building confidence and trust in its capabilities.

In addition to having a foundation driving model with broad capabilities, it is also eminently desirable for it to efficiently learn new tasks and quickly adapt to new domains and scenarios where we have small training samples. Here is where natural language could add value in supporting faster learning. For instance, we can imagine a scenario where a corrective driving action is accompanied by a natural language description of incorrect and correct behaviour in this situation. This extra supervision can enhance few-shot adaptations of the foundation model. With these ideas in mind, our Science team is exploring using natural language to build foundation models for end-to-end autonomous driving.
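To illustrate the kind of language-supervised correction described in the quote above, here is a hypothetical data layout. The field names and example values are illustrative assumptions, not Wayve's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CorrectiveExample:
    observation: str        # reference to camera frames / scene features
    incorrect_action: str   # what the model did
    corrected_action: str   # what it should have done
    explanation: str        # natural-language description of why

few_shot_batch = [
    CorrectiveExample(
        observation="frame_0421",
        incorrect_action="maintained speed through the crossing",
        corrected_action="slowed and yielded",
        explanation="A pedestrian was waiting at the zebra crossing, so the "
                    "vehicle should slow down and yield.",
    ),
]

# A fine-tuning loop could encode both actions and the explanation, using the
# language signal as an auxiliary target alongside the action loss.
```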

These models enable us to ask questions so we can better understand what the model “sees” and how it reasons. Here’s an example:

Language can help interpret and explain AI model decisions, which is potentially useful for adding transparency and understanding to AI. It can also help train models, enabling them to adapt more quickly to changes in the real world.
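As a sketch of that probing idea, the pattern is simply “scene plus question in, natural-language answer out”. The interface below is hypothetical (it does not reflect LINGO-1’s actual API) and uses a stubbed model purely for illustration:

```python
def answer_question(driving_model, camera_frames, question: str) -> str:
    """Ask a vision-language driving model a free-form question about the scene."""
    # In a real system this would tokenize the question, encode the frames,
    # and decode a textual answer; here we simply delegate to the model callable.
    return driving_model(camera_frames, question)

# Example usage with a stubbed model standing in for the real thing.
stub_model = lambda frames, q: "I am slowing down because a cyclist is merging ahead."
print(answer_question(stub_model,
                      camera_frames=["frame_0", "frame_1"],
                      question="Why are you slowing down?"))
```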
