OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies

University of Pennsylvania
*Equal Contribution

Abstract

Vision-language-action (VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more demanding tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OmniGuide, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions with task-specific attractors and repellers located in 3D space that influence the sampling of VLA actions. In this way, OmniGuide enables guidance sources with complementary task-relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OmniGuide significantly enhances the performance of state-of-the-art generalist policies (e.g., π0.5, GR00T N1.6) in terms of both success (68.2%) and safety (86.5%) rates. Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies.

Method

Generalist robot policies (VLAs) are often "jacks-of-all-trades, masters of none." While they understand broad instructions, they often lack the "last-mile" precision needed for complex spatial reasoning or for avoiding collisions in tight spaces. OmniGuide addresses this by providing inference-time guidance that leverages external sources of information, such as 3D foundation models, semantic-reasoning VLMs, and human pose models.

OmniGuide works with any generative policy (Diffusion or Flow Matching) by steering the robot's actions towards task-relevant regions (attractors) and away from obstacles (repellers). During the denoising process, we estimate the "clean" action, project it into 3D Cartesian space ($X$) using a differentiable kinematics model, and compute a task-specific energy $\mathcal{L}_y(X)$. We then use the gradient of this energy to steer the robot's plan:

$$A^{\tau+\delta} = A^\tau + \delta \left( v_\theta(A^\tau, o) - \lambda \text{clip}(\nabla_{A^\tau} \mathcal{L}_y(X), \alpha) \right)$$

This allows us to blend the VLA’s natural movement with external "expert" knowledge from other foundation models. We experimented with three modalities of guidance: collision avoidance, semantic grounding, and human imitation.
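To make the guided update above concrete, here is a minimal PyTorch sketch of a single guided denoising step. The policy velocity field `v_theta`, the differentiable forward kinematics `fk`, the task energy `energy_y`, and the one-step clean-action estimate under a linear flow schedule are all illustrative assumptions, not the exact implementation.

```python
import torch

def guided_step(A_tau, tau, obs, v_theta, fk, energy_y,
                delta=0.1, lam=1.0, alpha=1.0):
    """One guided Euler step: A^{tau+delta} = A^tau + delta * (v - lam * clip(grad, alpha))."""
    A_tau = A_tau.detach().requires_grad_(True)
    v = v_theta(A_tau, obs)                      # policy velocity field v_theta(A^tau, o)

    # One-step "clean" action estimate (assumes a linear flow schedule).
    A_clean = A_tau + (1.0 - tau) * v

    # Project the action chunk into 3D Cartesian space and evaluate the task energy.
    X = fk(A_clean)                              # e.g. (horizon, 3) end-effector positions
    energy = energy_y(X)                         # scalar L_y(X)

    # Guidance gradient w.r.t. the noisy actions, clipped for stability.
    (grad,) = torch.autograd.grad(energy, A_tau)
    grad = grad.clamp(-alpha, alpha)

    with torch.no_grad():
        return A_tau + delta * (v - lam * grad)
```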

Three Modalities of Guidance

Collision Avoidance

Repulsive Fields

We convert environment point clouds into a 3D Signed Distance Function (SDF). This creates a "safety buffer" that repels the robot arm from obstacles in real time.
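A minimal sketch of such a repulsive energy, assuming a precomputed SDF voxel grid queried with PyTorch's trilinear `grid_sample` (the grid layout, normalization, and hinge penalty below are illustrative choices rather than the exact implementation):

```python
import torch
import torch.nn.functional as F

def repulsive_energy(X, sdf_grid, grid_origin, grid_extent, margin=0.05):
    """Hinge penalty on signed distance: zero outside the safety buffer,
    quadratic once a point gets closer than `margin` to an obstacle.

    X:           (..., 3) query points in world coordinates (x, y, z).
    sdf_grid:    (D, H, W) signed distances, indexed (z, y, x).
    grid_origin: (3,) world coordinates of the grid corner.
    grid_extent: (3,) world-frame size of the volume along (x, y, z).
    """
    # Normalize world coordinates to [-1, 1] as expected by grid_sample.
    coords = 2.0 * (X - grid_origin) / grid_extent - 1.0
    grid = coords.reshape(1, -1, 1, 1, 3)                 # (1, K, 1, 1, 3)
    vol = sdf_grid.reshape(1, 1, *sdf_grid.shape)         # (1, 1, D, H, W)
    d = F.grid_sample(vol, grid, mode='bilinear',
                      padding_mode='border', align_corners=True)
    d = d.reshape(X.shape[:-1])                           # signed distance per query point
    return torch.clamp(margin - d, min=0.0).pow(2).sum()
```

The hinge keeps the gradient zero whenever the arm is farther than `margin` from every obstacle, so the guidance only activates inside the safety buffer.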

Semantic Grounding

Attractive Targets

Vision-Language Models identify task-relevant objects in pixel space. We back-project these to 3D centroids that act as a gravitational pull for the robot's gripper.
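A minimal sketch of this back-projection and the resulting attractive energy, assuming a binary detection mask, an aligned depth image, and pinhole intrinsics `(fx, fy, cx, cy)`; all names are illustrative:

```python
import torch

def backproject_centroid(mask, depth, fx, fy, cx, cy):
    """Lift the masked pixels to camera-frame 3D points and return their centroid."""
    v, u = torch.nonzero(mask, as_tuple=True)     # pixel rows / columns inside the mask
    z = depth[v, u]
    x = (u.float() - cx) * z / fx
    y = (v.float() - cy) * z / fy
    # In practice this centroid would be transformed into the robot's base frame.
    return torch.stack([x, y, z], dim=-1).mean(dim=0)     # (3,) centroid, camera frame

def attractive_energy(X, target, weight=1.0):
    """Pull the end-effector positions X (..., 3) toward the target centroid."""
    return weight * (X - target).pow(2).sum(dim=-1).mean()
```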

Human Imitation

Sparse Trajectories

We extract human wrist positions from a single ("one-shot") human demonstration video. A monotonic matching strategy aligns these sparse waypoints to the robot's trajectory to guide the manipulation.
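One way to implement such a monotonic matching is a small dynamic program over pairwise distances, sketched below; the interface and the exact formulation are illustrative assumptions rather than the strategy used in the paper.

```python
import torch

def monotonic_match(X, W):
    """Assign each trajectory point X[t] a waypoint index that is
    non-decreasing in t, minimizing total squared distance (simple DP)."""
    T = X.shape[0]
    cost = torch.cdist(X, W).pow(2)                      # (T, K) pairwise squared distances
    dp = cost.clone()
    for t in range(1, T):
        prefix_min, _ = torch.cummin(dp[t - 1], dim=0)   # best predecessor with index <= j
        dp[t] = cost[t] + prefix_min
    # Backtrack the non-decreasing assignment.
    j = int(torch.argmin(dp[-1]))
    match = [j]
    for t in range(T - 1, 0, -1):
        j = int(torch.argmin(dp[t - 1][: j + 1]))
        match.append(j)
    return torch.tensor(match[::-1])                     # (T,) waypoint index per timestep

def imitation_energy(X, W):
    """Mean squared distance between the trajectory and its matched waypoints."""
    idx = monotonic_match(X.detach(), W)                 # discrete matching, no gradients
    return (X - W[idx]).pow(2).sum(dim=-1).mean()
```

Because the discrete matching is computed on detached positions, gradients flow only through the distances to the matched waypoints.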

Results

We evaluate OmniGuide on a diverse set of tasks spanning the three modalities of guidance, against the baseline π0.5 in the real world and GR00T N1.6 in simulation using the RoboCasa environments. See the results below! OmniGuide consistently improves the base VLAs on all tasks.

BibTeX

TODO