Grounded Vision-Language Interpreter for Integrated Task and Motion Planning

In Review
1 The University of Tokyo, 2 OMRON SINIC X Corporation, 3 University of Hamburg (* Equal contribution)

TL;DR While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature means they often lack the safety guarantees and interpretability crucial for real-world deployment. This paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors.

Overview

ViLaIn-TAMP is a hybrid planning framework that bridges the gap between vision-language models and classical symbolic planners. While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature means they often lack the safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup.

Our framework comprises three main components: (1) a Vision-Language Interpreter (ViLaIn) for translating multimodal inputs into PDDL problems, (2) a sequence-before-satisfy TAMP module for finding symbolically complete, collision-free action plans, and (3) a corrective planning module for refining outputs based on grounded failure feedback.
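
To make the interplay between these components concrete, the sketch below outlines the closed planning loop in Python. It is a minimal illustration under assumed interfaces: the class and method names (generate_problem, plan, refine) and the data fields are placeholders for exposition, not the released implementation.

# Minimal sketch of the ViLaIn-TAMP planning loop (hypothetical interfaces).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PDDLProblem:
    objects: str  # PDDL (:objects ...) block
    init: str     # PDDL (:init ...) block
    goal: str     # PDDL (:goal ...) block


@dataclass
class TAMPResult:
    success: bool
    plan: List[str]  # grounded, motion-feasible action sequence if success
    feedback: str    # grounded failure feedback otherwise


def plan_with_correction(vilain, tamp, corrective, image, instruction,
                         max_attempts: int = 3) -> Optional[List[str]]:
    # ViLaIn grounds the scene image and instruction into a PDDL problem.
    problem = vilain.generate_problem(image, instruction)
    for _ in range(max_attempts):
        # The TAMP module searches for a symbolically complete,
        # collision-free plan for the current problem specification.
        result = tamp.plan(problem)
        if result.success:
            return result.plan
        # On failure, the corrective planning module refines the problem
        # using the grounded failure feedback, and planning is retried.
        problem = corrective.refine(problem, result.feedback)
    return None  # no feasible plan found within the attempt budget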

Framework Overview

The ViLaIn-TAMP framework consists of three major components: (1) a Vision-Language Interpreter (ViLaIn) for translating multimodal inputs into PDDL problems, (2) a sequence-before-satisfy TAMP module for finding symbolically complete, collision-free action plans, and (3) a corrective planning module for refining outputs based on grounded failure feedback.

Figure: The ViLaIn-TAMP framework.

Framework Details

Figure: ViLaIn framework architecture.
Figure: TAMP module architecture.

Manipulation Tasks

Figure: Robotic system.

We evaluate ViLaIn-TAMP on five manipulation tasks in a cooking domain:

  • Pick and Place: moves a target object to a desired location (an illustrative PDDL problem for this task is sketched after the list).
  • Pick Obstacles Dual Arm: moves a target object to a desired location while using both arms to remove another object that occupies the target location.
  • Pick Obstacles Single Arm: the same task as Pick Obstacles Dual Arm, but with a single arm.
  • Slice Food: slices a food item (e.g., a vegetable or fruit) using a tool (e.g., a knife).
  • Slice and Serve: slices a food item using a tool and serves the slices at a desired location (e.g., a bowl or plate).
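
As an illustration of the kind of problem specification ViLaIn produces, the Python snippet below holds a hypothetical PDDL problem for the Pick and Place task. The domain, predicate, and object names (cooking, at, clear, hand-empty, tomato, cutting_board) are assumptions for exposition and do not necessarily match the actual cooking domain used in the paper.

# Hypothetical PDDL problem that ViLaIn might generate for Pick and Place.
# Object, predicate, and domain names are illustrative only.
PICK_AND_PLACE_PROBLEM = """
(define (problem pick-and-place)
  (:domain cooking)
  (:objects
    tomato - food
    table cutting_board - location
    left_arm - arm)
  (:init
    (at tomato table)
    (clear cutting_board)
    (hand-empty left_arm))
  (:goal (at tomato cutting_board)))
"""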

Experimental Results

We consider four model configurations to evaluate ViLaIn-TAMP:

  • ViLaIn-TAMP-CP: ViLaIn-TAMP with corrective planning (CP).
  • ViLaIn-TAMP-No-CP: ViLaIn-TAMP without CP.
  • Baseline-CP: A baseline that uses VLMs to directly generate action plans, combined with CP.
  • Baseline-No-CP: The baseline approach without CP.

The results show that (1) ViLaIn-TAMP outperforms the baseline approach by a large margin, and (2) CP consistently improves success rates. With the proposed closed-loop corrective architecture, ViLaIn-TAMP achieves a mean success rate more than 30% higher than without corrective planning.

Figure: Main experimental results showing success rates across different tasks and configurations.

Key Contributions

Our work makes several key contributions to the field of robot planning:

  1. Hybrid Planning Framework: We bridge the gap between VLMs and classical symbolic planners, combining the strengths of both approaches.

  2. Interpretable and Verifiable: Unlike black-box VLM planners, our approach provides interpretable problem specifications and safety guarantees through symbolic reasoning.

  3. Corrective Planning: Our framework includes a corrective planning module that uses grounded failure feedback to iteratively refine problem specifications and plans, improving performance over repeated attempts.

  4. Domain-Agnostic: The ViLaIn component works with off-the-shelf VLMs without requiring additional domain-specific training.

  5. Comprehensive Evaluation: We demonstrate the effectiveness of our approach across multiple manipulation tasks in a cooking domain.

Citation

# arXiv version
@misc{siburian2025vilaintamp,
    title={Grounded Vision-Language Interpreter for Integrated Task and Motion Planning}, 
    author={Jeremy Siburian and Keisuke Shirai and Cristian C. Beltran-Hernandez and Masashi Hamaya and Michael Görner and Atsushi Hashimoto},
    year={2025},
    eprint={2506.03270},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2506.03270}, 
}