Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
In Review

TL;DR: ViLaIn-TAMP is a hybrid planning framework that pairs vision-language models with classical symbolic planning to enable verifiable, interpretable, and autonomous robot behaviors.
ViLaIn-TAMP is a hybrid planning framework that bridges the gap between vision-language models (VLMs) and classical symbolic planners. While recent advances in VLMs have accelerated the development of language-guided robot planners, their black-box nature often lacks the safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge to set up.
The ViLaIn-TAMP framework consists of three major components: (1) a Vision-Language Interpreter (ViLaIn) for translating multimodal inputs into PDDL problems, (2) a sequence-before-satisfy TAMP module for finding symbolically complete, collision-free action plans, and (3) a corrective planning module for refining outputs based on grounded failure feedback.
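The sketch below illustrates one way these components could be wired together. It is a minimal, hypothetical outline: the `vlm` and `tamp` objects and their methods (`generate_problem`, `symbolic_plan`, `satisfy_motions`, `explain_failure`) are illustrative names, not the actual ViLaIn-TAMP API.

```python
# Hypothetical sketch of the ViLaIn-TAMP loop (illustrative names only):
# (1) ViLaIn grounds the scene and instruction into a PDDL problem,
# (2) the sequence-before-satisfy TAMP module searches for a symbolic plan
#     and then tries to satisfy it with collision-free motions,
# (3) corrective planning feeds grounded failure feedback back to ViLaIn.

def plan_with_vilain_tamp(image, instruction, domain_pddl, vlm, tamp, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        # (1) Vision-Language Interpreter: multimodal input -> PDDL problem
        problem_pddl = vlm.generate_problem(
            image=image,
            instruction=instruction,
            domain=domain_pddl,
            failure_feedback=feedback,  # None on the first attempt
        )

        # (2a) Sequence: find a symbolically complete action skeleton
        skeleton = tamp.symbolic_plan(domain_pddl, problem_pddl)
        if skeleton is None:
            feedback = "symbolic planning failed: problem is unsolvable as specified"
            continue

        # (2b) Satisfy: ground the skeleton with collision-free motions
        motion_plan = tamp.satisfy_motions(skeleton)
        if motion_plan is not None:
            return motion_plan  # symbolically complete and collision-free

        # (3) Corrective planning: turn the failure into grounded feedback
        feedback = tamp.explain_failure(skeleton)

    return None  # no feasible plan found within the attempt budget
```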
We evaluate ViLaIn-TAMP on five manipulation tasks in a cooking domain and compare four model configurations.
The results show that (1) ViLaIn-TAMP outperforms the baseline approach by a large margin, and (2) corrective planning (CP) consistently improves success rates. With the closed-loop corrective architecture, ViLaIn-TAMP achieves a mean success rate more than 30% higher than without corrective planning.
Our work makes several key contributions to the field of robot planning:
Hybrid Planning Framework: We bridge the gap between VLMs and classical symbolic planners, combining the strengths of both approaches.
Interpretable and Verifiable: Unlike black-box VLM planners, our approach provides interpretable problem specifications (see the example after this list) and safety guarantees through symbolic reasoning.
Corrective Planning: Our framework includes a corrective planning module that learns from failures and improves performance through iterative refinement.
Domain-Agnostic: The ViLaIn component works with off-the-shelf VLMs without requiring additional domain-specific training.
Comprehensive Evaluation: We demonstrate the effectiveness of our approach across multiple manipulation tasks in a cooking domain.
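To make the notion of an interpretable problem specification concrete, the snippet below shows a toy PDDL problem of the kind ViLaIn is designed to produce from an image and an instruction. The domain, objects, and predicates are invented for this illustration and are not taken from the paper's actual cooking domain.

```python
# Illustrative only: a toy PDDL problem such as ViLaIn might generate for an
# instruction like "place the carrot on the cutting board". All names below
# are made up for this sketch; the paper's cooking domain differs.
EXAMPLE_PROBLEM_PDDL = """
(define (problem place-carrot)
  (:domain cooking)
  (:objects carrot - food
            cutting-board - surface
            gripper - robot)
  (:init (on-table carrot)
         (clear cutting-board)
         (hand-empty gripper))
  (:goal (on carrot cutting-board)))
"""
```

Because the specification is plain PDDL text, a human or an off-the-shelf validator can inspect exactly which facts and goals were committed to before any motion is executed.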
# arXiv version
@misc{siburian2025vilaintamp,
      title={Grounded Vision-Language Interpreter for Integrated Task and Motion Planning},
      author={Jeremy Siburian and Keisuke Shirai and Cristian C. Beltran-Hernandez and Masashi Hamaya and Michael Görner and Atsushi Hashimoto},
      year={2025},
      eprint={2506.03270},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.03270},
}