Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model

IROS 2024
¹Keio University, ²OMRON SINIC X Corporation

Overview

Tactile perception is vital, especially when distinguishing visually similar objects. We propose an approach to incorporate tactile data into a Vision-Language Model (VLM) for visuo-tactile zero-shot object recognition. Our approach leverages the zero-shot capability of VLMs to infer tactile properties from the names of tactilely similar objects. The proposed method translates tactile data into a textual description solely by annotating object names for each tactile sequence during training, making it adaptable to various contexts with low training costs. The proposed method was evaluated on the FoodReplica and Cube datasets, demonstrating its effectiveness in recognizing objects that are difficult to distinguish by vision alone.

Method

The tactile embedding network learns an embedding of each tactile sequence. These embeddings are converted into textual descriptions via the tactile-to-text database. During inference, the VLM receives the textual description along with the visual image and outputs the most likely class label for the input object in a zero-shot manner.

Figure: Proposed pipeline.
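
The sketch below illustrates this pipeline in Python. It is only an illustrative assumption of how the pieces fit together: the tactile encoder, the stored database entries, the cosine-similarity retrieval, and the prompt wording are hypothetical placeholders, not the exact networks, data, or prompts used in the paper.

import numpy as np

def encode_tactile(sequence: np.ndarray) -> np.ndarray:
    # Hypothetical tactile embedding network: maps a tactile sequence
    # (T x D sensor readings) to a fixed-size embedding vector.
    # Placeholder: mean-pool over time; the paper trains a network for this.
    return sequence.mean(axis=0)

# Tactile-to-text database: embeddings of training sequences, each annotated
# only with an object name (the sole supervision described above).
# The entries here are random placeholders.
database = {
    "marshmallow": np.random.rand(16),
    "eraser":      np.random.rand(16),
    "sponge":      np.random.rand(16),
}

def tactile_to_text(query_embedding: np.ndarray, k: int = 2) -> str:
    # Retrieve the names of the k tactilely most similar training objects
    # and phrase them as a textual description for the VLM.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(database,
                    key=lambda name: cosine(query_embedding, database[name]),
                    reverse=True)
    return "This object feels similar to: " + ", ".join(ranked[:k]) + "."

def build_vlm_prompt(tactile_description: str, candidate_labels: list[str]) -> str:
    # Zero-shot prompt text; the VLM also receives the visual image
    # alongside this description (image handling omitted here).
    return (
        f"{tactile_description}\n"
        "Given the image and this tactile description, which of the following "
        f"is the object? Options: {', '.join(candidate_labels)}."
    )

# Example usage with a dummy tactile sequence of 50 frames x 16 channels.
query = encode_tactile(np.random.rand(50, 16))
prompt = build_vlm_prompt(tactile_to_text(query), ["real cookie", "cookie replica"])
print(prompt)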

Citation

@inproceedings{ueda2024visuotactile,
  title={Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model},
  author={Ueda, Shiori and Hashimoto, Atsushi and Hamaya, Masashi and Tanaka, Kazutoshi and Saito, Hideo},
  booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2024}
}