Multi-Agent Behavior Retrieval: Retrieval-Augmented Policy Training for Cooperative Manipulation by Mobile Robots

IROS 2024

¹OMRON SINIC X Corporation. *Work done as an intern at OMRON SINIC X.

TL;DR: We introduce the Multi-Agent Coordination Skill Database, which allows multiple mobile robots to efficiently reuse past memories when adapting to new tasks.

Overview

Due to the complex interactions between agents, learning a multi-agent control policy often requires a prohibitive amount of data. This paper aims to enable multi-agent systems to effectively utilize past memories to adapt to novel collaborative tasks in a data-efficient fashion. We propose the Multi-Agent Coordination Skill Database, a repository that stores a collection of coordinated behaviors, each associated with a distinctive key vector. Our Transformer-based skill encoder effectively captures the spatio-temporal interactions that contribute to coordination and provides a unique skill representation for each coordinated behavior. Using only a small number of demonstrations of the target task, the database enables us to train the policy on a dataset augmented with the retrieved demonstrations. Experimental evaluations demonstrate that our method achieves a significantly higher success rate in push manipulation tasks compared with baseline methods such as few-shot imitation learning. Furthermore, we validate the effectiveness of our retrieve-and-learn framework in a real environment using a team of wheeled robots.

Video

Multi-Agent Behavior Retrieval

Retrieve-and-Learn Framework

Given a large task-agnostic prior dataset $\mathcal{D}_{\text{prior}}$ and a few demonstrations $\mathcal{D}_{\text{target}}$ collected from the target task, our main objective is to retrieve coordination skills from $\mathcal{D}_{\text{prior}}$ that facilitate downstream policy learning for the target task.

Our retrieve-and-learn framework consists of three primary components: (i) Database Construction, (ii) Coordination Skill Retrieval, and (iii) Retrieval-Augmented Policy Training.
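
As a rough illustration of how these three components could fit together, here is a minimal Python sketch; `encoder`, `distance`, and `train_policy` are hypothetical placeholders, not the actual implementation.

```python
# Minimal sketch of the retrieve-and-learn pipeline.
# `encoder`, `distance`, and `train_policy` are illustrative placeholders.

def retrieve_and_learn(d_prior, d_target, encoder, distance, train_policy, k):
    # (i) Database construction: encode each prior demonstration into a
    #     coordination-skill key vector and store (key, demo) pairs.
    database = [(encoder(demo), demo) for demo in d_prior]

    # (ii) Coordination skill retrieval: for each target demo, pull the k
    #      prior demos whose keys are closest to the query in skill space.
    d_ret = []
    for demo in d_target:
        query = encoder(demo)
        ranked = sorted(database, key=lambda kv: distance(kv[0], query))
        d_ret.extend(d for _, d in ranked[:k])

    # (iii) Retrieval-augmented policy training on D_target ∪ D_ret.
    return train_policy(list(d_target) + d_ret)
```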

Multi-Agent Coordination Skill Database

$\mathcal{D}_{\text{prior}}$ consists of task-agnostic demonstrations of $N$ mobile agents. To effectively retrieve demonstrations from $\mathcal{D}_{\text{prior}}$ that are relevant to $\mathcal{D}_{\text{target}}$, we seek an abstract representation with which to measure the similarity between two multi-agent demonstrations. We refer to this as the multi-agent coordination skill representation, a compressed vector representation that is distinctive to a specific coordination behavior. For the $N$ agents' states $\mathbf{s}_t$, we aim to learn a skill encoder $\mathcal{E}$ that maps the state representation of multiple agents into a single representative vector $\mathbf{z} \in \mathbb{R}^n$.

For this, we introduce a Transformer-based coordination skill encoder, which learns to capture interactions among agents as well as interactions between agents and a manipulation object.

To obtain a predictable skill representation space, we train the Transformer-based skill encoder as a prediction model of the agents' future trajectories, i.e., actions. In our network, the past context encoder is composed of stacked multi-head self-attention layers that learn to attend to past trajectories across the spatial and temporal domains. The future trajectory decoder is composed of multi-head self-attention layers followed by stacked multi-head cross-attention layers that integrate the past trajectory information and the input tokens.
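
A minimal PyTorch sketch of this architecture is shown below; the token layout, pooling strategy, and layer sizes are assumptions for illustration rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class CoordinationSkillEncoder(nn.Module):
    """Sketch: past agent/object states become (agent, time) tokens, stacked
    self-attention mixes them across space and time, and pooling yields a
    single skill vector z (layer sizes are illustrative assumptions)."""

    def __init__(self, state_dim, d_model=128, n_heads=4, n_layers=3, z_dim=64):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.past_encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_z = nn.Linear(d_model, z_dim)

    def forward(self, past_states):
        # past_states: (batch, n_agents, horizon, state_dim)
        b, n, t, d = past_states.shape
        tokens = self.embed(past_states.reshape(b, n * t, d))
        context = self.past_encoder(tokens)      # spatio-temporal self-attention
        z = self.to_z(context.mean(dim=1))       # pooled skill representation
        return z, context

class FutureTrajectoryDecoder(nn.Module):
    """Sketch: future query tokens self-attend and then cross-attend to the
    past context to predict the agents' future trajectories."""

    def __init__(self, d_model=128, n_heads=4, n_layers=2, out_dim=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, out_dim)  # e.g., future xy positions

    def forward(self, future_queries, past_context):
        # future_queries: (batch, n_future_tokens, d_model)
        return self.head(self.decoder(future_queries, past_context))
```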

Retrieval-Augmented Policy Training

We assume that the prior dataset $\mathcal{D}_{\text{prior}}$ and the target demonstrations $\mathcal{D}_{\text{target}}$ are composed as follows:

$\mathcal{D}_{\text{prior}}$: The prior dataset encompasses a large-scale offline collection of demonstrations of diverse cooperative tasks. It also includes noisy or sub-optimal demonstrations to mimic real-world scenarios. All demonstrations are task-agnostic, i.e., each demonstration is stored without any task-specific annotations.
$\mathcal{D}_{\text{target}}$: The target dataset comprises a small amount of expert data, i.e., all data consists of well-coordinated demonstrations that complete the target task.

Given $\mathcal{D}_{\text{prior}}$ and $\mathcal{D}_{\text{target}}$, we aim to retrieve cooperative behaviors similar to those seen in $\mathcal{D}_{\text{target}}$ in the learned coordination skill space. Once we obtain the retrieved data $\mathcal{D}_{\text{ret}}$, we train a multi-agent control policy using $\mathcal{D}_{\text{target}}$ augmented with $\mathcal{D}_{\text{ret}}$. That is, the training data $\mathcal{D}_{\text{train}}$ is described as $\mathcal{D}_{\text{train}} = \mathcal{D}_{\text{target}} \cup \mathcal{D}_{\text{ret}}$.
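
Concretely, retrieval could be implemented as a nearest-neighbour search over the skill keys. The sketch below assumes precomputed key vectors and an L2 distance, with the per-query retrieval budget `k` as a tunable assumption.

```python
import numpy as np

def build_training_set(prior_keys, prior_demos, target_keys, target_demos, k=300):
    """Retrieve the k prior demos closest to each target query in skill space,
    then form D_train = D_target ∪ D_ret. The L2 metric and per-query k are
    illustrative assumptions."""
    retrieved = []
    for query in target_keys:                           # one key per target demo
        dists = np.linalg.norm(prior_keys - query, axis=1)
        nearest = np.argsort(dists)[:k]                 # indices of closest prior demos
        retrieved.extend(prior_demos[i] for i in nearest)
    return list(target_demos) + retrieved               # D_train
```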

Results

Quantitative Results on Simulated Demonstrations

We assume $N$ mobile robots are navigated to push an object toward a predefined goal state. To explore various coordination scenarios, we focus on three settings, $N \in \{2, 3, 4\}$. For each setup, we collect demonstrations for four distinct tasks, designed to cover two different objects (stick or block) and two manipulation difficulties (easy or hard).

Success rate ⬆️ [%]:

| Num. of agents | object | level | trajectory matching | few-shot imitation learning | ✨ours✨ |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 2 | 🧱 block | hard | 41.4±5.9 | 20.0±4.8 | 41.4±5.9 |
| 2 | 🧱 block | easy | 55.7±6.0 | 15.7±4.4 | 57.1±6.0 |
| 2 | 🪄 stick | hard | 32.9±5.7 | 5.7±2.8 | 35.7±5.8 |
| 2 | 🪄 stick | easy | 68.6±5.6 | 12.9±4.0 | 67.1±5.7 |
| 3 | 🧱 block | hard | 48.6±6.0 | 32.9±5.7 | 42.9±6.0 |
| 3 | 🧱 block | easy | 77.1±5.1 | 34.3±5.7 | 90.0±3.6 |
| 3 | 🪄 stick | hard | 28.6±5.4 | 34.3±5.7 | 41.4±5.9 |
| 3 | 🪄 stick | easy | 67.1±5.7 | 52.9±6.0 | 65.7±5.7 |
| 4 | 🧱 block | hard | 55.7±6.0 | 38.6±5.9 | 58.6±5.9 |
| 4 | 🧱 block | easy | 78.6±4.9 | 55.7±6.0 | 88.6±3.8 |
| 4 | 🪄 stick | hard | 18.6±4.7 | 7.1±3.1 | 24.3±5.2 |
| 4 | 🪄 stick | easy | 52.9±6.0 | 57.1±6.0 | 70.0±5.5 |
Baselines
Trajectory matching: We use standard trajectory matching as a baseline against which to evaluate our retrieval based on the multi-agent skill representation. By computing the similarity of the robots' $xy$-coordinate trajectories with FastDTW, we retrieve data from $\mathcal{D}_{\text{prior}}$ whose trajectories are similar to those of the target data (a minimal sketch follows this list).
Few-shot imitation learning: We compare our method with a few-shot adaptation method: a multi-task policy is trained on $\mathcal{D}_{\text{prior}}$ and fine-tuned on $\mathcal{D}_{\text{target}}$.
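
The trajectory-matching baseline could look roughly like the following, using the `fastdtw` package; pairing agents by index and summing their distances is an assumption for illustration.

```python
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def demo_distance(demo_a, demo_b):
    """Sum agent-wise FastDTW distances between xy trajectories.
    demo_a, demo_b: lists of (T, 2) arrays, one per agent
    (index-based agent pairing is an illustrative assumption)."""
    total = 0.0
    for traj_a, traj_b in zip(demo_a, demo_b):
        dist, _ = fastdtw(traj_a, traj_b, dist=euclidean)
        total += dist
    return total
```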

The table shows that our retrieval-augmented policy training outperforms few-shot imitation learning and agent-wise trajectory matching. These results indicate that our approach is particularly effective in tasks requiring advanced robot coordination, such as scenarios with a larger number of robots or more complex tasks.

Real-robot Experiments

To validate the efficacy of our method in the real world, we use demonstrations of real wheeled robots as queries to the prior dataset $\mathcal{D}_{\text{prior}}$ constructed in simulated environments. We then train the policy on a few real-robot demonstrations augmented with the retrieved simulated demonstrations.

Setting
We employ our custom-designed swarm robot platform, "maru". The wheeled microrobots communicate with the host computer over 2.4 GHz wireless communication. The positions of all robots and the object are tracked in real time by a high-speed digital light processing (DLP) structured-light projector system. We control the robots' positions by sending the 2D coordinates of their future locations from the host computer. We collect three demonstrations that navigate the robots toward a given goal state using a hand-gesture system based on Leap Motion, where hand movements are translated into the trajectories of multiple robots. Using the real-robot demonstrations as queries, we retrieve 300 simulated demonstrations from the prior dataset for each query.
Videos: Query (Human Demonstration) · Baseline (No Retrieved Data) · Ours (Query + Retrieved)

We compare the policy trained using the real and retrieved demonstrations with the one trained using only the real-robot demonstrations. The results clearly demonstrate that the policy trained by our retrieve-and-learn framework successfully pushes the object close to the goal state, while the policy trained using only a few real demonstrations (No Retrieved Data) fails to complete the task.

Acknowledgements

This work was supported by JST AIP Acceleration Research JPMJCR23U2, Japan.

Citation

@inproceedings{kuroki2024iros,
  title={Multi-Agent Behavior Retrieval: Retrieval-Augmented Policy Training for Cooperative Push Manipulation by Mobile Robots},
  author={So Kuroki and Mai Nishiura and Tadashi Kozuno},
  booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  organization={IEEE},
  year={2024}
}