Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
ACL 2026
TL;DR We investigate whether compact language models (0.5–1B) can acquire sophisticated agentic retrieval-augmented generation (RAG) behavior. While existing agentic RAG systems rely on multi-billion-parameter models, compact models suffer from poor initial outputs, sparse rewards, and unstable reinforcement learning dynamics. To address these challenges, we propose Distillation-Guided Policy Optimization (DGPO), a two-phase training framework combining cold-start knowledge distillation and selective teacher-guided reinforcement learning.
This project introduces Distillation-Guided Policy Optimization (DGPO), a reinforcement learning framework designed to unlock agentic retrieval-augmented generation (RAG) capabilities in compact language models (0.5–1B parameters). DGPO stabilizes training by combining cold-start knowledge distillation with selective teacher-guided reinforcement learning.
DGPO achieves up to a 55× improvement over the base compact model (average score 0.006 vs. 0.329) and even surpasses its 3B-parameter teacher on NQ, PopQA, and HotpotQA.
Applying Reinforcement Learning (RL) to compact models presents unique challenges. Unlike larger models, compact models (e.g., 0.5B parameters) exhibit poor initial performance, resulting in sparse rewards and unstable training dynamics.
As shown in the figures below, smaller models lag significantly behind larger counterparts in agentic RAG tasks, and standard RL methods like PPO and GRPO often fail to improve performance or lead to early collapse.



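To overcome these challenges, we propose Distillation-Guided Policy Optimization (DGPO). Our framework operates in two key phases: a cold-start knowledge distillation phase, followed by a reinforcement learning phase with selective teacher guidance.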
The core of DGPO lies in its selective reward and penalty mechanism during the RL phase.
This objective allows the student to explore when confident but forces alignment with the teacher when it fails.
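The selective behavior described above can be illustrated with a minimal sketch. This is not the paper's objective: the binary answer reward, the KL direction (student relative to teacher), the weight `beta`, and the function names are all assumptions chosen for illustration. The only property taken from the text is that the teacher penalty is applied when the student's rollout fails and is switched off otherwise.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def selective_objective(answer_correct, student_probs, teacher_probs, beta=0.1):
    """Hypothetical per-rollout training signal (illustrative only).

    answer_correct: whether the student's final answer was judged correct
    student_probs, teacher_probs: per-token next-token distributions (list of lists)
    beta: assumed weight on the teacher-alignment penalty
    """
    reward = 1.0 if answer_correct else 0.0
    if answer_correct:
        # Student succeeded: no teacher penalty, so it is free to explore.
        penalty = 0.0
    else:
        # Student failed: penalize divergence from the teacher on this rollout.
        per_token = [kl_divergence(s, t) for s, t in zip(student_probs, teacher_probs)]
        penalty = beta * sum(per_token) / max(len(per_token), 1)
    return reward - penalty

# A failed rollout that disagrees with the teacher receives a negative signal.
student = [[0.9, 0.1], [0.2, 0.8]]
teacher = [[0.5, 0.5], [0.2, 0.8]]
failed_signal = selective_objective(False, student, teacher)
passed_signal = selective_objective(True, student, teacher)
```

In this sketch a successful rollout keeps its full reward regardless of how far it drifts from the teacher, while a failed rollout is pulled back toward the teacher in proportion to its divergence.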
We introduce Agentic RAG Capabilities (ARCap), a fine-grained metric to diagnose how models perform agentic search, rather than just checking final answer accuracy. ARCap evaluates three dimensions:

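We evaluated DGPO across seven benchmarks. DGPO achieves the highest average score (0.329) among all 0.5B student methods and outperforms the RL baseline (PPO) and the distillation baselines on six of the seven benchmarks; the exception is Bamboogle, where KD scores higher (0.290 vs. 0.274). Remarkably, the 0.5B student model trained with DGPO approaches or even surpasses the 3B teacher model, exceeding it on NQ, PopQA, and HotpotQA.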
| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| 🐣 Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| 🎓 Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| GKD | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | 0.378 🏅 | 0.481 | 0.402 🏅 | 0.342 🏅 | 0.303 | 0.120 | 0.274 | 0.329 |
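As a quick check on the headline numbers, the short script below recomputes the averages from the per-dataset scores in the table and lists the datasets where DGPO exceeds the teacher. Treating the "55×" figure as the ratio of the rounded average scores is an assumption; it is one reading that is consistent with the table, not a definition taken from the paper.

```python
# Per-dataset scores copied from the table above, in column order.
datasets = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2wiki", "MuSiQue", "Bamboogle"]
student = [0.004, 0.006, 0.007, 0.007, 0.015, 0.000, 0.000]
teacher = [0.365, 0.569, 0.393, 0.340, 0.368, 0.135, 0.298]
dgpo = [0.378, 0.481, 0.402, 0.342, 0.303, 0.120, 0.274]

def mean(xs):
    return sum(xs) / len(xs)

# Averages rounded to three decimals, matching the "Avg." column.
student_avg = round(mean(student), 3)
teacher_avg = round(mean(teacher), 3)
dgpo_avg = round(mean(dgpo), 3)

# Assumed reading of the "55x" claim: ratio of the rounded averages.
improvement = dgpo_avg / student_avg

# Datasets on which the 0.5B DGPO student scores above the 3B teacher.
beats_teacher = [name for name, d, t in zip(datasets, dgpo, teacher) if d > t]

print(student_avg, teacher_avg, dgpo_avg, round(improvement), beats_teacher)
```

Per-dataset ratios are not used here because the base student scores 0.000 on MuSiQue and Bamboogle, which makes a ratio undefined for those columns.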
We utilized the ARCap framework to diagnose how DGPO improves agentic behaviors compared to the teacher and baselines.
| Models | NQ (Single-hop), w/o think | NQ (Single-hop), w/ think | MuSiQue (Multi-hop), w/o think | MuSiQue (Multi-hop), w/ think |
|---|---|---|---|---|
| Student-0.5B | 0.386 | 0.034 | 0.166 | 0.013 |
| Teacher-3B | 0.589 | 0.560 | 0.413 | 0.357 |
| PPO | 0.547 | 0.581 | 0.258 | 0.242 |
| KD | 0.540 | 0.544 | 0.321 | 0.256 |
| DGPO (Ours) | 0.565 | 0.593 | 0.312 | 0.287 |
| Models | NQ (Single-hop), Hit Ratio | MuSiQue (Multi-hop), Hit Ratio | MuSiQue (Multi-hop), Search Steps |
|---|---|---|---|
| Student-0.5B | 0.004 | 0.052 | 3.86 |
| Teacher-3B | 0.682 | 0.668 | 1.60 |
| PPO | 0.711 | 0.568 | 1.68 |
| KD | 0.675 | 0.570 | 2.45 |
| DGPO (Ours) | 0.682 | 0.583 | 2.64 |
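To make the two column types in the table above concrete, here is a minimal sketch of how a hit ratio and an average search-step count could be computed from logged trajectories. The definitions are assumptions, since this page does not spell them out: a "hit" is taken to mean that any retrieved passage contains a gold answer string, and "search steps" is taken to be the number of search calls per question. The trajectory layout (`retrieved`, `gold_answers`) is hypothetical.

```python
def is_hit(retrieved_steps, gold_answers):
    """Assumed hit criterion: some retrieved passage contains a gold answer string."""
    golds = [g.lower() for g in gold_answers]
    return any(
        g in passage.lower()
        for step in retrieved_steps
        for passage in step
        for g in golds
    )

def summarize(trajectories):
    """Return the hit ratio and mean number of search calls over trajectories."""
    n = len(trajectories)
    hits = sum(is_hit(t["retrieved"], t["gold_answers"]) for t in trajectories)
    steps = sum(len(t["retrieved"]) for t in trajectories)
    return {"hit_ratio": hits / n, "search_steps": steps / n}

# Toy example with a hypothetical log format: one list of passages per search call.
trajectories = [
    {
        "gold_answers": ["Paris"],
        "retrieved": [["France is in Europe."], ["The capital of France is Paris."]],
    },
    {
        "gold_answers": ["1969"],
        "retrieved": [["The Moon orbits the Earth."]],
    },
]
summary = summarize(trajectories)
```

With these toy logs, one of two questions is a hit and the questions use two and one search calls respectively.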
DGPO maintains stable learning curves well beyond where other methods collapse. As shown below, while GRPO and standard PPO struggle with the 0.5B model, DGPO sustains performance gains up to 1000 training steps.

This work was supported by JST AIP Acceleration Research, Japan, Grant Number JPMJCR23U2 and JST PRESTO, Japan, Grant Number JPMJPR2518.
@inproceedings{kotoge2026dgpo,
title = "Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities",
author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2026",
publisher = "Association for Computational Linguistics",
}
@inproceedings{kotoge2025democratizing,
title = "Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models",
author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
booktitle = "NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning",
year = "2025",
url = "https://openreview.net/forum?id=CP0H9NAWES",
}