Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

ACL 2026
¹OMRON SINIC X Corporation · ²The University of Osaka · *Work done as a research intern at OMRON SINIC X.

TL;DR We investigate whether compact language models (0.5–1B parameters) can acquire sophisticated agentic retrieval-augmented generation (RAG) behavior. While existing agentic RAG systems rely on multi-billion-parameter models, compact models suffer from poor initial outputs, sparse rewards, and unstable reinforcement learning dynamics. To address these challenges, we propose Distillation-Guided Policy Optimization (DGPO), a two-phase training framework combining cold-start knowledge distillation and selective teacher-guided reinforcement learning.

Overview

This project introduces Distillation-Guided Policy Optimization (DGPO), a reinforcement learning framework designed to unlock agentic retrieval-augmented generation (RAG) capabilities in compact language models (0.5–1B parameters). DGPO stabilizes training via:

  1. Cold-Start Initialization with Knowledge Distillation (KD) using high-quality teacher-generated trajectories.
  2. Selective Teacher Guidance during RL—rewarding correct autonomous reasoning while penalizing incorrect outputs via KL divergence against the teacher.

DGPO achieves up to 55× improvement over base compact models and even surpasses its 3B-parameter teacher on several datasets.

Challenges in Compact Agentic RAG

Applying Reinforcement Learning (RL) to compact models presents unique challenges. Unlike larger models, compact models (e.g., 0.5B parameters) exhibit poor initial performance, resulting in sparse rewards and unstable training dynamics.

As shown in the figures below, smaller models lag significantly behind larger counterparts in agentic RAG tasks, and standard RL methods like PPO and GRPO often fail to improve performance or lead to early collapse.

Performance Gap across model sizes

Figure 1: Comparison of prompt-based and RL-based agentic RAG performance gaps.

Training instability in small models

Figure 2: Unstable training curves of 0.5B models using PPO and GRPO.

Distillation-Guided Policy Optimization (DGPO)

DGPO Framework Diagram

Figure 3: The DGPO pipeline transforms the reference model from a passive regularizer into an active pedagogical guide.

To overcome these challenges, we propose Distillation-Guided Policy Optimization (DGPO). Our framework operates in two key phases:

  1. Cold-Start KD Initialization: The student model is initialized by distilling from a teacher's correct trajectories (TGOs). This establishes a stable foundation for reasoning.
  2. Distillation-Guided RL: We employ a "mimic if wrong, reward if right" strategy.
    • Correct Answer: The student receives a reward (r=1) and updates its policy autonomously.
    • Incorrect Answer: The student is penalized via KL divergence to mimic the teacher's distribution, effectively using the teacher as an active guide for error correction.
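Phase 1 can be sketched as plain next-token cross-entropy over teacher-generated trajectories. The snippet below is a minimal NumPy illustration, not the paper's implementation: the function names and the toy logit shapes are ours, and a real training loop would backpropagate through the student model rather than consume precomputed logits.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cold_start_kd_loss(student_logits, teacher_trajectory):
    """Phase 1 sketch: next-token cross-entropy on a teacher trajectory.

    student_logits: (T, V) array, one row of student logits per token position.
    teacher_trajectory: length-T sequence of token ids emitted by the teacher.
    Returns the mean negative log-likelihood the student assigns to the
    teacher's tokens (the quantity minimized during cold-start KD).
    """
    probs = softmax(student_logits)
    picked = probs[np.arange(len(teacher_trajectory)), teacher_trajectory]
    return float(-np.log(picked + 1e-12).mean())
```

A student that already matches the teacher's trajectory incurs near-zero loss, so this phase mainly repairs the poor initial outputs that make rewards sparse in Phase 2.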

The core of DGPO lies in its selective reward and penalty mechanism during the RL phase:

$$
r(x, y) =
\begin{cases}
1 & \text{if } y = y^{*} \quad \text{(reward for correct answer)} \\[4pt]
-\beta\, D_{\mathrm{KL}}\bigl[\pi_g(\cdot \mid x)\,\|\,\pi_\theta(\cdot \mid x)\bigr] & \text{if } y \neq y^{*} \quad \text{(teacher mimicry if wrong)}
\end{cases}
$$

This objective allows the student to explore when confident but forces alignment with the teacher $\pi_g$ when it fails.
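The selective reward above can be written directly as a small function. The sketch below is illustrative: it uses sequence-level distributions and a simple string match for correctness, whereas a real implementation would compute the KL term token-by-token over the trajectory; the `beta` default is an arbitrary placeholder, not the paper's hyperparameter.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def dgpo_reward(answer, gold, teacher_probs, student_probs, beta=0.1):
    """'Mimic if wrong, reward if right': r = 1 on a correct answer,
    otherwise -beta * D_KL(teacher || student), pulling the student's
    distribution toward the teacher's when the rollout fails."""
    if answer == gold:
        return 1.0
    return -beta * kl_divergence(teacher_probs, student_probs)
```

Note the asymmetry: correct rollouts are reinforced autonomously (the teacher plays no role), while the penalty for incorrect rollouts shrinks to zero exactly when the student already matches the teacher.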

Agentic RAG Capabilities (ARCap) Evaluation

We introduce Agentic RAG Capabilities (ARCap), a fine-grained metric to diagnose how models perform agentic search, rather than just checking final answer accuracy. ARCap evaluates three dimensions:

  • Thinking: The ability to plan search steps and synthesize evidence.
  • Query Rewriting: The ability to reformulate user questions into effective search queries.
  • Source Referencing: The ability to accurately cite and integrate retrieved documents.
Agentic RAG Capabilities

Figure 4: ARCap characterizes thinking, query rewriting, and source referencing capabilities.
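As a rough illustration of how two of the ARCap dimensions can be scored, the sketch below implements normalized exact match (for source referencing / thinking accuracy) and a retrieval hit ratio (for query rewriting). These are simplified proxies under our own assumptions, not the paper's exact ARCap implementation; `retrieve(query, k)` is a hypothetical stand-in for an actual retriever.

```python
def exact_match(prediction, gold):
    """Normalized exact match: lowercase and collapse whitespace before comparing."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(prediction) == norm(gold))

def query_hit_ratio(rewritten_queries, retrieve, gold_doc_ids, k=5):
    """Fraction of rewritten queries whose top-k retrieval contains any gold document.

    `retrieve(query, k)` must return a list of document ids; it stands in
    for whatever retriever the agent actually calls.
    """
    hits = sum(
        any(doc in gold_doc_ids for doc in retrieve(q, k))
        for q in rewritten_queries
    )
    return hits / max(len(rewritten_queries), 1)
```

A hit ratio like this separates "the model asked the right question" from "the model read the evidence correctly", which is the distinction ARCap is designed to expose.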

Experimental Results

Overall Performance

We evaluated DGPO across seven benchmarks. DGPO consistently outperforms RL baselines (PPO) and distillation baselines. Remarkably, the 0.5B student model trained with DGPO approaches or even surpasses the 3B teacher model on datasets like NQ and HotpotQA.

| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| 🐣 Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| 🎓 Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| GKD | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | **0.378** 🏅 | 0.481 | **0.402** 🏅 | **0.342** 🏅 | 0.303 | 0.120 | 0.274 | 0.329 |

Table 1: Overall performance. Best and second-best results are highlighted. Bold 🏅 indicates outperforming the teacher.

Agentic RAG Capabilities (ARCap) Analysis

We utilized the ARCap framework to diagnose how DGPO improves agentic behaviors compared to the teacher and baselines.

  • (a) Source Referencing: DGPO provides strong information extraction when the correct evidence is directly available.
  • (b) Query Rewriting: DGPO achieves teacher-level query rewriting.
  • (c) Thinking: DGPO exhibits the strongest multi-hop reasoning by taking more search steps than the teacher model.
| Models | NQ (Single-hop) w/o think | NQ (Single-hop) w/ think | MuSiQue (Multi-hop) w/o think | MuSiQue (Multi-hop) w/ think |
|---|---|---|---|---|
| Student-0.5B | 0.386 | 0.034 | 0.166 | 0.013 |
| Teacher-3B | 0.589 | 0.560 | 0.413 | 0.357 |
| PPO | 0.547 | 0.581 | 0.258 | 0.242 |
| KD | 0.540 | 0.544 | 0.321 | 0.256 |
| DGPO (Ours) | 0.565 | 0.593 | 0.312 | 0.287 |

Table 2: Source Referencing & Thinking Acc. (EM). Best and second-best results are highlighted.

| Models | NQ (Single-hop) Hit Ratio | MuSiQue (Multi-hop) Hit Ratio | MuSiQue (Multi-hop) Search Steps |
|---|---|---|---|
| Student-0.5B | 0.004 | 0.052 | 3.86 |
| Teacher-3B | 0.682 | 0.668 | 1.60 |
| PPO | 0.711 | 0.568 | 1.68 |
| KD | 0.675 | 0.570 | 2.45 |
| DGPO (Ours) | 0.682 | 0.583 | 2.64 |

Table 3: Query Rewriting & Search Efficiency. Best and second-best results are highlighted.

Training Stability

DGPO maintains stable learning curves well beyond where other methods collapse. As shown below, while GRPO and standard PPO struggle with the 0.5B model, DGPO sustains performance gains up to 1000 training steps.

Training Stability Curve

Figure 5: Training stability comparison. DGPO (red) maintains stable improvement while baselines collapse.

Acknowledgement

This work was supported by JST AIP Acceleration Research, Japan, Grant Number JPMJCR23U2 and JST PRESTO, Japan, Grant Number JPMJPR2518.

Citation

@inproceedings{kotoge2026dgpo,
    title = "Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    publisher = "Association for Computational Linguistics",
}
@inproceedings{kotoge2025democratizing,
    title = "Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models",
    author = "Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin",
    booktitle = "NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning",
    year = "2025",
    url = "https://openreview.net/forum?id=CP0H9NAWES",
}