CUHK and Meituan researchers developed Agent-RRM, a reward model that scores entire Agent trajectories for reasoning quality and tool use rather than only final outcomes. They curated a dataset of annotated Agent traces and built the Reagent framework to integrate textual feedback and scalar scores into training, addressing the sparse-reward problem in multi-step tasks such as web search and coding.
Key Points
1. Agent-RRM evaluates full trajectories, outputting an analysis, a critique, and a process score between 0 and 1 (see the sketch after this list).
2. A dataset of real Agent traces is annotated to distinguish solid reasoning from lucky outcomes.
3. The Reagent framework unifies textual feedback and scalar rewards for Agent RL training.
4. It differentiates flawed execution from poor planning in complex, multi-modal tasks.
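As referenced in point 1, here is a minimal Python sketch of what a trajectory-level input and the three-part output could look like. This is not the paper's actual API: the class and function names (Step, Trajectory, RewardModelOutput, score_trajectory) and the toy scoring heuristic are illustrative assumptions.

```python
# Hypothetical sketch of a trajectory-level reward model interface,
# based only on the description above (analysis + critique + 0-1 score).
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of an Agent trace: a thought plus an optional tool call."""
    thought: str
    tool_name: str | None = None
    tool_args: dict = field(default_factory=dict)
    tool_output: str | None = None


@dataclass
class Trajectory:
    """A full trace: the task, intermediate steps, and the final answer."""
    task: str
    steps: list[Step]
    final_answer: str


@dataclass
class RewardModelOutput:
    """The three outputs described for Agent-RRM."""
    analysis: str         # detailed internal reasoning about the trace
    critique: str         # simplified feedback intended for the Agent
    process_score: float  # score in [0, 1] for reasoning and tool-use quality


def score_trajectory(trajectory: Trajectory) -> RewardModelOutput:
    """Placeholder for the reward model call; a real system would prompt a
    trained LLM with the serialized trajectory and parse a structured reply."""
    # Toy heuristic so the sketch runs end to end: reward tool usage.
    used_tools = sum(1 for s in trajectory.steps if s.tool_name is not None)
    score = min(1.0, 0.5 + 0.25 * used_tools)
    return RewardModelOutput(
        analysis=f"Trajectory has {len(trajectory.steps)} steps, {used_tools} tool calls.",
        critique="Verify intermediate results before committing to a final answer.",
        process_score=score,
    )


if __name__ == "__main__":
    traj = Trajectory(
        task="Find the release year of Python 3.0",
        steps=[Step(thought="Search the web", tool_name="web_search",
                    tool_args={"query": "Python 3.0 release year"},
                    tool_output="December 2008")],
        final_answer="2008",
    )
    print(score_trajectory(traj))
```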
Impact Analysis
Agent-RRM provides dense supervision for long-horizon Agent tasks, potentially improving reasoning and tool proficiency without hand-crafted reward rules, and it enables scalable training in open environments.
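As a toy illustration of why a process score densifies supervision, the snippet below blends a sparse outcome reward with a 0 to 1 process score: a failed but well-reasoned trajectory still receives signal, while a lucky guess is not fully rewarded. The blending formula and the weight alpha are assumptions for illustration, not the method reported in the paper.

```python
# Hedged sketch: combining a sparse outcome reward with a dense process score
# (such as one produced by a trajectory-level reward model).

def blended_reward(outcome_correct: bool, process_score: float,
                   alpha: float = 0.5) -> float:
    """Weighted mix of the binary outcome reward and the 0-1 process score."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    return (1 - alpha) * outcome_reward + alpha * process_score


# A failed trajectory with sound reasoning still gets partial credit...
print(blended_reward(outcome_correct=False, process_score=0.8))  # 0.4
# ...while a lucky guess with weak reasoning is not fully rewarded.
print(blended_reward(outcome_correct=True, process_score=0.2))   # 0.6
```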
Technical Details
Agent-RRM is trained on trajectories labeled with detailed 'grading' feedback and processes full traces, including intermediate thoughts and tool calls. Its outputs comprise an internal analysis, a simplified critique intended for the Agent, and a composite score that reflects process quality.
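Below is a minimal sketch of how these two feedback channels might be consumed downstream, in the spirit of the Reagent unification: the simplified critique is routed back to the Agent as text, while the scalar score feeds the RL update. The function and field names are hypothetical assumptions, not the framework's actual API.

```python
# Hedged sketch of two feedback channels: textual critique for the Agent,
# scalar process score for the trainer. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class Feedback:
    critique: str         # natural-language feedback for the Agent
    process_score: float  # scalar reward for the trainer


def revise_with_critique(messages: list[dict], feedback: Feedback) -> list[dict]:
    """Textual channel: append the critique so the Agent can retry the task."""
    return messages + [{
        "role": "user",
        "content": f"Reviewer feedback: {feedback.critique}\nPlease revise your approach.",
    }]


def advantage_for_update(feedback: Feedback, group_mean_score: float) -> float:
    """Scalar channel: a simple baseline-subtracted advantage, as one might use
    in a group-relative policy-gradient update."""
    return feedback.process_score - group_mean_score


if __name__ == "__main__":
    fb = Feedback(critique="The second tool call ignored the first result.",
                  process_score=0.35)
    msgs = [{"role": "user", "content": "Book the cheapest flight to Tokyo."}]
    print(revise_with_critique(msgs, fb)[-1]["content"])
    print(advantage_for_update(fb, group_mean_score=0.5))  # -0.15
```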