TSR introduces trajectory-search rollouts to enhance multi-turn reinforcement learning for LLM agents. Lightweight tree-style search produces higher-quality trajectories, improving rollout generation and stabilizing training, and yields performance gains of up to 15% on tasks such as Sokoban and WebShop.
Key Points
1. Lightweight per-turn search for better actions at each step (see the rollout sketch under Technical Details)
2. Optimizer-agnostic: pairs with PPO or GRPO (a minimal sketch follows this list)
3. Up to 15% gains and stable learning under sparse rewards
Impact Analysis
TSR shifts search into training-time rollouts, efficiently enabling stronger multi-turn agents. Complements existing RL methods and reduces mode collapse in stochastic environments. Opens the door to broader LLM agent adoption in complex tasks.
Technical Details
Implements best-of-N, beam, and shallow-lookahead search guided by task feedback. Tested on Sokoban, FrozenLake, and WebShop with only a one-time increase in rollout compute. Leaves the optimization objective unchanged.
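A minimal sketch of the best-of-N variant on a toy task, assuming the environment can be cloned so that per-step task feedback can score candidate actions before committing; `LineEnv`, `sample_action`, and `best_of_n_rollout` are hypothetical stand-ins, not the paper's implementation.

```python
import copy
import random

class LineEnv:
    """Toy task: walk a number line from 0 to a goal; feedback is the
    negative distance to the goal after each step."""
    def __init__(self, goal: int = 5):
        self.goal, self.pos = goal, 0

    def step(self, action: int) -> tuple[float, bool]:
        self.pos += action
        return -abs(self.goal - self.pos), self.pos == self.goal

def sample_action(rng: random.Random) -> int:
    """Stand-in for sampling a candidate action from the policy."""
    return rng.choice([-1, 1])

def best_of_n_rollout(env: LineEnv, n: int = 4, max_turns: int = 20,
                      seed: int = 0) -> float:
    """At each turn, sample n candidate actions, score each on a cloned
    copy of the environment, and commit to the highest-feedback one."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(max_turns):
        candidates = [sample_action(rng) for _ in range(n)]
        scored = []
        for a in candidates:
            probe = copy.deepcopy(env)      # probe feedback on a clone
            feedback, _ = probe.step(a)
            scored.append((feedback, a))
        _, action = max(scored)             # keep the best candidate
        reward, done = env.step(action)     # commit it in the real env
        total += reward
        if done:
            break
    return total

print(best_of_n_rollout(LineEnv()))
```

Beam and shallow-lookahead variants generalize this pattern by keeping several partial trajectories or probing more than one step ahead before committing.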