ResearchGym is a benchmark of five containerized environments built from ICML, ICLR, and ACL papers, totaling 39 sub-tasks, in which agents propose hypotheses and run experiments to beat the papers' baselines. GPT-5-powered agents show a capability-reliability gap, succeeding in only 6.7% of evaluations and completing 26.5% of sub-tasks. The study identifies key failure modes such as impatience and poor resource management, and finds that agents occasionally reach SOTA results but cannot do so reliably.
Key Points
1. Repurposes 5 top-conference papers into 39 sub-tasks, withholding the papers' proposed methods
2. A GPT-5 agent improves on baselines in 1/15 evaluations (6.7%) and completes 26.5% of sub-tasks (see the sketch after this list)
3. Failure modes: impatience, overconfidence, and poor coordination of parallel experiments
4. Occasionally surpasses an ICML 2025 Spotlight SOTA in single runs
5. Evaluates Claude Code (Opus-4.5) and Codex (GPT-5.2) scaffolds
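
As a minimal sketch of how the headline numbers in point 2 could be aggregated, assuming hypothetical per-run fields (`beat_baseline`, `subtasks_done`, `subtasks_total`) rather than ResearchGym's actual scoring schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    # Hypothetical per-run record; field names are illustrative, not ResearchGym's schema.
    beat_baseline: bool   # did the agent improve on the paper's baseline metric?
    subtasks_done: int    # sub-tasks completed in this run
    subtasks_total: int   # sub-tasks available in this environment

def aggregate(records: list[EvalRecord]) -> tuple[float, float]:
    """Return (evaluation success rate, sub-task completion rate)."""
    success_rate = sum(r.beat_baseline for r in records) / len(records)
    completion = sum(r.subtasks_done for r in records) / sum(r.subtasks_total for r in records)
    return success_rate, completion

# e.g. 1 success across 15 evaluations -> 1/15 ≈ 6.7% evaluation success rate
```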
Impact Analysis
This benchmark standardizes the evaluation of AI agents on real research tasks, exposing reliability gaps in frontier models and motivating improvements in long-horizon planning and resource management for autonomous research agents.
Technical Details
The environments preserve each paper's datasets, evaluation harnesses, and baselines inside containers. Agents must form hypotheses, run experiments, and beat the paper's reported metrics without access to the original methods. The benchmark thus tests closed-loop research under context-length limits.
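
To make the closed-loop setup concrete, here is a minimal sketch of how such an agent-environment loop could look, assuming a hypothetical `docker exec` interface, a container named `research-env`, an agent object with a `propose` method, and a fixed step budget standing in for context-length limits; ResearchGym's real harness and API will differ.

```python
import subprocess

MAX_STEPS = 50  # hypothetical budget standing in for the benchmark's context-length limits

def run_experiment(command: str) -> float:
    """Execute a command inside the environment container and parse a score.

    The container name ("research-env") and the convention that the last line
    of stdout holds the metric are assumptions, not ResearchGym's actual API.
    """
    out = subprocess.run(
        ["docker", "exec", "research-env", "bash", "-lc", command],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip().splitlines()[-1])

def closed_loop(agent, baseline_score: float) -> bool:
    """Propose -> experiment -> observe loop; success means beating the frozen baseline."""
    history = []
    for _ in range(MAX_STEPS):
        command = agent.propose(history)   # agent suggests the next experiment to run
        score = run_experiment(command)    # run it against the preserved eval harness
        history.append((command, score))   # feed the result back for the next proposal
        if score > baseline_score:
            return True                    # baseline improved -> evaluation counts as success
    return False
```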

