SWE-rebench-V2: Largest Open Coding Dataset

Largest open multilingual coding dataset boosts RL training for code agents, essential for beating leaderboards.
30-Second TL;DR
What Changed
32,000+ executable tasks with Docker environments based on real issues
Why It Matters
This dataset enables scalable RL training for multilingual code agents, potentially accelerating open-source advancements in coding AI beyond Python-centric benchmarks. It lowers barriers for researchers training competitive models.
What To Do Next
Download SWE-rebench-V2 from Hugging Face and fine-tune your code agent on its 32k tasks.
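
The download step could look roughly like the sketch below, assuming the Hugging Face `datasets` library is installed. The dataset identifier, the `train` split name, and the `language` field are all assumptions that the source does not confirm; take the real names from the dataset card before relying on this.

```python
from collections import Counter
from typing import Iterable, Mapping


def load_tasks(dataset_id: str, split: str = "train"):
    """Load one split of the dataset from the Hugging Face Hub.

    Both dataset_id and split are assumptions: check the dataset card.
    """
    # Imported lazily so the pure-Python helpers below work without the library.
    from datasets import load_dataset

    return load_dataset(dataset_id, split=split)


def filter_by_language(
    tasks: Iterable[Mapping], language: str, field: str = "language"
) -> list:
    """Keep tasks whose (hypothetical) language field matches, ignoring case."""
    wanted = language.lower()
    return [t for t in tasks if str(t.get(field, "")).lower() == wanted]


def language_counts(tasks: Iterable[Mapping], field: str = "language") -> Counter:
    """Count tasks per language; records without the field count as 'unknown'."""
    return Counter(str(t.get(field, "unknown")).lower() for t in tasks)
```

Filtering by language before fine-tuning is one way to use the multilingual coverage, for example to train on Go or Rust tasks that Python-only datasets lack.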
Deep Insight
Web-grounded analysis with 9 cited sources.
Enhanced Key Takeaways
- SWE-rebench V2 employs an interactive setup agent to synthesize repository-specific installation and test procedures, enabling reproducible Docker environments across 3,600+ repositories.[1]
- The dataset integrates diagnostic metadata distinguishing model failures from environment issues like flaky tests, calibrated against human-verified SWE-bench annotations using LLM ensemble judges.[1]
- Nebius developed custom Kubernetes infrastructure scaling to 8,000 parallel agent pods and TractoAI for storage, processing, and evaluation of thousands of solutions per experiment.[2]
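
To illustrate the distinction between model failures and environment issues, here is a minimal sketch of one possible rule. It is not the dataset's actual diagnostic logic, which the source does not describe in detail. The assumption is that each task's target tests are run several times with the reference patch and with the model's patch; the `Verdict` labels and the `classify` function are hypothetical names.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    RESOLVED = "resolved"
    MODEL_FAILURE = "model_failure"
    ENVIRONMENT_ISSUE = "environment_issue"


def classify(gold_runs: Sequence[bool], model_runs: Sequence[bool]) -> Verdict:
    """Classify one task from repeated pass/fail outcomes.

    gold_runs:  outcomes of the target tests with the reference patch applied.
    model_runs: outcomes of the same tests with the model's patch applied.
    """
    if not gold_runs or not model_runs:
        raise ValueError("need at least one run of each kind")

    # If the reference patch does not pass every time, the tests or the
    # environment are unreliable, so the model cannot be blamed.
    if not all(gold_runs):
        return Verdict.ENVIRONMENT_ISSUE

    # The environment is stable: the model's patch must pass every run.
    if all(model_runs):
        return Verdict.RESOLVED
    return Verdict.MODEL_FAILURE
```

Under this rule a patch that passes only some of the time in a stable environment counts as a model failure, which is a deliberate simplification.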
Competitor Analysis
| Feature | SWE-rebench V2 | SWE-bench | SWE-Gym |
|---|---|---|---|
| Languages | 20 (language-agnostic) [1] | Primarily Python [3][7] | Python-only [1] |
| Tasks | 32k executable + 120k raw [1] | ~2k verified [7] | Executable Python tasks [1] |
| Benchmarks | swe-rebench.com leaderboard [1] | swebench.com leaderboard [7] | Trajectory data for RL [1] |
| Pricing | Open/free (Hugging Face) [1] | Open/free [7] | Open/free [1] |
Technical Deep Dive
- Automated pipeline: Interactive setup agent generates repo-specific Docker install/test procedures; LLM ensemble judges filter unsound tasks, validated on SWE-bench Verified (1,699 human-scored instances).[1][8]
- Data processing: Map-reduce ops join issues/PRs, filter permissive licenses/new tests, split patches, compute metadata; uses TractoAI for filesystem-intensive ops, logs, and test statuses.[2]
- Evaluation infra: Kubernetes scales to 8k agent pods; evaluates 2,500+ solutions per SWE-bench Verified run (5 runs/task) in ~18 min on TractoAI cluster with prebuilt images.[2]
- Quality assessment: Multi-pass consensus labeling, similar to SPICE, scores issue clarity and test coverage; issue and PR creation dates are tracked so tasks can be decontaminated against model release dates.[1][3]
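
The date-based decontamination mentioned in the last item can be sketched as a simple filter: a task created after a model's release cannot have appeared in that model's training data. This is an illustrative sketch, not the dataset's own tooling; the `created_at` field name and its ISO 8601 format are assumptions.

```python
from datetime import date
from typing import Iterable, Mapping


def created_on(task: Mapping, field: str = "created_at") -> date:
    """Parse the (hypothetical) creation timestamp, keeping only the date part."""
    # Slicing to 10 characters accepts both "2025-03-01" and "2025-03-01T12:00:00Z".
    return date.fromisoformat(str(task[field])[:10])


def decontaminate(
    tasks: Iterable[Mapping], model_release: date, field: str = "created_at"
) -> list:
    """Keep only tasks created strictly after the model's release date."""
    return [t for t in tasks if created_on(t, field) > model_release]
```

For example, `decontaminate(tasks, date(2025, 1, 1))` would keep only tasks whose issue was opened after 1 January 2025. Tasks created on the release date itself are dropped, the conservative choice.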
Sources (9)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Reddit r/LocalLLaMA