SWE-rebench-V2: Largest Open Coding Dataset

🦙 Read original on Reddit r/LocalLLaMA

💡 Largest open multilingual coding dataset boosts RL training for code agents, essential for beating leaderboards.

⚡ 30-Second TL;DR

What Changed

32,000+ executable tasks with Docker environments based on real issues

Why It Matters

This dataset enables scalable RL training for multilingual code agents, potentially accelerating open-source advancements in coding AI beyond Python-centric benchmarks. It lowers barriers for researchers training competitive models.

What To Do Next

Download SWE-rebench-V2 from Hugging Face and fine-tune your code agent on its 32k tasks.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 9 cited sources.

🔑 Enhanced Key Takeaways

  • SWE-rebench V2 employs an interactive setup agent to synthesize repository-specific installation and test procedures, enabling reproducible Docker environments across 3,600+ repositories.[1]
  • The dataset integrates diagnostic metadata distinguishing model failures from environment issues like flaky tests, calibrated against human-verified SWE-bench annotations using LLM ensemble judges.[1]
  • Nebius developed custom Kubernetes infrastructure scaling to 8,000 parallel agent pods and TractoAI for storage, processing, and evaluation of thousands of solutions per experiment.[2]
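
The distinction between model failures and environment issues can be illustrated with a small sketch. It assumes each attempt is rerun several times on identical code; the record format, the function name `triage`, and the three labels are hypothetical and are not the dataset's actual schema. Treating a consistently failing test as a model failure is also a simplification, since a broken environment can fail consistently too.

```python
def triage(test_runs: list[dict[str, bool]], expected_pass: set[str]) -> str:
    """Classify one task attempt from repeated runs of the same patched code.

    test_runs: one dict per rerun, mapping test name -> passed (hypothetical format).
    expected_pass: tests that should pass once the issue is resolved.
    """
    if not test_runs:
        raise ValueError("need at least one test run")

    flaky, failing = set(), set()
    for test in expected_pass:
        # A test missing from a run is counted as a failure in that run.
        outcomes = {run.get(test, False) for run in test_runs}
        if outcomes == {True, False}:
            flaky.add(test)      # result changes between identical reruns
        elif outcomes == {False}:
            failing.add(test)    # fails on every rerun

    if failing:
        return "model_failure"
    if flaky:
        return "environment_issue"
    return "resolved"


print(triage([{"test_fix": True}, {"test_fix": False}], {"test_fix"}))
```

Separating the two labels keeps flaky tests from being counted against the agent during training or evaluation.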
📊 Competitor Analysis

| Feature | SWE-rebench V2 | SWE-bench | SWE-Gym |
| --- | --- | --- | --- |
| Languages | 20 (language-agnostic) [1] | Primarily Python [3][7] | Python-only [1] |
| Tasks | 32k executable + 120k raw [1] | ~2k verified [7] | Executable Python tasks [1] |
| Benchmarks | swe-rebench.com leaderboard [1] | swebench.com leaderboard [7] | Trajectory data for RL [1] |
| Pricing | Open/free (Hugging Face) [1] | Open/free [7] | Open/free [1] |

🛠️ Technical Deep Dive

  • Automated pipeline: An interactive setup agent generates repo-specific Docker install and test procedures. An ensemble of LLM judges filters out unsound tasks and is validated against SWE-bench Verified annotations (1,699 human-scored instances).[1][8]
  • Data processing: Map-reduce operations join issues with PRs, keep permissively licensed repositories and PRs that add new tests, split patches, and compute metadata. TractoAI handles filesystem-intensive operations, logs, and test statuses.[2]
  • Evaluation infra: Kubernetes scales to 8k agent pods. A SWE-bench Verified run evaluates 2,500+ solutions (5 runs per task) in ~18 min on a TractoAI cluster with prebuilt images.[2]
  • Quality assessment: Multi-pass consensus, similar to SPICE, rates issue clarity and test coverage. Issue and PR creation dates are tracked so tasks can be decontaminated against model release dates.[1][3]
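
The data-processing step in the list above can be sketched as a join followed by filters. This is a minimal in-memory illustration, not the TractoAI map-reduce implementation; the record fields (`repo`, `issue_number`, `fixes_issue`, `license`, `added_tests`, `body`) and the license allowlist are assumptions made for the example.

```python
# Assumed allowlist of permissive SPDX identifiers; the real pipeline's list is not stated.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def build_candidates(issues: list[dict], pull_requests: list[dict]) -> list[dict]:
    """Join issues to the PRs that fix them, keeping permissive PRs that add tests."""
    # Index issues by (repo, issue number) so the join is a dictionary lookup.
    issues_by_key = {(i["repo"], i["issue_number"]): i for i in issues}

    candidates = []
    for pr in pull_requests:
        issue = issues_by_key.get((pr["repo"], pr["fixes_issue"]))
        if issue is None:
            continue  # PR is not linked to a known issue
        if pr["license"] not in PERMISSIVE_LICENSES:
            continue  # drop non-permissive repositories
        if not pr["added_tests"]:
            continue  # without new tests there is nothing to verify a fix against
        candidates.append({
            "repo": pr["repo"],
            "issue_number": issue["issue_number"],
            "problem_statement": issue["body"],
            "tests": pr["added_tests"],
        })
    return candidates


issues = [{"repo": "a/b", "issue_number": 7, "body": "Crash on empty input"}]
prs = [
    {"repo": "a/b", "fixes_issue": 7, "license": "MIT", "added_tests": ["test_empty"]},
    {"repo": "a/b", "fixes_issue": 7, "license": "GPL-3.0", "added_tests": ["test_x"]},
    {"repo": "a/b", "fixes_issue": 9, "license": "MIT", "added_tests": ["test_y"]},
]
print(build_candidates(issues, prs))
```

Only the first PR survives here: the second fails the license filter and the third has no matching issue.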

🔮 Future Implications
AI analysis grounded in cited sources

SWE-rebench V2 enables RL training of multilingual SWE agents beyond Python dominance
Its 20-language coverage and 32k+ executable tasks address gaps in prior Python-centric datasets like SWE-Gym, supporting scalable agent development.[1]
Leaderboard standardizes evaluations, reducing self-reported benchmark inflation
Fixed ReAct framework, 128k token context, and team-run evals with contamination tracking provide reliable, comparable SWE agent performance metrics.[3]
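
Contamination tracking by date can be sketched as a simple filter: a task is treated as clean for a given model only if its issue was created after that model's release. The field name `issue_created`, the function name, and the example dates are hypothetical; the leaderboard's actual rule may differ.

```python
from datetime import date


def uncontaminated(tasks: list[dict], model_release: date) -> list[dict]:
    """Keep tasks whose issue was created strictly after the model's release date."""
    return [t for t in tasks if t["issue_created"] > model_release]


tasks = [
    {"id": "task-old", "issue_created": date(2024, 3, 1)},
    {"id": "task-new", "issue_created": date(2025, 6, 15)},
]
clean = uncontaminated(tasks, model_release=date(2025, 1, 1))
print([t["id"] for t in clean])
```

A strict comparison is the conservative choice here: a task created on the release date itself is excluded.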
Nebius infrastructure accelerates SWE agent iteration cycles
Custom pipelines for dataset building and 8k-pod scaling enable rapid experimentation, as shown by SOTA open-weight agents hitting 40.6% on SWE-bench Verified.[2][6]
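
As a rough back-of-the-envelope check on iteration speed, the figures reported in the Technical Deep Dive (2,500+ solutions, 5 runs per task, about 18 minutes) imply the throughput below. The constants are the reported lower bound and approximation, so the result is an estimate rather than a measured rate.

```python
def solutions_per_minute(solutions: int, minutes: float) -> float:
    """Average evaluation throughput over one run."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return solutions / minutes


RUNS_PER_TASK = 5
SOLUTIONS = 2_500          # reported as "2,500+", so this is a lower bound
WALL_CLOCK_MINUTES = 18    # reported as approximate

tasks = SOLUTIONS // RUNS_PER_TASK
rate = solutions_per_minute(SOLUTIONS, WALL_CLOCK_MINUTES)
print(f"{tasks} tasks, about {rate:.0f} solutions evaluated per minute")
```

This works out to 500 tasks and roughly 139 evaluated solutions per minute.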

⏳ Timeline

2025-09
Nebius AI R&D begins SWE agent research, develops data collection pipelines and TractoAI integration.[2]
2025-09-28
Releases SWE-bench-extra (6,411 tasks) and SWE-agent-trajectories on Hugging Face for training.[6][9]
2025
Publishes original SWE-rebench paper with 21k+ Python tasks and contamination-free benchmark.[4]
2025
Launches swe-rebench.com with standardized leaderboard and continuous task mining pipeline.[3]
2026-02
Releases SWE-rebench V2 paper on arXiv, expanding to 32k+ tasks across 20 languages.[1]
2026-03
Nebius announces SWE-rebench-V2 publicly with technical report and Discord leaderboard.[article]