AI Updates Aggregator

🦙 Reddit r/LocalLLaMA • Mar 3, 2026

SWE-rebench-V2: Largest Open Coding Dataset

🦙Read original on Reddit r/LocalLLaMA

#dataset #rl-training #multilingual-code swe-rebench-v2

💡 Largest open multilingual coding dataset for RL training of code agents, aimed at teams competing on coding leaderboards.

⚡ 30-Second TL;DR

What Changed

32,000+ executable tasks derived from real issues, each with its own Docker environment

Why It Matters

This dataset enables scalable RL training for multilingual code agents, potentially accelerating open-source advancements in coding AI beyond Python-centric benchmarks. It lowers barriers for researchers training competitive models.

What To Do Next

Download SWE-rebench-V2 from Hugging Face and fine-tune your code agent on its 32k tasks.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 9 cited sources.

🔑 Enhanced Key Takeaways

•SWE-rebench V2 employs an interactive setup agent to synthesize repository-specific installation and test procedures, enabling reproducible Docker environments across 3,600+ repositories.[1]
•The dataset includes diagnostic metadata that distinguishes model failures from environment issues such as flaky tests; the labels come from an ensemble of LLM judges calibrated against human-verified SWE-bench annotations.[1]
•Nebius built custom Kubernetes infrastructure that scales to 8,000 parallel agent pods, and uses TractoAI for storage, processing, and evaluation of thousands of solutions per experiment.[2]
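The distinction between model failures and environment issues can be illustrated with a minimal sketch. It assumes a flaky test is one whose outcome changes across repeated runs of the reference patch; the function names, labels, and the dict-of-booleans result format are hypothetical and not taken from the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical result format: {test_name: passed} for one test run.
TestRun = Dict[str, bool]


def find_flaky_tests(gold_runs: List[TestRun]) -> Set[str]:
    """Return tests whose outcome differs across repeated runs of the reference patch."""
    outcomes = defaultdict(set)
    for run in gold_runs:
        for test, passed in run.items():
            outcomes[test].add(passed)
    # A test that both passed and failed under identical code is treated as flaky.
    return {test for test, seen in outcomes.items() if len(seen) > 1}


def classify_failure(model_run: TestRun, gold_runs: List[TestRun]) -> str:
    """Label a model attempt as resolved, an environment issue, or a model failure."""
    flaky = find_flaky_tests(gold_runs)
    failed = {test for test, passed in model_run.items() if not passed}
    if not failed:
        return "resolved"
    # If every failing test is known to be flaky, do not blame the model.
    if failed <= flaky:
        return "environment_issue"
    return "model_failure"
```

With this labeling, an attempt that only trips tests already unstable under the reference patch is excluded from the model's failure count rather than penalized.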

📊 Competitor Analysis

Feature	SWE-rebench V2	SWE-bench	SWE-Gym
Languages	20 (language-agnostic) [1]	Primarily Python [3][7]	Python-only [1]
Tasks	32k executable + 120k raw [1]	~2k verified [7]	Executable Python tasks [1]
Leaderboard / data	swe-rebench.com leaderboard [1]	swebench.com leaderboard [7]	Trajectory data for RL [1]
Pricing	Open/free (Hugging Face) [1]	Open/free [7]	Open/free [1]

🛠️ Technical Deep Dive

•Automated pipeline: An interactive setup agent generates repo-specific Docker install and test procedures; an ensemble of LLM judges filters out unsound tasks, validated against the human annotations behind SWE-bench Verified (1,699 human-scored instances).[1][8]
•Data processing: Map-reduce operations join issues with PRs, keep only repositories with permissive licenses and PRs that add new tests, split patches, and compute metadata; TractoAI handles filesystem-intensive operations, logs, and test statuses.[2]
•Evaluation infra: Kubernetes scales to 8k agent pods; one SWE-bench Verified evaluation covers 2,500+ solutions (5 runs/task) in ~18 min on a TractoAI cluster with prebuilt images.[2]
•Quality assessment: Multi-pass consensus labeling, similar to SPICE, scores issue clarity and test coverage; issue and PR creation dates are tracked so tasks can be decontaminated against model release dates.[1][3]
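The ensemble-judge filtering step can be sketched as a simple majority vote. This is an illustration under stated assumptions, not the pipeline's actual logic: the judges here are plain Python callables standing in for LLM calls, and the `issue` field, function names, and 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, List

Task = Dict[str, str]
# Stand-in for an LLM judge: returns True if the task looks sound.
Judge = Callable[[Task], bool]


def consensus_keep(task: Task, judges: List[Judge], min_fraction: float = 0.5) -> bool:
    """Keep a task only if more than min_fraction of the judges accept it."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [judge(task) for judge in judges]
    return sum(votes) / len(votes) > min_fraction


def filter_tasks(tasks: List[Task], judges: List[Judge], min_fraction: float = 0.5) -> List[Task]:
    """Return the tasks that pass the ensemble vote, preserving input order."""
    return [task for task in tasks if consensus_keep(task, judges, min_fraction)]


# Usage with toy judges (real judges would wrap model calls).
toy_judges = [
    lambda t: len(t["issue"]) > 10,   # issue text is not trivially short
    lambda t: "test" in t["issue"],   # issue mentions tests
    lambda t: True,                   # permissive judge
]
toy_tasks = [{"issue": "fix failing test in parser"}, {"issue": "bug"}]
kept = filter_tasks(toy_tasks, toy_judges)
```

A strict majority is only one possible rule; a stricter pipeline could require unanimity by raising `min_fraction`.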

🔮 Future Implications

AI analysis grounded in cited sources

SWE-rebench V2 enables RL training of multilingual SWE agents, moving beyond Python-only data

Its 20-language coverage and 32k+ executable tasks address gaps in prior Python-centric datasets like SWE-Gym, supporting scalable agent development.[1]

Leaderboard standardizes evaluations, reducing self-reported benchmark inflation

A fixed ReAct framework, a 128k-token context, and team-run evaluations with contamination tracking provide reliable, comparable performance metrics for SWE agents.[3]
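Contamination tracking by date can be sketched as follows. The sketch assumes a task counts as clean for a given model only if its issue was created after that model's release date; the `created` field and the function name are hypothetical, and a real setup might compare against a training cutoff instead.

```python
from datetime import date
from typing import Dict, List


def clean_subset(tasks: List[Dict], model_release: date) -> List[Dict]:
    """Keep only tasks created strictly after the model's release date."""
    return [task for task in tasks if task["created"] > model_release]


# Usage with made-up task IDs and dates.
example_tasks = [
    {"id": "task-a", "created": date(2025, 6, 1)},
    {"id": "task-b", "created": date(2026, 1, 15)},
]
fresh = clean_subset(example_tasks, model_release=date(2025, 12, 1))
```

Evaluating each model only on its own clean subset keeps scores comparable even as new tasks are mined over time.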

Nebius infrastructure accelerates SWE agent iteration cycles

Custom pipelines for dataset building and 8k-pod scaling enable rapid experimentation, as shown by SOTA open-weight agents hitting 40.6% on SWE-bench Verified.[2][6]
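The evaluation figures cited for the infrastructure (2,500+ solutions per run, 5 runs per task, about 18 minutes, 8k pods) can be checked with a little arithmetic. The sketch assumes SWE-bench Verified's 500 tasks and one pod per solution; the one-pod-per-solution mapping is an assumption, not something the sources state.

```python
import math

TASKS = 500            # size of SWE-bench Verified
RUNS_PER_TASK = 5      # from the cited evaluation setup
WALL_MINUTES = 18      # approximate wall-clock time cited
MAX_PODS = 8000        # cited pod ceiling

solutions = TASKS * RUNS_PER_TASK               # total solutions per evaluation
per_minute = solutions / WALL_MINUTES           # average throughput
waves = math.ceil(solutions / MAX_PODS)         # scheduling waves, assuming one pod per solution

print(f"{solutions} solutions, about {per_minute:.1f} per minute, {waves} wave(s)")
```

Under these assumptions a full evaluation fits within the pod ceiling in a single wave, so wall-clock time is bounded by the slowest individual runs rather than by queueing.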

⏳ Timeline

2025-09

Nebius AI R&D begins SWE agent research, develops data collection pipelines and TractoAI integration.[2]

2025-09-28

Releases SWE-bench-extra (6,411 tasks) and SWE-agent-trajectories on Hugging Face for training.[6][9]

2025

Publishes original SWE-rebench paper with 21k+ Python tasks and contamination-free benchmark.[4]

2025

Launches swe-rebench.com with standardized leaderboard and continuous task mining pipeline.[3]

2026-02

Releases SWE-rebench V2 paper on arXiv, expanding to 32k+ tasks across 20 languages.[1]

2026-03

Nebius announces SWE-rebench-V2 publicly with technical report and Discord leaderboard.[article]

📎 Sources (9)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
