
Beihang's Code2Bench Ends Code Model Score Inflation

🧠Read original on 机器之心

💡 New ICLR'26 benchmark dispels the high-score illusion of code LLMs: test your model now

⚡ 30-Second TL;DR

What Changed

Code2Bench addresses data contamination, which turns static evaluations into 'open-book' memory tests.

Why It Matters

Code2Bench makes the benchmark itself evolve dynamically, so models must demonstrate real reasoning rather than recall of memorized solutions. It could become a standard for fair model comparisons.

What To Do Next

Run your code LLM on the Code2Bench leaderboard at code2bench.github.io to benchmark generalization.

Who should care: Researchers & Academics

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • Code2Bench, developed by Beihang University researchers, is an open-sourced dynamic benchmark for code LLMs that generates fresh coding problems on-the-fly to prevent data contamination, as detailed on its official GitHub site code2bench.github.io.
  • It employs dual scaling mechanisms: problem complexity scaling by expanding source problem variants and test rigor scaling with diverse edge cases, addressing weak evaluation rigor in existing benchmarks like HumanEval.
  • The benchmark features a fully automated end-to-end framework for problem generation, execution, and evaluation, eliminating human biases and the 'illusion of correctness' from memorized solutions.
  • Accepted to ICLR 2026, Code2Bench has a live leaderboard ranking top code models like GPT-4o, Claude 3.5 Sonnet, and DeepSeek-Coder-V2, revealing significant performance drops compared to contaminated benchmarks.
  • Early leaderboard results show top models scoring 20-40% lower than on static benchmarks, validating its effectiveness in measuring true generalization capabilities.
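
To make the size of such a drop concrete, the sketch below computes both the absolute drop (in percentage points) and the relative drop between a static-benchmark score and a Code2Bench score. The function name and the example scores are hypothetical and for illustration only; they are not leaderboard results, and the source does not say which of the two readings the 20-40% figure uses.

```python
def score_drop(static_score: float, dynamic_score: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction).

    Both scores are pass rates on a 0-100 scale.
    """
    if static_score <= 0:
        raise ValueError("static_score must be positive")
    absolute = static_score - dynamic_score
    relative = absolute / static_score
    return absolute, relative


# Hypothetical scores, NOT taken from the Code2Bench leaderboard.
points, fraction = score_drop(static_score=90.0, dynamic_score=60.0)
print(f"absolute drop: {points:.1f} points, relative drop: {fraction:.1%}")
```

The same pair of scores yields a 30-point absolute drop but a 33.3% relative drop, so the two readings should not be mixed when comparing models.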

📊 Competitor Analysis

| Benchmark | Dynamic Generation | Contamination Resistance | Scaling Mechanism | Leaderboard | Open Source |
| --- | --- | --- | --- | --- | --- |
| Code2Bench | Yes | Dual Scaling | Yes (Problem + Test Rigor) | Live at code2bench.github.io | Yes |
| HumanEval | No | Low | No | Static | Yes |
| MBPP | No | Medium | No | Static | Yes |
| LiveCodeBench | Partial | High | Partial | Live | Yes |
| SWE-Bench | No | High | No | Live | Yes |

🛠️ Technical Deep Dive

  • Problem Generation: Uses a seed set of 236 core problems; applies dual scaling with 5 complexity levels (expanding inputs/outputs) and 4 test rigor levels (unit tests, edge cases, multi-step verification).
  • Automation Pipeline: LLM-driven generation via GPT-4, followed by execution in isolated Docker environments with Python 3.10, supporting libraries like numpy, pandas.
  • Evaluation Metrics: Pass@1, Pass@10 under temperature=0; strict equivalence checking with hidden test cases to prevent leakage.
  • Contamination Mitigation: Problems regenerated periodically; checks against training data corpora like The Stack v2.
  • Implementation: Built with Python, available at github.com/code2bench/code2bench; supports custom model integration via OpenAI/vLLM APIs.
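
The Pass@1 and Pass@10 metrics listed above are commonly computed with the unbiased pass@k estimator introduced alongside HumanEval: from n generated samples per problem, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The following is a minimal sketch, assuming Code2Bench uses this standard estimator (the source does not confirm this); the function name `pass_at_k` is illustrative and not part of the Code2Bench codebase.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass every test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

If decoding is fully deterministic, all n samples for a problem are identical, so c is either 0 or n and pass@k equals pass@1 for every k.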

🔮 Future Implications

AI analysis grounded in cited sources.

Code2Bench sets a new standard for code LLM evaluation, pressuring model developers to prioritize generalization over memorization. Expect widespread adoption in industry benchmarks, influencing model training paradigms and reducing hype-driven score inflation. It may accelerate progress in robust code generation while exposing gaps in current SOTA models.

Timeline

  • 2025-11: Beihang researchers submit the Code2Bench paper to ICLR 2026.
  • 2026-01: Paper accepted to ICLR 2026; GitHub repository and leaderboard launched.
  • 2026-02: Official release and coverage by 机器之心; initial leaderboard populated with top models.
Original source: 机器之心