Co-rewarding: Label-Free Stable RL for LLM Reasoning
🧠 #self-supervised-rl #reward-hacking #llm-reasoning

🧠Read original on 机器之心

💡 Accepted to ICLR 2026: a stable, label-free RL framework for LLM reasoning that resists reward hacking.

⚡ 30-Second TL;DR

What changed

ICLR 2026 acceptance for Co-rewarding self-supervised RL framework

Why it matters

Enables scalable RL training for reasoning without expensive annotations, potentially accelerating LLM advancements for researchers and developers facing data scarcity.

What to do next

Read the paper at https://openreview.net/forum?id=fDk95XPsCU and test Co-rewarding on your LLM RL pipeline.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 6 cited sources.

🔑 Key Takeaways

  • Co-rewarding is a self-supervised RL framework for LLM reasoning, accepted to ICLR 2026, developed by researchers from Hong Kong Baptist University and Shanghai Jiao Tong University.
  • It addresses RLVR's annotation bottleneck and the failure modes of self-rewarding by using complementary self-supervised signals from data and model views, eliminating the need for ground-truth labels (a minimal data-side sketch follows this list).
  • The framework prevents reward hacking and training collapse, inducing stable reasoning capabilities in large language models.
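
The article does not spell out how the data-side signal is computed, so the following is only a minimal sketch of one plausible instantiation: sample answers to several rephrasings ("views") of the same question, treat the majority answer as a pseudo ground truth, and reward rollouts that agree with it. The function `data_view_reward` and the voting scheme are illustrative assumptions, not the paper's exact mechanism.

```python
from collections import Counter

def data_view_reward(answers_per_view: list[list[str]]) -> list[float]:
    """Hypothetical data-side, label-free reward: answers sampled for several
    rephrasings (views) of the same question vote for a pseudo ground truth;
    rollouts for the original question are rewarded when they match that vote."""
    # Pool all sampled final answers across views and take the majority answer.
    all_answers = [a for view in answers_per_view for a in view]
    pseudo_label, _ = Counter(all_answers).most_common(1)[0]
    # Score the rollouts for the original phrasing against the pseudo-label.
    return [1.0 if a == pseudo_label else 0.0 for a in answers_per_view[0]]

# Example: three views of one math question, four sampled final answers each.
views = [
    ["42", "41", "42", "42"],  # original phrasing
    ["42", "42", "40", "42"],  # rephrasing 1
    ["42", "42", "42", "39"],  # rephrasing 2
]
print(data_view_reward(views))  # [1.0, 0.0, 1.0, 1.0]
```

On its own, such a voting signal could still be gamed by a policy that emits the same answer everywhere; the article's point is that it is paired with a complementary model-side signal, described in the Technical Deep Dive below.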
📊 Competitor Analysis

| Method | Key Feature | Benchmarks |
|---|---|---|
| Co-rewarding | Label-free self-supervised RL with complementary signals | Prevents collapse in LLM reasoning (no specific benchmarks in results) |
| REA-RL | Reflection-aware online RL for efficient reasoning models | ICLR 2026 acceptance [2] |
| SMORM | Joint Bradley-Terry and multi-objective reward modeling | Outperforms a 70B baseline with a 7B model in OOD settings [4] |
| IntDiff | Intrinsic rewards for diffusion model fine-tuning | Improves alignment and diversity in text-to-image generation [5] |
| LongR | Contextual dense rewards for long-context reasoning | ~4% gain over outcome-only baselines [3] |

🛠️ Technical Deep Dive

  • Co-rewarding employs complementary self-supervised signals from the data and model perspectives to stabilize RL training without labels, aiming to prevent reward hacking while inducing reasoning in LLMs (a model-side sketch follows this list).
  • No specific model architecture or implementation details (e.g., code links) were found in the search results beyond the original article.
  • Related work: LongR uses Relative Information Gain (a white-box metric) and an interleaved Think-and-Read policy with curriculum learning for long-context RLVR [3].
  • REA-RL provides a GitHub implementation of reflection-aware online RL and was also accepted to ICLR 2026 [2].
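
The article likewise leaves the model-side signal unspecified. One common label-free pattern, shown below purely as an assumption, keeps a slowly updated (EMA) copy of the policy and rewards rollouts whose final answer agrees with that reference model. `ema_update` and `model_view_reward` are illustrative names, not APIs from the paper's codebase.

```python
import copy
import torch

def ema_update(reference: torch.nn.Module, policy: torch.nn.Module, tau: float = 0.99) -> None:
    """Drag the frozen reference model slowly toward the current policy (EMA)."""
    with torch.no_grad():
        for p_ref, p_pol in zip(reference.parameters(), policy.parameters()):
            p_ref.mul_(tau).add_(p_pol, alpha=1.0 - tau)

def model_view_reward(policy_answers: list[str], reference_answer: str) -> list[float]:
    """Reward policy samples whose final answer matches the reference model's answer."""
    return [1.0 if a == reference_answer else 0.0 for a in policy_answers]

# Toy usage with a stand-in linear "model"; a real setup would use an LLM policy.
policy = torch.nn.Linear(8, 8)
reference = copy.deepcopy(policy)
for p in reference.parameters():
    p.requires_grad_(False)        # the reference is never trained directly

ema_update(reference, policy)      # called once per optimizer step
print(model_view_reward(["42", "7", "42"], reference_answer="42"))  # [1.0, 0.0, 1.0]
```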

🔮 Future Implications (AI analysis grounded in cited sources)

Co-rewarding advances label-free RL for LLMs, potentially reducing annotation costs and improving reasoning stability. Amid growing ICLR attention to reward-modeling robustness, it points toward scalable self-improvement in reasoning models.

📎 Sources (6)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. papercopilot.com
  2. github.com
  3. arxiv.org
  4. openreview.net
  5. openreview.net
  6. microsoft.com

Researchers from Hong Kong Baptist University and Shanghai Jiao Tong University propose Co-rewarding, a self-supervised RL framework accepted to ICLR 2026. It stabilizes training without ground-truth labels by using complementary self-supervised signals, preventing reward hacking and collapse when inducing reasoning in LLMs, and it addresses RLVR's annotation bottleneck and the failure modes of self-rewarding approaches.

Key Points

  1. ICLR 2026 acceptance for the Co-rewarding self-supervised RL framework
  2. Eliminates the need for labeled data in RLVR by using complementary signals from data and model views
  3. Prevents training collapse and reward hacking in self-rewarding LLMs
  4. Induces stable reasoning capabilities in large language models
  5. Paper and code links provided for replication

Impact Analysis

Enables scalable RL training for reasoning without expensive annotations, potentially accelerating LLM advancements for researchers and developers facing data scarcity.

Technical Details

Co-rewarding introduces mutually complementary self-supervised signals on the data and model sides to stabilize reward acquisition and make the reward harder to guess or game, avoiding RL collapse.
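
As a rough illustration of why two complementary signals are harder to game than a single self-assigned score, the sketch below (an assumption, not the paper's formula) credits a rollout only when both the data-side and model-side views agree; `co_reward` is a hypothetical helper.

```python
def co_reward(data_view: list[float], model_view: list[float]) -> list[float]:
    """Hypothetical combination of the two label-free signals: a rollout is
    rewarded only when both views agree, so collapsing to an output that
    satisfies one proxy signal alone earns nothing."""
    return [d * m for d, m in zip(data_view, model_view)]

# Rollout 1 satisfies both views; rollouts 2 and 3 satisfy only one each.
print(co_reward([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # [1.0, 0.0, 0.0]
```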


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 机器之心