Researchers from Hong Kong Baptist University and Shanghai Jiao Tong University propose Co-rewarding, a self-supervised RL framework accepted to ICLR 2026. It stabilizes training without ground-truth labels by using complementary self-supervised signals, preventing the reward hacking and training collapse that arise when self-rewarding is used to elicit reasoning in LLMs, and it addresses RLVR's annotation bottleneck.
Key Points
- ICLR 2026 acceptance for the Co-rewarding self-supervised RL framework
- Eliminates the need for labeled data in RLVR via complementary signals from data and model views
- Prevents training collapse and reward hacking in self-rewarding LLMs
- Induces stable reasoning capabilities in large language models
- Paper and code links provided for replication
Impact Analysis
Enables scalable RL training for reasoning without expensive annotations, potentially accelerating LLM advancements for researchers and developers facing data scarcity.
Technical Details
Co-rewarding introduces mutually complementary self-supervised signals on the data side and the model side. Each view cross-checks the other, which stabilizes reward estimation and makes the reward harder to game, preventing the collapse seen in single-view self-rewarding RL.
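
To make this concrete, below is a minimal, hypothetical Python sketch of a co-rewarding-style reward. It assumes two illustrative views: a data-side signal that checks agreement with a majority-vote pseudo-label on a rephrased copy of the question, and a model-side signal that checks agreement with a slowly updated reference policy. The `Policy` class, the function names, and the specific agreement rules are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of a co-rewarding-style self-supervised reward.
# No ground-truth answers are used anywhere: both signals come from the
# model's own behavior viewed from two complementary angles.
from collections import Counter
from dataclasses import dataclass, field
import random

@dataclass
class Policy:
    """Toy stand-in for an LLM policy: maps a question to a sampled answer."""
    bias: dict = field(default_factory=dict)  # question -> currently preferred answer

    def sample(self, question: str) -> str:
        # Return the preferred answer most of the time, otherwise a random option.
        preferred = self.bias.get(question, "unknown")
        return preferred if random.random() < 0.8 else random.choice(["A", "B", "C"])

def majority_answer(policy: Policy, question: str, k: int = 8) -> str:
    """Pseudo-label: majority vote over k sampled answers (no labels needed)."""
    votes = Counter(policy.sample(question) for _ in range(k))
    return votes.most_common(1)[0][0]

def co_reward(policy: Policy, ref_policy: Policy,
              question: str, rephrased: str, answer: str) -> float:
    # Data-side view: agreement with the majority vote on a *rephrased* question,
    # so a single memorized phrasing cannot trivially satisfy the reward.
    data_reward = float(answer == majority_answer(policy, rephrased))
    # Model-side view: agreement with a slowly updated reference policy
    # (e.g. an EMA copy), which anchors training against degenerate collapse.
    model_reward = float(answer == majority_answer(ref_policy, question))
    return 0.5 * (data_reward + model_reward)

if __name__ == "__main__":
    policy = Policy(bias={"q1": "A", "q1-rephrased": "A"})
    ref_policy = Policy(bias={"q1": "A"})  # stand-in for a slow reference copy
    answer = policy.sample("q1")
    print("answer:", answer, "reward:", co_reward(policy, ref_policy, "q1", "q1-rephrased", answer))
```

The intuition behind this sketch is that an answer must pass two independent checks at once, which raises the difficulty of hacking any single self-generated reward and keeps training anchored.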



