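RLCER applies reinforcement learning to chain-of-thought (CoT) reasoning, supervising it with self-evolving rubrics instead of human labels. It outperforms outcome-centric RLVR (reinforcement learning with verifiable rewards) on reasoning tasks. The rubrics also improve inference when supplied as prompts.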
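The summary above gives no implementation details, so the following is only a minimal sketch of the general idea: a rubric scores the chain of thought, that score is mixed with an outcome reward, and the same rubric can be turned into a prompt. Every name here (`RubricItem`, `rubric_score`, `combined_reward`, `rubric_prompt`, the weight `alpha`) is hypothetical, the keyword checkers stand in for whatever judge the method actually uses, the weighted-sum mix is an assumption, and the self-evolving update of the rubric is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One hypothetical rubric criterion: a description plus a checker."""
    description: str
    check: Callable[[str], bool]  # stand-in for a model-based judge (assumption)
    weight: float = 1.0


def rubric_score(cot: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric criteria that the chain of thought satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(cot))
    return earned / total


def combined_reward(cot: str, answer: str, reference: str,
                    rubric: List[RubricItem], alpha: float = 0.5) -> float:
    """Mix an outcome reward (exact match) with the rubric score.

    The linear mix and alpha are illustrative assumptions, not the
    method described in the summary.
    """
    outcome = 1.0 if answer.strip() == reference.strip() else 0.0
    return (1 - alpha) * outcome + alpha * rubric_score(cot, rubric)


def rubric_prompt(question: str, rubric: List[RubricItem]) -> str:
    """Prepend the rubric descriptions to a question as reasoning guidelines."""
    guidelines = "\n".join(f"- {item.description}" for item in rubric)
    return (
        "Follow these guidelines while reasoning:\n"
        f"{guidelines}\n\nQuestion: {question}"
    )


# Toy rubric with keyword checks in place of a learned judge.
example_rubric = [
    RubricItem("Break the problem into steps", lambda cot: "step" in cot.lower()),
    RubricItem("Verify the final result", lambda cot: "verify" in cot.lower()),
]
```

For instance, `combined_reward("Step 1: add 2 and 2.", "4", "4", example_rubric)` credits the correct answer in full and the reasoning in part, since only one of the two criteria is met. An outcome-only reward would not distinguish that trace from one that also verified its result.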
Key Points
- Autonomous CoT supervision
- No annotation needed
- Handles evolving distributions
Impact Analysis
By removing the need for human annotation, RLCER lets LLM reasoning be improved autonomously and at scale.