Meta-Experience Learning (MEL) enhances RLVR (reinforcement learning with verifiable rewards) by internalizing error-derived meta-experience into the LLM's memory. It uses the model's own self-verification to perform contrastive analysis of correct and incorrect trajectories, and achieves Pass@1 gains of 3.92%-4.73% across model sizes.
Key Points
- 1. Identifies bifurcation points where erroneous trajectories diverge from correct ones
- 2. Internalizes the resulting meta-experience via NLL (negative log-likelihood) minimization
- 3. Improves performance consistently across reasoning benchmarks
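The internalization step in point 2 can be sketched as an NLL objective over the meta-experience text. This is a minimal toy illustration, not the paper's training code: the per-token probabilities and the `nll_loss` helper are assumptions standing in for a full language-model fine-tuning loop.

```python
import math

def nll_loss(token_probs):
    """Mean negative log-likelihood over the tokens of a
    meta-experience string. Driving this loss toward zero
    internalizes the text into the model's weights."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities the model assigns to each token
# of a self-distilled meta-experience snippet.
probs = [0.9, 0.6, 0.8]
loss = nll_loss(probs)
```

In practice this loss is computed by the LLM itself over the distilled text, so minimizing it is ordinary supervised fine-tuning on self-generated data.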
Impact Analysis
MEL overcomes RLVR's limitations in fine-grained credit assignment and turns errors into reusable knowledge rather than discarded failures. The approach scales to larger LLMs, which benefit further from this fine-grained learning signal.
Technical Details
MEL builds on RLVR with self-distilled meta-experience: the LLM's self-verification ability is leveraged to contrast correct and incorrect trajectories, and the resulting natural-language insights serve as language rewards that bridge the two.
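The contrastive step above can be sketched as follows. This is a simplified stand-in, assuming trajectories are lists of discrete reasoning steps; in MEL the comparison and the distilled lesson are produced by the LLM's own self-verification, not by string matching, and all names here are hypothetical.

```python
def bifurcation_point(correct_steps, incorrect_steps):
    """Index of the first step where the incorrect trajectory
    diverges from the correct one -- the error's bifurcation point."""
    for i, (good, bad) in enumerate(zip(correct_steps, incorrect_steps)):
        if good != bad:
            return i
    # No divergence in the shared prefix; the shorter trajectory ended.
    return min(len(correct_steps), len(incorrect_steps))

# Hypothetical step-by-step solutions to the same problem.
correct = ["parse problem", "set up equation", "solve for x", "check answer"]
wrong   = ["parse problem", "set up equation", "drop the sign", "report x"]

step = bifurcation_point(correct, wrong)
# A meta-experience entry pairs the bifurcation point with a lesson,
# which then acts as a language reward for subsequent training.
meta_experience = (
    f"At step {step}, the model did '{wrong[step]}' "
    f"instead of '{correct[step]}'."
)
```

Localizing the divergence like this is what gives MEL finer-grained credit assignment than a single scalar outcome reward.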