DRPO tackles model collapse in off-policy generative recommendation via optimistic distributionally robust optimization. Proves that hard filtering recovers the high-quality data hidden in noisy logs. Achieves SOTA on mixed-quality benchmarks.
Key Points
1. Divergence theory explains the repulsive-optimization curse behind model collapse
2. Hard filtering emerges as the exact solution of the DRO inner problem (sketched below)
3. Breaks the tradeoff between imitating noise and increasing variance
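To make the second point concrete, here is a minimal sketch of why an optimistic inner problem reduces to hard filtering, assuming a density-ratio ambiguity set over reweightings of the empirical log distribution (the paper's exact ambiguity set and objective are not given in this summary, and `hard_filter_weights` is a hypothetical helper name):

```python
import numpy as np

def hard_filter_weights(losses: np.ndarray, keep_frac: float) -> np.ndarray:
    """Solve the optimistic inner problem
        min_Q  E_Q[loss]   s.t.  dQ/dP_hat <= 1/keep_frac
    over reweightings Q of the empirical distribution P_hat.
    (The density-ratio constraint is an illustrative assumption.)
    The optimum puts uniform mass on the keep_frac lowest-loss
    samples and zero elsewhere -- exactly hard filtering.
    """
    n = losses.size
    k = max(1, int(np.floor(keep_frac * n)))
    keep = np.argsort(losses)[:k]   # indices of the k cleanest samples
    w = np.zeros(n)
    w[keep] = 1.0 / k               # uniform weight on kept samples
    return w

# Toy usage: high-loss (noisy) log entries receive zero weight.
losses = np.array([0.2, 3.1, 0.5, 2.7, 0.1, 0.4])
print(hard_filter_weights(losses, keep_frac=0.5))
# nonzero weights only on the three lowest-loss samples
```

Because the constraint caps how much any single sample can be upweighted, the best-case distribution simply saturates the cap on the cleanest samples and drops the rest, so filtering is exact rather than a heuristic under this assumed set.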
Impact Analysis
Improves RL-based sequential recommendation trained from offline data. Mitigates the dominance of low-quality interactions in real-world logs. Boosts performance in e-commerce and content recommendation systems.
Technical Details
Reformulates off-policy optimization as an optimistic DRO problem, with theoretical guarantees that low-quality samples are discarded. arXiv:2602.10430v1.
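A hedged sketch of the reformulation (the divergence D, radius rho, and objective form are our assumptions; the summary does not specify them): where standard DRO takes a pessimistic worst case over the ambiguity ball, the optimistic variant takes the best-case reweighting of the logged data:

```latex
% Optimistic DRO over an ambiguity ball around the empirical log
% distribution \hat{P}; D and \rho are illustrative assumptions.
\max_{\theta} \;\; \max_{Q :\, D(Q \,\|\, \hat{P}) \le \rho} \;
  \mathbb{E}_{(x,a) \sim Q}\big[ \log \pi_\theta(a \mid x) \big]
```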