Proposes a trainable framework for interactive in-context learning, using multi-turn feedback generated through information asymmetry on verifiable tasks. Smaller trained models nearly match the performance of models 10x larger and generalize to coding, puzzles, and mazes. The framework enables self-improvement by internally modeling teacher critiques.
Key Points
1. Transforms single-turn tasks into multi-turn didactic interactions
2. Flagship LLMs struggle with corrective feedback on hard reasoning
3. Smaller trained models rival the multi-turn performance of 10x larger models
4. Generalizes out-of-distribution to coding, puzzles, and maze navigation
5. Self-corrects by predicting teacher critiques
Impact Analysis
Reduces reliance on massive models by enhancing the adaptability of smaller ones. Promotes efficient, generalizable AI for diverse applications. Paves the way for autonomous self-improving systems that need no external teacher.
Technical Details
A scalable method constructs feedback loops from verifiable tasks with information asymmetry: a teacher with privileged access to the solution critiques the student's attempts across turns. Training on these interactions instills in-context plasticity for dynamic adaptation, and external feedback is converted into internal self-correction by having the model predict the teacher's critiques.
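The feedback loop described above can be sketched with a toy verifiable task. Everything here is illustrative, not from the paper: the task (number guessing), the function names, and the rule-based teacher all stand in for learned components. The teacher holds the hidden target (the information asymmetry) and emits critiques; the student refines its answer over turns.

```python
def teacher_critique(target: int, guess: int) -> str:
    # Teacher verifies against the hidden target (privileged information)
    # and returns a corrective critique. In the paper's setting this role
    # would be played by a model or verifier, not a hand-written rule.
    if guess < target:
        return "too low"
    if guess > target:
        return "too high"
    return "correct"


def student_turn(lo: int, hi: int) -> int:
    # Toy student policy: propose the midpoint of its current belief
    # interval, which it narrows using the teacher's critiques.
    return (lo + hi) // 2


def interactive_loop(target: int, lo: int = 0, hi: int = 100,
                     max_turns: int = 10) -> list[tuple[int, str]]:
    # Multi-turn didactic interaction: each turn is (guess, critique).
    # Internalizing this loop would mean the student learns to predict
    # the critique itself, removing the external teacher at test time.
    transcript = []
    for _ in range(max_turns):
        guess = student_turn(lo, hi)
        critique = teacher_critique(target, guess)
        transcript.append((guess, critique))
        if critique == "correct":
            break
        if critique == "too low":
            lo = guess + 1
        else:
            hi = guess - 1
    return transcript


transcript = interactive_loop(37)
print(transcript)  # ends with (37, "correct")
```

The verifiable task makes the critique signal cheap and reliable, which is what lets the loop scale; self-correction then amounts to replacing `teacher_critique` with the student's own learned prediction of it.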
