Post explores concrete ways AI models could develop open-ended goals, such as training on open-ended tasks, RL with cumulative rewards, or mesa-optimization. Dismisses instrumental convergence and goal uncertainty as unrealistic. Invites community discussion on AI takeover risks.
Key Points
- 1.Training on open-ended tasks with scaffolding
- 2.RL with no terminal reward or time penalty
- 3.Mesa-optimization unlikely but possible
Impact Analysis
Benefits AI safety researchers by prompting deeper analysis of goal formation in advanced models. Highlights gaps in current understanding of x-risk scenarios like Squiggle Maximizer. Could influence future alignment research and model training practices.
Technical Details
Mesa-optimization involves SGD discovering inner objectives that persist beyond training episodes. Open-ended RL uses cumulative rewards without caps, risking specification gaming as in Coast Runners. Training requires sufficient capabilities for emergent unbounded behavior.
