A new arXiv paper shows that simple baselines match or outperform complex LLM-based code evolution techniques across three domains: mathematical bounds, agentic scaffolds, and ML competitions. It identifies key issues such as poor search space design and high evaluation variance, and proposes practices to improve the rigor of code evolution research.
Key Points
1. Simple baselines match or exceed code evolution on math bounds, agentic scaffolds, and ML competitions
2. Search space design and domain knowledge in prompts dictate performance more than the evolution pipeline
3. High evaluation variance for scaffolds on small datasets favors a hand-designed majority vote
4. The paper proposes low-stochasticity evaluations to make code evolution feasible
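The hand-designed majority-vote baseline in point 3 is conceptually simple: sample several LLM answers and keep the most frequent one. The paper does not specify an implementation, so the sketch below is a hypothetical illustration of the idea:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among candidate LLM outputs.

    Hypothetical helper sketching a hand-designed majority-vote
    scaffold: sample several completions, keep the mode."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled answers to the same question; the mode wins.
print(majority_vote(["42", "41", "42", "42", "17"]))  # -> 42
```

Because it aggregates independent samples, a majority vote suppresses per-sample noise without any search over scaffold code, which is why it can beat evolved scaffolds when evaluation variance is high.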
Impact Analysis
These findings challenge the reliance on sophisticated LLM code-search pipelines, favoring simpler, compute-efficient baselines. The paper urges embedding domain expertise in prompts and tightening evaluation protocols, which could accelerate practical AI code generation.
Technical Details
The study tests three domains: math bounds (where search space design is decisive), agentic scaffolds (where high evaluation variance makes a hand-designed majority vote the strongest approach), and ML competitions. Across all three, the code evolution pipeline matters less than prompt engineering and domain knowledge. The authors recommend evaluations with reduced stochasticity.
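One common way to reduce evaluation stochasticity is to average each candidate's score over several seeded runs, shrinking the noise on the estimate roughly as 1/sqrt(repeats). The paper does not give a recipe, so the names below (`evaluate_once`, `evaluate_low_variance`, the stand-in scaffold) are hypothetical, with a toy noise model in place of a real benchmark:

```python
import random
import statistics

def evaluate_once(scaffold, task, seed):
    # One noisy benchmark run: the scaffold's true score on the task
    # plus seeded Gaussian noise standing in for evaluation variance.
    rng = random.Random(seed)
    return scaffold(task) + rng.gauss(0, 0.05)

def evaluate_low_variance(scaffold, tasks, repeats=5):
    # Average each task over several seeded repeats; the noise on the
    # per-task mean shrinks roughly as 1/sqrt(repeats).
    per_task = []
    for i, task in enumerate(tasks):
        runs = [evaluate_once(scaffold, task, seed=i * repeats + r)
                for r in range(repeats)]
        per_task.append(statistics.mean(runs))
    return statistics.mean(per_task)

# A stand-in scaffold whose true score is 0.8 on every task; the
# averaged estimate lands close to 0.8 despite the per-run noise.
score = evaluate_low_variance(lambda task: 0.8, tasks=range(4))
print(round(score, 3))
```

Comparing candidates on such averaged scores makes the ranking far less sensitive to single-run noise, which is the failure mode the paper attributes to code evolution on small, high-variance evaluations.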