Autonomous agentic workflows can exhibit optimization instability, in which iterative self-improvement degrades classifier performance instead of improving it. The effect is strongest for low-prevalence clinical symptoms such as Long COVID brain fog (3% prevalence). In experiments with the open-source Pythia framework, validation sensitivity swung between 1.0 and 0.0 across iterations. A selector agent that retrospectively picks the best iteration outperformed both a guiding agent and expert lexicons, with a 331% F1 gain over the lexicons on brain fog.
Key Points
- 1. Optimization instability causes performance oscillation that worsens as class prevalence falls
- 2. At 3% prevalence, the classifier reached 95% accuracy while detecting zero positives, so accuracy alone hid a complete failure
- 3. Selector agent oversight beats both the guiding agent and expert lexicons (331% F1 gain on brain fog)
- 4. Tested on three symptoms: shortness of breath (23% prevalence), chest pain (12%), and Long COVID brain fog (3%)
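The accuracy trap in point 2 can be reproduced with simple arithmetic. The sketch below uses hypothetical counts (1,000 notes, 30 of them positive, matching a 3% prevalence), not the study's data, and a hypothetical helper named `confusion_metrics`. It shows that a classifier predicting "negative" for every note still scores very high accuracy while its sensitivity and F1 are zero.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute basic metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not part of any framework.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Sensitivity (recall): share of true positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 written directly in terms of counts to avoid dividing by zero.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


# Assumed cohort: 1,000 notes at 3% prevalence -> 30 positives, 970 negatives.
# A degenerate classifier that always predicts "negative" misses all 30.
all_negative = confusion_metrics(tp=0, fp=0, fn=30, tn=970)

print(all_negative)
```

Under these assumed counts accuracy is 970/1000 while sensitivity and F1 are both zero, which is why sensitivity or F1 is a safer monitoring signal than accuracy for rare symptoms.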
Impact Analysis
Exposes hidden risks in autonomous AI for medical tasks, where high accuracy can mask total failure on rare classes. Selector agents offer a practical way to stabilize results without heavy intervention, improving reliability on imbalanced datasets.
Technical Details
Pythia enables automated prompt optimization. In these experiments the guiding agent amplified overfitting, while the selector agent identified the peak-performing iterations. Across the three symptoms evaluated, the instability grew more severe as the symptom became rarer. Relative to expert lexicons, the selector improved F1 by 7% on chest pain and by 331% on brain fog.
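The retrospective selection idea can be illustrated with a minimal sketch. It is not Pythia's API: `IterationRecord`, `select_best_iteration`, the prompt strings, and the F1 values are all hypothetical, and it assumes the selector ranks iterations by validation F1, which the summary does not specify. The point is only that keeping the full iteration history and picking the peak avoids shipping whatever the last, possibly collapsed, iteration produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationRecord:
    """One optimization step: the prompt tried and its validation F1 (hypothetical)."""
    iteration: int
    prompt: str
    val_f1: float


def select_best_iteration(history: list[IterationRecord]) -> IterationRecord:
    """Return the record with the highest validation F1.

    Ties are broken in favor of the earliest iteration.
    """
    if not history:
        raise ValueError("history must contain at least one iteration")
    return max(history, key=lambda r: (r.val_f1, -r.iteration))


# Made-up oscillating trajectory, loosely mimicking unstable optimization.
history = [
    IterationRecord(1, "prompt v1", 0.10),
    IterationRecord(2, "prompt v2", 0.42),
    IterationRecord(3, "prompt v3", 0.00),
    IterationRecord(4, "prompt v4", 0.35),
    IterationRecord(5, "prompt v5", 0.00),
]

best = select_best_iteration(history)
final = history[-1]
print(f"selected iteration {best.iteration} (F1={best.val_f1}); "
      f"final iteration {final.iteration} has F1={final.val_f1}")
```

In this toy trajectory the final iteration has collapsed to zero F1, while the selector recovers the earlier peak. In practice the selected prompt would still need to be confirmed on a held-out test set, since choosing the maximum over noisy validation scores can itself overfit.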