A study finds that LLM agents such as Llama, GPT, and Claude produce 2-4 unique action paths per 10 runs on HotpotQA, and that inconsistency predicts failure. Consistent runs reach 80-92% accuracy versus 25-60% for inconsistent ones. The variance traces to early decisions such as the first search query.
Key Points
- 1. LLM agents produce 2-4 unique action paths per 10 HotpotQA runs
- 2. Inconsistency predicts failure: consistent runs reach 80-92% accuracy vs 25-60% for inconsistent ones
- 3. Variance traces to early decisions such as the first search query
Impact Analysis
Researchers and LLM agent developers gain a practical signal: behavioral variance across repeated runs predicts failure. The gap between consistent runs (80-92% accuracy) and inconsistent runs (25-60%) argues for a focus on stabilizing early decisions. Potential effects include agent training that targets higher reliability and accuracy in multi-step tasks.
Technical Details
The study ran LLM agents (Llama, GPT, Claude) 10 times each on HotpotQA and counted the unique action paths. Inconsistency correlated strongly with lower success rates. Root-cause analysis identified early actions, such as the initial search query, as the primary sources of variance.
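
The two measurements described here, counting unique action paths across repeated runs and locating the step where runs first diverge, can be sketched in Python as follows. This is an illustrative sketch, not the study's code: the representation of a run as a list of action strings, the function names `unique_paths` and `first_divergence`, and the toy data are all assumptions.

```python
from typing import Optional, Sequence

# Assumption: each run is recorded as an ordered list of action strings.
Run = Sequence[str]


def unique_paths(runs: Sequence[Run]) -> int:
    """Count the distinct action sequences among repeated runs of one task."""
    return len({tuple(run) for run in runs})


def first_divergence(runs: Sequence[Run]) -> Optional[int]:
    """Return the first step index at which the runs disagree.

    Returns None when every run follows the same path. A run that has
    already ended counts as disagreeing with one that is still acting.
    """
    longest = max((len(run) for run in runs), default=0)
    for step in range(longest):
        actions = {run[step] if step < len(run) else None for run in runs}
        if len(actions) > 1:
            return step
    return None


# Toy data (hypothetical): three runs of the same question.
runs = [
    ["search[query A]", "lookup[x]", "finish[answer 1]"],
    ["search[query A]", "lookup[x]", "finish[answer 1]"],
    ["search[query B]", "finish[answer 2]"],
]

print(unique_paths(runs))      # 2
print(first_divergence(runs))  # 0
```

In the toy data the runs already split at step 0, the first search query, which is the kind of early divergence the study points to as the main source of variance.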