
DIVE Scales Diversity for Tool-Use Generalization


💡 Diversity > quantity: +22 pts on OOD tool-use benchmarks with 4x less data

โšก 30-Second TL;DR

What Changed

Inverts task synthesis: execute tools first, then derive tasks from the resulting traces

Why It Matters

DIVE addresses brittleness in agentic LLMs by prioritizing diversity, enabling robust tool-use generalization with less data. This shifts training paradigms toward quality over quantity, benefiting scalable agent development.

What To Do Next

Download the DIVE dataset (arXiv:2603.11076) and fine-tune your tool-using LLM for OOD gains.

Who should care: Researchers & Academics

๐Ÿง  Deep Insight

Web-grounded analysis with 8 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขDIVE employs an Evidence Collection-Task Derivation loop to generate executable tasks from tool execution traces, ensuring verifiability while expanding coverage to edge cases in tool usage.[8]
  • โ€ขThe dataset covers five domains including web browsing, file management, database querying, code execution, and multimedia processing, enabling broad real-world applicability.[8]
  • โ€ขTraining incorporates 48k supervised fine-tuning examples followed by 3.2k reinforcement learning steps using PPO on tool-use specific rewards.

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขEvidence Collection phase involves executing diverse tool combinations across 373 tools, logging traces including inputs, outputs, errors, and intermediate states.
  • โ€ขTask Derivation uses trace analysis to infer entailed tasks, such as 'query database for user info' from a successful SQL execution trace.
  • โ€ขDiversity scaling achieved via submodular selection for tool-pool coverage and per-task variety, prioritizing underrepresented execution paths.
  • โ€ขEvaluation on 9 out-of-distribution benchmarks including ToolBench, API-Bank, and custom held-out tool sets shows +22 average point gain over baselines.

🔮 Future Implications

AI analysis grounded in cited sources.

  • DIVE will raise the state of the art for tool-use generalization by 15-20% on new benchmarks within 12 months. Its diversity-focused data generation outperforms quantity-based scaling even with 4x less data, providing a scalable recipe for future agent training.
  • Open-sourcing the DIVE datasets will accelerate open-model tool capabilities, closing the gap to proprietary agents. Demonstrated gains on the accessible Qwen3-8B suggest similar boosts for other open models via transfer learning on the released traces.

โณ Timeline

2026-03
DIVE paper released on arXiv introducing inverted task synthesis for tool-use generalization
📰 Weekly AI Recap

Read this week's curated digest of top AI events โ†’

👉 Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: ArXiv AI โ†—