BrowseComp-V³ is a new benchmark of 300 challenging questions for evaluating multimodal browsing agents on deep multi-hop reasoning across text and visuals. It features subgoal-driven process evaluation and publicly searchable evidence for reproducibility. Experiments show that state-of-the-art models reach only 36% accuracy, highlighting bottlenecks in integrating visual perception with web search.
Key Points
- 300 curated questions spanning diverse domains
- Visual-textual multi-hop reasoning
- OmniSeeker unified browsing agent framework
- Subgoal-driven fine-grained evaluation (illustrative scoring sketch below)
- SOTA models at only 36% accuracy
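One way to picture subgoal-driven process evaluation: score each intermediate subgoal separately from the final answer, so failures can be localized along the reasoning chain rather than judged only by the end result. The sketch below is a minimal illustration, not the benchmark's actual protocol; the field names (`expected_evidence`), the substring matching rule, and the aggregation are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Subgoal:
    # Hypothetical fields: an intermediate fact the agent must surface
    description: str
    expected_evidence: str  # e.g. a key entity or source the subgoal resolves to


def score_trajectory(agent_steps: list[str], subgoals: list[Subgoal],
                     final_answer: str, gold_answer: str) -> dict:
    """Fraction of subgoals whose expected evidence appears in the agent's
    intermediate steps, plus exact-match accuracy on the final answer."""
    hits = sum(
        any(sg.expected_evidence.lower() in step.lower() for step in agent_steps)
        for sg in subgoals
    )
    return {
        "subgoal_recall": hits / len(subgoals) if subgoals else 0.0,
        "answer_correct": final_answer.strip().lower() == gold_answer.strip().lower(),
    }
```

Separating the two scores is what makes the evaluation "fine-grained": an agent can be credited for finding the right intermediate evidence even when the final answer is wrong, and vice versa.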
Impact Analysis
Exposes critical gaps in MLLM capabilities for real-world web search. Enables reproducible assessment through publicly searchable evidence and drives improvements in multimodal agents. Pushes evaluation beyond the scope of existing browsing benchmarks.
Technical Details
Evidence is interleaved across modalities and web pages, so answering a question requires hopping between text and images from multiple sources. OmniSeeker integrates web search and visual perception tools in a unified agent framework. Questions and subgoals are expert-validated, enabling fine-grained process analysis.
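A rough sketch of how such a unified browsing agent might wire web search and visual perception into one loop. This is illustrative only: the tool names (web_search, fetch_page, inspect_image), the call_mllm interface, and the action format are hypothetical stand-ins, not OmniSeeker's actual API.

```python
from typing import Callable


def browse_and_answer(question: str,
                      call_mllm: Callable[[list[dict]], dict],
                      web_search: Callable[[str], list[str]],
                      fetch_page: Callable[[str], str],
                      inspect_image: Callable[[str], str],
                      max_steps: int = 10) -> str:
    """Iteratively let an MLLM pick a tool, append the observation, and stop
    once it emits an answer or exhausts the step budget."""
    context: list[dict] = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # The MLLM decides the next action: search, read a page, look at an
        # image, or answer once the interleaved evidence suffices.
        action = call_mllm(context)
        if action["tool"] == "web_search":
            results = web_search(action["query"])
            context.append({"role": "tool", "content": "\n".join(results)})
        elif action["tool"] == "fetch_page":
            context.append({"role": "tool", "content": fetch_page(action["url"])})
        elif action["tool"] == "inspect_image":
            # Visual perception step: describe or OCR an image found on a page.
            context.append({"role": "tool", "content": inspect_image(action["image_url"])})
        else:  # "answer"
            return action["content"]
    return "no answer within step budget"
```

The design point the benchmark probes is exactly this interleaving: an agent that searches well but never inspects images (or the reverse) cannot assemble the cross-modal evidence the questions require.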