Ant's UI-Venus-1.5 Tops SOTA for GUI Agents

💡 Open-source SOTA GUI agent runs 40+ Chinese apps end-to-end
⚡ 30-Second TL;DR
What Changed
Ant Group's UI-Venus-1.5 sets new state-of-the-art results across GUI agent benchmarks, including 77.6% on AndroidWorld and 76.4% on OSWorld-G-R.
Why It Matters
Accelerates deployable GUI agents for real apps, lowering barriers for practical AI assistants in mobile/web automation.
What To Do Next
Clone UI-Venus-1.5 from GitHub and fine-tune it for custom Chinese-app automation; a minimal inference sketch follows.
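As a starting point, here is a minimal inference sketch in the Hugging Face `transformers` style. The hub id, prompt format, and action syntax below are illustrative assumptions, not the confirmed API; check the official UI-Venus-1.5 release for exact usage.

```python
# Minimal sketch: ask a UI-Venus-style agent for its next action on a screenshot.
# MODEL_ID is hypothetical; verify the real repo id on the Hugging Face Hub.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "inclusionAI/UI-Venus-1.5-8B"  # hypothetical id for illustration

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", trust_remote_code=True
)

screenshot = Image.open("screen.png")  # current GUI state, pixels only
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Open Settings and enable dark mode."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=screenshot, return_tensors="pt").to(model.device)

# The agent emits its next action as text, e.g. a click with pixel coordinates.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```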
🧠 Deep Insight
🔑 Enhanced Key Takeaways
- UI-Venus-1.5-30B-A3B achieves state-of-the-art performance across multiple GUI agent benchmarks, reaching 77.6% on AndroidWorld, 76.4% on OSWorld-G-R, and 21.5% on VenusBench-Mobile, consistently outperforming specialized GUI models like MAI-UI-32B and general-purpose VLMs like GPT-4o[1][2]
- The model uses a multi-stage training pipeline: a massive mid-training phase over 10 billion tokens across 30+ datasets to establish foundational GUI semantics, followed by scaled online reinforcement learning with full-trajectory rollouts[2]
- UI-Venus-1.5 is built on the Qwen3-VL architecture and uses a model-merge strategy that synthesizes specialized grounding, web, and mobile capabilities into a single unified checkpoint (see the merge sketch after this list), enabling platform-agnostic operation across native apps, websites, and remote desktops[2]
- The unified end-to-end approach addresses the gap between individual step accuracy and overall task completion by validating full interaction sequences rather than optimizing isolated steps[2]
- Available at multiple parameter scales (2B, 8B, and 30B-A3B variants) with corresponding performance improvements, demonstrating a scalable architecture suitable for diverse deployment scenarios[1]
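The digest describes the merge only at a high level. As an illustration of the general technique, the sketch below performs a plain weight-space average ("model soup" style linear interpolation) of three specialist checkpoints; the file names, merge weights, and use of simple averaging are all assumptions, not the published method.

```python
# Illustrative weight-space merge of specialist checkpoints (assumed approach).
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of parameters shared by all checkpoints."""
    assert abs(sum(weights) - 1.0) < 1e-6, "merge weights should sum to 1"
    return {
        name: sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

# Hypothetical specialist checkpoints: grounding, web navigation, mobile.
paths = ["grounding.pt", "web.pt", "mobile.pt"]
specialists = [torch.load(p, map_location="cpu") for p in paths]
unified = merge_state_dicts(specialists, weights=[0.4, 0.3, 0.3])
torch.save(unified, "ui_venus_unified.pt")
```

This assumes all three checkpoints share identical parameter names and shapes, which holds when they are fine-tuned from the same base model.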
📊 Competitor Analysis
| Model | Developer | Key Benchmark Result | Architecture | Training Approach |
|---|---|---|---|---|
| UI-Venus-1.5-30B-A3B | Ant Group | 77.6% (AndroidWorld) | Qwen3-VL based | Mid-training (10B tokens) + online RL with full-trajectory rollouts |
| MAI-UI-32B | Competitor | 73.9% (OSWorld-G-R) | Specialized GUI model | Task-specific optimization |
| GTA1-32B | Competitor | 72.2% (OSWorld-G-R) | Specialized GUI model | Task-specific optimization |
| GPT-4o | OpenAI | Comparable or lower on GUI tasks | General-purpose VLM | General vision-language pretraining |
| Qwen3-VL | Alibaba | General VLM baseline | General-purpose VLM | General vision-language pretraining |
🛠️ Technical Deep Dive
- **Architecture Foundation:** Built on Qwen3-VL, a large vision-language model, adapted specifically for GUI understanding and navigation tasks
- **Multi-Stage Training Pipeline:** (1) a mid-training phase with 10 billion tokens across 30+ datasets for GUI semantic understanding; (2) online reinforcement learning with full-trajectory rollouts for long-horizon task alignment (see the reward sketch after this list)
- **Model Merge Strategy:** Synthesizes three specialized capabilities (grounding, web navigation, and mobile interaction) into a single unified checkpoint through model merging
- **Platform Agnosticism:** Operates at the pixel level, decoupling the agent from operating-system specifics and enabling operation across native applications, web interfaces, and remote desktops
- **Performance Variants:** Available at 2B (8.7% VenusBench-Mobile success), 8B (16.1%), and 30B-A3B (21.5%) parameter scales, with performance improving with model size
- **Benchmark Coverage:** Evaluated on AndroidWorld (77.6%), AndroidLab (55.1%/68.1%), OSWorld-G-R (76.4%), OSWorld-G (70.6%), VenusBench-Mobile (21.5%), and WebVoyager (76.0%)
- **Key Innovation:** Addresses the step-level vs. task-level accuracy disparity by optimizing full interaction sequences rather than individual steps, improving real-world task completion rates
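To make the step-level vs. task-level distinction concrete, here is a toy sketch of trajectory-level reward assignment. The digest does not publish UI-Venus-1.5's actual RL objective, so the outcome-only reward scheme below is an assumption used purely for illustration.

```python
# Toy sketch: score a rollout by final task outcome, not per-step accuracy.
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes  # raw pixels observed before acting
    action: str        # e.g. "click(412, 880)" (hypothetical action syntax)

def trajectory_rewards(steps: list[Step], task_succeeded: bool) -> list[float]:
    """Assign every step the terminal task outcome (assumed reward scheme).

    A step-level objective would grade each action against a per-step label;
    a full-trajectory objective instead validates the whole rollout, so a
    locally plausible click that derails the task earns no credit.
    """
    terminal = 1.0 if task_succeeded else 0.0
    return [terminal] * len(steps)
```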
🔮 Future Implications
AI analysis grounded in cited sources.
UI-Venus-1.5 represents a significant advance in autonomous GUI agents, with implications across multiple sectors. The unified end-to-end architecture establishes a new performance ceiling for GUI automation, potentially accelerating adoption of AI-driven automation in enterprise software, mobile applications, and web services. The open-source release democratizes access to state-of-the-art GUI agent technology, enabling broader research and commercial applications.

The platform-agnostic design, operating at the pixel level regardless of the underlying OS or application type, points toward universal automation agents that can handle heterogeneous digital environments. The multi-stage training approach, combining massive mid-training with online reinforcement learning, may become a standard paradigm for training long-horizon task agents.

Support for 40+ mainstream Chinese applications suggests the emergence of localized AI agent ecosystems, with implications for regional technology sovereignty and development pathways outside Western AI infrastructure. The demonstrated scalability across parameter sizes (2B to 30B) offers deployment options ranging from edge devices to cloud infrastructure, potentially enabling broad integration into existing software stacks.
📎 Sources (5)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: 机器之心