Ant's UI-Venus-1.5 Tops SOTA for GUI Agents
🧠 #gui-agent #mobile-automation #sota-benchmark


💡 Open-source SOTA GUI agent runs 40+ Chinese apps end-to-end

⚡ 30-Second TL;DR

What changed

Ant Group open-sourced UI-Venus-1.5, which sets new state-of-the-art results across GUI agent benchmarks

Why it matters

Accelerates deployable GUI agents for real apps, lowering barriers for practical AI assistants in mobile/web automation.

What to do next

Clone UI-Venus-1.5 from GitHub and fine-tune for custom Chinese app automation.
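
As a starting point, the checkpoint can presumably be loaded like any other Hugging Face vision-language model. The repo ID, prompt format, and generation settings below are assumptions for illustration, not the official quickstart; check the project's GitHub and Hugging Face pages for the real identifiers.

```python
# Minimal loading sketch. MODEL_ID is a guess -- use the ID published
# on the project's Hugging Face page.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "inclusionAI/UI-Venus-1.5-30B-A3B"  # hypothetical repo ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs
)

# Ask the agent for its next action given a screenshot and an instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("screenshot.png")},
        {"type": "text", "text": "Open Settings and enable dark mode."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```

Since the digest says the model is Qwen3-VL based, the usual fine-tuning recipes for Qwen-family VLMs (e.g. LoRA via PEFT) should apply for adapting it to custom apps.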

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 5 cited sources.

🔑 Key Takeaways

  • UI-Venus-1.5-30B-A3B achieves state-of-the-art performance across multiple GUI agent benchmarks, reaching 77.6% on AndroidWorld, 76.4% on OSWorld-G-R, and 21.5% on VenusBench-Mobile, consistently outperforming specialized GUI models like MAI-UI-32B and general-purpose VLMs like GPT-4o[1][2]
  • The model employs a multi-stage training pipeline: a mid-training phase that consumes 10 billion tokens across 30+ datasets to establish foundational GUI semantics, followed by scaled online reinforcement learning with full-trajectory rollouts (a toy sketch of this objective follows the list)[2]
  • UI-Venus-1.5 is built on the Qwen3-VL architecture and uses a model merge strategy that synthesizes specialized grounding, web, and mobile capabilities into a single unified checkpoint, enabling platform-agnostic operation across native apps, websites, and remote desktops[2]
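
To make the full-trajectory idea concrete, here is a toy contrast between step-level and trajectory-level policy-gradient objectives. The tensor shapes and reward signals are invented for illustration; this is not the training code from the report.

```python
# Toy illustration of step-level vs. trajectory-level credit assignment
# (assumed shapes and rewards; not the report's actual training code).
import torch

torch.manual_seed(0)
log_probs = torch.randn(12)    # one action log-prob per step of a 12-step rollout
task_success = 1.0             # sparse, trajectory-level outcome: task completed

# Step-level objective: each action is reinforced by its own local signal,
# which can reward individually "correct" steps on trajectories that still fail.
step_rewards = torch.rand(12)  # hypothetical per-step correctness signals
step_loss = -(log_probs * step_rewards).sum()

# Full-trajectory objective: every action shares the episode outcome,
# aligning optimization with end-to-end task completion.
traj_loss = -(log_probs.sum() * task_success)
```
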
📊 Competitor Analysis
| Model | Developer | Key Benchmark Result | Architecture | Training Approach |
|---|---|---|---|---|
| UI-Venus-1.5-30B-A3B | Ant Group | 77.6% (AndroidWorld) | Qwen3-VL based | Mid-training (10B tokens) + online RL with full-trajectory rollouts |
| MAI-UI-32B | Competitor | 73.9% (OSWorld-G-R) | Specialized GUI | Task-specific optimization |
| GTA1-32B | Competitor | 72.2% (OSWorld-G-R) | Specialized GUI | Task-specific optimization |
| GPT-4o | OpenAI | Comparable/lower on GUI tasks | General-purpose VLM | General vision-language pretraining |
| Qwen3-VL | Alibaba | General VLM baseline | General-purpose VLM | General vision-language pretraining |

🛠️ Technical Deep Dive

  • Architecture Foundation: Built on Qwen3-VL, a large vision-language model, adapted specifically for GUI understanding and navigation tasks
  • Multi-Stage Training Pipeline: (1) a mid-training phase with 10 billion tokens across 30+ datasets for GUI semantic understanding; (2) online reinforcement learning with full-trajectory rollouts for long-horizon task alignment
  • Model Merge Strategy: Synthesizes three specialized capabilities (grounding, web navigation, and mobile interaction) into a single unified checkpoint through model merging (a minimal merging sketch follows this list)
  • Platform Agnosticism: Operates at the pixel level, decoupling the agent from operating-system specifics and enabling operation across native applications, web interfaces, and remote desktops
  • Performance Variants: Available at 2B (8.7% VenusBench-Mobile success rate), 8B (16.1%), and 30B-A3B (21.5%) parameter scales, with performance improving at each scale
  • Benchmark Coverage: Evaluated on AndroidWorld (77.6%), AndroidLab (55.1%/68.1%), OSWorld-G-R (76.4%), OSWorld-G (70.6%), VenusBench-Mobile (21.5%), and WebVoyager (76.0%)
  • Key Innovation: Addresses the disparity between step-level and task-level accuracy by optimizing full interaction sequences rather than individual steps, improving real-world task completion rates
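
The digest names a model merge but not the recipe, so the sketch below assumes the simplest baseline: uniform linear averaging of the three specialists' weights. The checkpoint paths and mixing coefficients are placeholders.

```python
import torch

# Hypothetical checkpoint paths for the three specialists named in the report.
SPECIALISTS = ["grounding.pt", "web.pt", "mobile.pt"]
COEFFS = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform mixing; the real recipe is unknown

state_dicts = [torch.load(path, map_location="cpu") for path in SPECIALISTS]

# Linear merge: every parameter of the unified checkpoint is a weighted
# average of the corresponding parameters in the specialists.
merged = {
    name: sum(c * sd[name].float() for c, sd in zip(COEFFS, state_dicts))
    for name in state_dicts[0]
}

torch.save(merged, "ui-venus-merged.pt")
```

Linear merging only makes sense when the specialists are fine-tuned from the same base (here, Qwen3-VL), so that parameter names and shapes line up across checkpoints.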

🔮 Future Implications

AI analysis grounded in cited sources.

UI-Venus-1.5 represents a significant advancement in autonomous GUI agents with implications across multiple sectors. The unified, end-to-end architecture establishes a new performance ceiling for GUI automation, potentially accelerating adoption of AI-driven automation in enterprise software, mobile applications, and web services. The open-source release democratizes access to state-of-the-art GUI agent technology, enabling broader research and commercial applications.

The platform-agnostic design (operating at the pixel level regardless of the underlying OS or application type) suggests future convergence toward universal automation agents that can seamlessly handle heterogeneous digital environments. The multi-stage training approach, combining massive mid-training with online reinforcement learning, may become a standard paradigm for training long-horizon task agents.

Support for 40+ mainstream Chinese applications points to localized AI agent ecosystems, with implications for regional technology sovereignty and alternative development pathways outside Western AI infrastructure. The demonstrated scalability across parameter sizes (2B-30B) suggests viable deployment options ranging from edge devices to cloud infrastructure, potentially enabling widespread integration into existing software stacks.

⏳ Timeline

2026-02
Ant Group open-sources UI-Venus-1.5 with technical report, achieving state-of-the-art performance across multiple GUI agent benchmarks

📎 Sources (5)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. arxiv.org
  2. youtube.com
  3. alphaxiv.org
  4. lonepatient.top
  5. documentation.suse.com

Ant Group open-sourced UI-Venus-1.5, a high-performance end-to-end GUI agent topping SOTA benchmarks. It unifies grounding, mobile, and web handling in one model, supports 40+ mainstream Chinese apps, and addresses knowledge gaps, sim-to-real transfer, and multi-model coordination issues. Resources include GitHub code, Hugging Face models, and a technical report.

Key Points

  1. SOTA performance in GUI agent benchmarks
  2. A single model handles grounding, mobile, and web scenarios
  3. Supports 40+ mainstream Chinese mobile apps
  4. Addresses knowledge gaps, sim-to-real transfer, and multi-model coordination issues
  5. Fully open-sourced with code, models, and arXiv report

Impact Analysis

Accelerates deployable GUI agents for real apps, lowering barriers for practical AI assistants in mobile/web automation.

Technical Details

Follows a 'high-performance, practical' design philosophy: a single end-to-end model processes tasks without complex multi-model frameworks. GUI-specific challenges, such as obscure icons and app-specific logic, are tackled through unified training.
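
To illustrate how framework-free, pixel-only operation keeps the agent loop simple, here is a minimal sketch. The `model` and `device` interfaces and the CLICK/DONE action grammar are placeholders, not UI-Venus's actual action space.

```python
import re

def run_episode(model, device, instruction, max_steps=30):
    """Drive a device from screenshots alone until the model reports DONE."""
    for _ in range(max_steps):
        screenshot = device.capture()  # current screen as an image
        # One forward pass per step: pixels + instruction in, action text out.
        action = model.predict(screenshot, instruction)
        if action.strip() == "DONE":
            return True
        match = re.match(r"CLICK\((\d+),\s*(\d+)\)", action)
        if match:
            # Coordinates are raw pixels, so the same loop works for Android
            # apps, websites, and remote desktops alike.
            device.tap(int(match.group(1)), int(match.group(2)))
        # TYPE, SWIPE, BACK, etc. would be parsed and dispatched the same way.
    return False
```
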

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 机器之心