Reddit r/MachineLearning • collected in 11h
ClawBench: AI Agents at 33% on Real Tasks
💡 New benchmark shows top agents failing 67% of tasks on live websites
⚡ 30-Second TL;DR
What Changed
153 tasks on 144 live production websites
Why It Matters
Quantifies how far browser agents remain from reliable real-world use (a 33% success rate), motivating work on more robust agent designs.
What To Do Next
Install `clawbench-eval` via pip and run your agent on the ClawBench dataset.
Who should care: Researchers & Academics
🧠 Deep Insight
AI-generated analysis for this event.
Enhanced Key Takeaways
- ClawBench uses a 'sandbox-first' architecture that isolates agent interactions within ephemeral browser environments to prevent side effects on live production databases.
- The benchmark incorporates a dynamic 'human-in-the-loop' verification layer, where human annotators validate success criteria for ambiguous tasks that automated scripts fail to parse.
- The dataset includes a specific 'anti-hallucination' subset designed to test agent resilience against deceptive UI elements, such as fake 'download' buttons or misleading navigation prompts.
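The 'anti-hallucination' idea above can be sketched as a simple heuristic: flag links whose visible text baits an action (e.g. "Download") but whose target leaves the current site. This is an illustrative check only, not ClawBench's actual detector; the function name, bait-word list, and element format are assumptions.

```python
from urllib.parse import urlparse

def flag_deceptive_links(elements, page_domain):
    """Illustrative heuristic: flag link elements whose visible text
    promises an action (e.g. 'Download') but whose target points
    off-site, a common pattern for fake download buttons."""
    bait_words = {"download", "continue", "play", "start"}
    flagged = []
    for el in elements:
        text = el.get("text", "").lower()
        host = urlparse(el.get("href", "")).netloc
        # Off-domain action-bait links are treated as deceptive.
        if any(w in text for w in bait_words) and host and host != page_domain:
            flagged.append(el)
    return flagged
```

For example, on `docs.example.com`, a "Download now" link pointing at `ads.evil.net` would be flagged, while a same-domain download link would pass.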
Competitor Analysis
| Feature | ClawBench | WebArena | OSWorld |
|---|---|---|---|
| Environment | Live Production Sites | Simulated/Sandboxed | OS-level/Desktop |
| Task Scope | 153 Everyday Tasks | 812 Web Tasks | System/App Tasks |
| Evaluation | Human Ground-Truth | Automated/Scripted | Automated/Scripted |
| Focus | Real-world reliability | Web-based reasoning | Cross-application workflows |
🛠️ Technical Deep Dive
- Agent-Browser Interface: Uses a custom CDP (Chrome DevTools Protocol) wrapper to inject accessibility trees and DOM snapshots into the model context window.
- Success Metric: Employs a multi-stage verification pipeline: (1) DOM state validation, (2) visual screenshot comparison, and (3) final human verification for edge cases.
- Safety Protocol: Implements a 'read-only' proxy layer for all agent requests to prevent unauthorized data modification or account deletion on live sites.
- Model Input: Standardizes inputs into a serialized JSON representation of the accessibility tree, stripping non-essential CSS and scripts to optimize token usage.
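The model-input step above can be sketched as a tree pruner: drop script/style nodes, keep only agent-relevant attributes, and emit compact JSON. This is a minimal sketch assuming a DOM-like dict representation; the function names, tag list, and attribute allow-list are illustrative, not ClawBench's actual implementation.

```python
import json

STRIP_TAGS = {"script", "style", "link", "meta"}   # non-essential for the agent
KEEP_ATTRS = {"role", "aria-label", "alt", "href", "name", "value"}

def serialize_dom(node):
    """Recursively convert a DOM-like dict into a compact dict,
    dropping script/style nodes and all attributes except the
    handful that matter for agent decision-making."""
    if node.get("tag") in STRIP_TAGS:
        return None
    out = {"tag": node.get("tag")}
    text = node.get("text", "").strip()
    if text:
        out["text"] = text
    attrs = {k: v for k, v in node.get("attrs", {}).items() if k in KEEP_ATTRS}
    if attrs:
        out["attrs"] = attrs
    children = [c for c in map(serialize_dom, node.get("children", [])) if c]
    if children:
        out["children"] = children
    return out

def to_model_input(root):
    """Serialize with no whitespace to minimize token usage."""
    return json.dumps(serialize_dom(root), separators=(",", ":"))
```

The compact-separators choice matters in practice: whitespace in a large DOM serialization can account for a meaningful fraction of the context window.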
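The read-only safety protocol described above amounts to a request-filtering policy: pass non-mutating HTTP methods, block everything else unless explicitly allow-listed. The sketch below shows the policy decision only (not the proxy plumbing); the allow-list entry and function name are hypothetical.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # non-mutating per HTTP semantics

# Hypothetical allow-list: mutating endpoints a task explicitly needs,
# e.g. a search form that submits its query via POST.
ALLOWED_MUTATIONS = {("POST", "/search")}

def is_request_allowed(method, path):
    """Read-only policy: safe methods pass through; any mutating
    request is blocked unless its (method, path) is allow-listed,
    preventing data modification or account deletion on live sites."""
    method = method.upper()
    if method in SAFE_METHODS:
        return True
    return (method, path) in ALLOWED_MUTATIONS
```

A proxy built on this policy would return an HTTP 403 for blocked requests rather than forwarding them to the live site.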
🔮 Future Implications
AI analysis grounded in cited sources
Agent success rates will plateau below 50% until models achieve native long-term memory.
Current benchmarks show that agents struggle with multi-step tasks that require maintaining state across session timeouts or complex authentication flows.
Standardized web-agent benchmarks will shift toward 'adversarial' environments.
As agents become more capable, benchmarks must include dynamic, changing UI elements to prevent models from overfitting to static website structures.
⏳ Timeline
2025-11
Initial release of ClawBench alpha with 50 tasks.
2026-02
Expansion to 153 tasks and integration of live production site testing.
2026-04
Public release of the interactive leaderboard featuring Claude Sonnet 4.6 and GLM-5.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning
