Reddit r/MachineLearning • collected in 11h
ClawBench: AI Agents at 33% on Real Tasks
💡 New benchmark shows top agents failing 67% of tasks on live websites
⚡ 30-Second TL;DR
What Changed
153 tasks on 144 live production websites
Why It Matters
Quantifies how far browser agents remain from reliable real-world use (a 33% success rate), motivating work on more robust agent designs.
What To Do Next
Install `clawbench-eval` via pip and run your agent on the ClawBench dataset.
Who should care: Researchers & Academics
🧠 Deep Insight
AI-generated analysis for this event.
Enhanced Key Takeaways
- ClawBench uses a 'sandbox-first' architecture that isolates agent interactions within ephemeral browser environments to prevent side effects on live production databases.
- The benchmark incorporates a dynamic 'human-in-the-loop' verification layer, where human annotators validate success criteria for ambiguous tasks that automated scripts fail to parse.
- The dataset includes a specific 'anti-hallucination' subset designed to test agent resilience against deceptive UI elements, such as fake 'download' buttons or misleading navigation prompts.
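The 'anti-hallucination' idea above can be sketched as a simple heuristic: flag links whose visible text baits an action (e.g. "Download") but whose target leaves the current site. This is an illustrative check only, not ClawBench's actual detector; the function name, bait-word list, and element format are assumptions.

```python
from urllib.parse import urlparse

def flag_deceptive_links(elements, page_domain):
    """Illustrative heuristic: flag link elements whose visible text
    promises an action (e.g. 'Download') but whose target points
    off-site, a common pattern for fake download buttons."""
    bait_words = {"download", "continue", "play", "start"}
    flagged = []
    for el in elements:
        text = el.get("text", "").lower()
        host = urlparse(el.get("href", "")).netloc
        # Off-domain action-bait links are treated as deceptive.
        if any(w in text for w in bait_words) and host and host != page_domain:
            flagged.append(el)
    return flagged
```

For example, on `docs.example.com`, a "Download now" link pointing at `ads.evil.net` would be flagged, while a same-domain download link would pass.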
Competitor Analysis
| Feature | ClawBench | WebArena | OSWorld |
|---|---|---|---|
| Environment | Live Production Sites | Simulated/Sandboxed | OS-level/Desktop |
| Task Scope | 153 Everyday Tasks | 812 Web Tasks | System/App Tasks |
| Evaluation | Human Ground-Truth | Automated/Scripted | Automated/Scripted |
| Focus | Real-world reliability | Web-based reasoning | Cross-application workflows |
🛠️ Technical Deep Dive
- Agent-Browser Interface: Uses a custom CDP (Chrome DevTools Protocol) wrapper to inject accessibility trees and DOM snapshots into the model context window.
- Success Metric: Employs a multi-stage verification pipeline: (1) DOM state validation, (2) visual screenshot comparison, and (3) final human verification for edge cases.
- Safety Protocol: Implements a 'read-only' proxy layer for all agent requests to prevent unauthorized data modification or account deletion on live sites.
- Model Input: Standardizes inputs into a serialized JSON representation of the accessibility tree, stripping non-essential CSS and scripts to optimize token usage.
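The model-input step above can be sketched as a tree pruner: drop script/style nodes, keep only agent-relevant attributes, and emit compact JSON. This is a minimal sketch assuming a DOM-like dict representation; the function names, tag list, and attribute allow-list are illustrative, not ClawBench's actual implementation.

```python
import json

STRIP_TAGS = {"script", "style", "link", "meta"}   # non-essential for the agent
KEEP_ATTRS = {"role", "aria-label", "alt", "href", "name", "value"}

def serialize_dom(node):
    """Recursively convert a DOM-like dict into a compact dict,
    dropping script/style nodes and all attributes except the
    handful that matter for agent decision-making."""
    if node.get("tag") in STRIP_TAGS:
        return None
    out = {"tag": node.get("tag")}
    text = node.get("text", "").strip()
    if text:
        out["text"] = text
    attrs = {k: v for k, v in node.get("attrs", {}).items() if k in KEEP_ATTRS}
    if attrs:
        out["attrs"] = attrs
    children = [c for c in map(serialize_dom, node.get("children", [])) if c]
    if children:
        out["children"] = children
    return out

def to_model_input(root):
    """Serialize with no whitespace to minimize token usage."""
    return json.dumps(serialize_dom(root), separators=(",", ":"))
```

The compact-separators choice matters in practice: whitespace in a large DOM serialization can account for a meaningful fraction of the context window.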
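The read-only safety protocol described above amounts to a request-filtering policy: pass non-mutating HTTP methods, block everything else unless explicitly allow-listed. The sketch below shows the policy decision only (not the proxy plumbing); the allow-list entry and function name are hypothetical.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # non-mutating per HTTP semantics

# Hypothetical allow-list: mutating endpoints a task explicitly needs,
# e.g. a search form that submits its query via POST.
ALLOWED_MUTATIONS = {("POST", "/search")}

def is_request_allowed(method, path):
    """Read-only policy: safe methods pass through; any mutating
    request is blocked unless its (method, path) is allow-listed,
    preventing data modification or account deletion on live sites."""
    method = method.upper()
    if method in SAFE_METHODS:
        return True
    return (method, path) in ALLOWED_MUTATIONS
```

A proxy built on this policy would return an HTTP 403 for blocked requests rather than forwarding them to the live site.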
🔮 Future Implications
AI analysis grounded in cited sources
Agent success rates will plateau below 50% until models achieve native long-term memory.
Current benchmarks show that agents struggle with multi-step tasks that require maintaining state across session timeouts or complex authentication flows.
Standardized web-agent benchmarks will shift toward 'adversarial' environments.
As agents become more capable, benchmarks must include dynamic, changing UI elements to prevent models from overfitting to static website structures.
⏳ Timeline
2025-11
Initial release of ClawBench alpha with 50 tasks.
2026-02
Expansion to 153 tasks and integration of live production site testing.
2026-04
Public release of the interactive leaderboard featuring Claude Sonnet 4.6 and GLM-5.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning
