
ClawBench: AI Agents at 33% on Real Tasks


💡 New benchmark shows top agents fail 67% of tasks on live websites

⚡ 30-Second TL;DR

What Changed

153 tasks on 144 live production websites

Why It Matters

Highlights how far current browser agents fall short on real-world tasks, motivating work on more reliable agents amid low success rates.

What To Do Next

Install clawbench-eval via pip and run your agent on the ClawBench dataset.
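The post only names the pip package; the module, task loader, and agent hook in this minimal sketch are assumptions for illustration, not a documented API.

```python
# Hypothetical sketch of running an agent against the benchmark.
# Only `pip install clawbench-eval` comes from the post; the module and
# function names below (clawbench_eval, load_tasks, run_task) are assumed.
from clawbench_eval import load_tasks, run_task  # assumed API

def my_agent(observation: dict) -> dict:
    """Placeholder agent: receives a page observation, returns an action."""
    return {"action": "noop"}

tasks = load_tasks(split="all")                       # assumed: the 153 live-site tasks
results = [run_task(task, agent=my_agent) for task in tasks]
success_rate = sum(r.success for r in results) / len(results)
print(f"Success rate: {success_rate:.1%}")            # top agents reportedly land around 33%
```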

Who should care: Researchers & Academics

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • ClawBench utilizes a 'sandbox-first' architecture that isolates agent interactions within ephemeral browser environments to prevent side effects on live production databases.
  • The benchmark incorporates a dynamic 'human-in-the-loop' verification layer, where human annotators validate success criteria for ambiguous tasks that automated scripts fail to parse.
  • The dataset includes a specific 'anti-hallucination' subset designed to test agent resilience against deceptive UI elements, such as fake 'download' buttons or misleading navigation prompts.
📊 Competitor Analysis

| Feature | ClawBench | WebArena | OSWorld |
|---|---|---|---|
| Environment | Live Production Sites | Simulated/Sandboxed | OS-level/Desktop |
| Task Scope | 153 Everyday Tasks | 812 Web Tasks | System/App Tasks |
| Evaluation | Human Ground-Truth | Automated/Scripted | Automated/Scripted |
| Focus | Real-world reliability | Web-based reasoning | Cross-application workflows |

๐Ÿ› ๏ธ Technical Deep Dive

  • Agent-Browser Interface: Uses a custom CDP (Chrome DevTools Protocol) wrapper to inject accessibility trees and DOM snapshots into the model context window (first sketch below).
  • Success Metric: Employs a multi-stage verification pipeline: (1) DOM state validation, (2) visual screenshot comparison, and (3) final human verification for edge cases (second sketch below).
  • Safety Protocol: Implements a 'read-only' proxy layer for all agent requests to prevent unauthorized data modification or account deletion on live sites (third sketch below).
  • Model Input: Standardizes inputs into a serialized JSON representation of the accessibility tree, stripping non-essential CSS and scripts to optimize token usage (see the first sketch below).
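A rough idea of how an accessibility-tree observation could be pulled over CDP and trimmed into compact JSON. This sketch uses Playwright's CDP session as a stand-in for ClawBench's custom wrapper, and the field selection is illustrative rather than the benchmark's actual schema.

```python
# Sketch: fetch the accessibility tree over CDP (via Playwright here, as a
# stand-in for ClawBench's custom wrapper) and serialize it to compact JSON.
import json
from playwright.sync_api import sync_playwright

def ax_tree_to_json(page) -> str:
    cdp = page.context.new_cdp_session(page)           # Chromium-only CDP session
    tree = cdp.send("Accessibility.getFullAXTree")     # raw accessibility nodes
    # Keep only the fields a model is likely to need; drop styling/script noise.
    slim = [
        {
            "id": node["nodeId"],
            "role": node.get("role", {}).get("value"),
            "name": node.get("name", {}).get("value"),
            "children": node.get("childIds", []),
        }
        for node in tree["nodes"]
        if not node.get("ignored", False)
    ]
    return json.dumps(slim, separators=(",", ":"))     # compact form to save tokens

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    observation = ax_tree_to_json(page)
    browser.close()
```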
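One way the three-stage verification described above could be wired together. The task fields (dom_predicate, reference_screenshot), the pixel-diff similarity measure, and the 0.95 threshold are assumptions for illustration, not ClawBench's published pipeline.

```python
# Sketch of the three-stage verification pipeline; checker fields and the
# similarity threshold are illustrative assumptions.
import io
from dataclasses import dataclass
from PIL import Image, ImageChops   # pillow, used here for a crude pixel diff

@dataclass
class Verdict:
    success: bool
    needs_human_review: bool = False

def screenshot_similarity(a: bytes, b: bytes) -> float:
    """Crude pixel-level similarity in [0, 1]; a real pipeline would be fuzzier."""
    img_a = Image.open(io.BytesIO(a)).convert("L")
    img_b = Image.open(io.BytesIO(b)).convert("L").resize(img_a.size)
    diff = ImageChops.difference(img_a, img_b).histogram()
    mismatched = sum(count for value, count in enumerate(diff) if value > 16)
    return 1.0 - mismatched / (img_a.width * img_a.height)

def verify(task, final_dom: str, screenshot: bytes) -> Verdict:
    # Stage 1: DOM state validation (expected element or text present).
    if not task.dom_predicate(final_dom):
        return Verdict(success=False)
    # Stage 2: visual comparison against a reference screenshot.
    if screenshot_similarity(screenshot, task.reference_screenshot) >= 0.95:
        return Verdict(success=True)
    # Stage 3: ambiguous cases escalate to a human annotator.
    return Verdict(success=False, needs_human_review=True)
```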
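The 'read-only' guarantee could be approximated with an intercepting proxy that rejects state-changing HTTP methods before they reach the live site. The mitmproxy addon below is an assumed implementation, not ClawBench's actual proxy layer.

```python
# Sketch of a "read-only" guard as a mitmproxy addon: state-changing HTTP
# methods are rejected before they ever reach the production site.
# Run with: mitmdump -s readonly_proxy.py
from mitmproxy import http

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def request(flow: http.HTTPFlow) -> None:
    if flow.request.method.upper() not in SAFE_METHODS:
        # Short-circuit with a 403 so nothing mutating hits the live site.
        flow.response = http.Response.make(
            403,
            b"Blocked by read-only benchmark proxy",
            {"Content-Type": "text/plain"},
        )
```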

🔮 Future Implications (AI analysis grounded in cited sources)

  • Agent success rates will plateau below 50% until models achieve native long-term memory. Current benchmarks show that agents struggle with multi-step tasks that require maintaining state across session timeouts or complex authentication flows.
  • Standardized web-agent benchmarks will shift toward 'adversarial' environments. As agents become more capable, benchmarks must include dynamic, changing UI elements to prevent models from overfitting to static website structures.

โณ Timeline

2025-11: Initial release of ClawBench alpha with 50 tasks.
2026-02: Expansion to 153 tasks and integration of live production site testing.
2026-04: Public release of the interactive leaderboard featuring Claude Sonnet 4.6 and GLM-5.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning ↗