ClawBench evaluates agents on 153 tasks across 144 live production websites
A new evaluation framework tests agent performance across 144 real production websites without causing side effects. Claude Sonnet 4.6 achieved 33.3% on the benchmark, establishing a baseline for measuring agent capability on real-world tasks.
ClawBench is an evaluation framework spanning 153 tasks across 144 live production websites in 15 categories, including completing purchases, booking appointments, and submitting job applications. Unlike prior benchmarks that ran in sandboxes, ClawBench operates on real production sites and intercepts only the final submission request to keep evaluation safe without real-world side effects. Claude Sonnet 4.6 achieved the best frontier-model score at 33.3%.
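The safety mechanism, intercepting the final submission so a live task never completes, can be pictured with a short sketch. The snippet below is a hypothetical illustration using Playwright's route interception, not ClawBench's actual implementation; the URL pattern and handler names are assumptions.

```python
# Hypothetical sketch: block the final submission request so a live-site task
# can be graded without real-world side effects. ClawBench's actual mechanism
# may differ; the endpoint pattern below is illustrative only.
from playwright.sync_api import sync_playwright

SUBMIT_URL_PATTERN = "**/checkout/submit"  # hypothetical final-submission endpoint

captured = {}

def block_final_submission(route, request):
    # Record what the agent tried to submit, then abort the request so the
    # purchase/booking/application never actually goes through.
    captured["method"] = request.method
    captured["post_data"] = request.post_data
    route.abort()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route(SUBMIT_URL_PATTERN, block_final_submission)
    # ... the agent drives the page here: fills forms, clicks "Place order", etc.
    # The blocked request's payload can then be compared against the task spec.
    browser.close()
```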
Why real-world testing matters
Sandbox environments often fail to capture the complexity of production systems: dynamic content, rate limiting, authentication flows, and unexpected UI variations. ClawBench's approach of running live tasks while blocking final submissions provides a more realistic measure of agent robustness. This matters because agents trained on synthetic data often fail when deployed against actual websites.
Training infrastructure for 2026
If 2025 was the year of the computer-use agent, 2026 will be the year of computer-use agent training, and training requires verifiers. ClawBench provides the kind of concrete, measurable feedback loop that frontier labs need to iterate on agent architectures. The 33.3% baseline also leaves clear room for improvement, signaling that agent capability on real-world tasks remains far from saturated.
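To make the verifier idea concrete, here is a minimal sketch of how a blocked submission could be scored into the kind of scalar feedback a training loop consumes. The names (TaskSpec, grade_submission) and the exact-match grading rule are assumptions for illustration, not ClawBench's API.

```python
# Hypothetical verifier sketch: grade the payload an agent attempted to submit
# against a task specification and return a binary reward signal.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    required_fields: dict  # e.g. {"item_id": "B0123", "quantity": "2"}

def grade_submission(spec: TaskSpec, submitted_fields: dict) -> float:
    # Reward 1.0 only if every required field matches what the task asked for.
    ok = all(submitted_fields.get(k) == v for k, v in spec.required_fields.items())
    return 1.0 if ok else 0.0

# Example: the intercepted form data from the blocked request gets scored here.
spec = TaskSpec(required_fields={"item_id": "B0123", "quantity": "2"})
print(grade_submission(spec, {"item_id": "B0123", "quantity": "2"}))  # 1.0
```

A scalar signal like this is what an RL-style training loop over computer-use agents would optimize against, which is why benchmarks that double as verifiers matter for 2026.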