ClawBench evaluates agents on 153 tasks across 144 live production websites
A new evaluation framework tests agents on 153 tasks across 144 live production websites without causing real-world side effects. Claude Sonnet 4.6 scored 33.3%, the best result among frontier models and a baseline for measuring agent capability.
ClawBench is an evaluation framework spanning 153 tasks across 144 live production websites in 15 categories, including completing purchases, booking appointments, and submitting job applications. Unlike prior benchmarks that ran in sandboxes, ClawBench operates on real production sites and intercepts only the final submission request to keep evaluation safe without real-world side effects. Claude Sonnet 4.6 achieved the best frontier-model score at 33.3%.
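To make the interception idea concrete, here is a minimal sketch of how "block only the final submission" might work. It is an illustration, not ClawBench's actual implementation: the `Request` and `SubmissionInterceptor` classes, the per-task rule (HTTP method plus URL pattern), and the `shop.example` URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Request:
    """Hypothetical stand-in for an outgoing browser request."""
    method: str
    url: str
    body: dict = field(default_factory=dict)


@dataclass
class SubmissionInterceptor:
    """Assumed per-task rule naming the one request that counts as final."""
    method: str
    url_pattern: str
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> str:
        """Return "block" for the final submission, "forward" otherwise."""
        is_final = (
            request.method.upper() == self.method.upper()
            and fnmatch(request.url, self.url_pattern)
        )
        if is_final:
            # Keep the payload for scoring, but never send it to the site.
            self.captured.append(request)
            return "block"
        # Every other request reaches the live site unchanged.
        return "forward"


interceptor = SubmissionInterceptor("POST", "https://shop.example/checkout/confirm*")
traffic = [
    Request("GET", "https://shop.example/cart"),
    Request("POST", "https://shop.example/cart/add", {"sku": "A1"}),
    Request("POST", "https://shop.example/checkout/confirm", {"total": "19.99"}),
]
decisions = [interceptor.handle(r) for r in traffic]
print(decisions)
```

In this sketch the agent browses and fills in forms against the real site, and only the request matching the rule is held back, so the captured payload can be inspected without an order actually being placed.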
Why real-world testing matters