Research

Benchmarking web agents on cognitive primitives

70 tasks that test whether AI agents can navigate complex web interfaces — requiring memory, planning, backtracking, and adversarial robustness, not just clicking buttons.


70 tasks
12 cognitive primitives
5 difficulty tiers

Environment

Agents see structure, not pixels

Each task presents a fully interactive Gmail simulation. The agent perceives the interface as an indexed accessibility tree and acts through structured commands.

accessibility tree · gmail_thread_detective
[1] button "Compose"
[2] navigation "Inbox (23)"
[3] navigation "Starred"
[4] navigation "Sent"
[5] row "Dr. Sarah Chen · Re: Lab meeting rescheduled — Hi team, given the..."
···
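
As a rough illustration of how an agent might consume this view, the sketch below parses lines of the form shown above into indexed nodes and builds an action string from a node's index. The line grammar is inferred only from the sample, and the `click [n]` command syntax is a hypothetical placeholder: the benchmark's actual command format is not specified here.

```python
import re
from dataclasses import dataclass

# Assumed line shape, inferred from the sample above: [index] role "label"
NODE_PATTERN = re.compile(r'^\[(\d+)\]\s+(\w+)\s+"(.*)"$')


@dataclass
class Node:
    index: int
    role: str
    label: str


def parse_tree(text: str) -> dict[int, Node]:
    """Map each element index to its parsed node, skipping unparseable lines."""
    nodes: dict[int, Node] = {}
    for line in text.splitlines():
        match = NODE_PATTERN.match(line.strip())
        if match is None:
            continue  # blank lines and elision markers are ignored
        index, role, label = match.groups()
        nodes[int(index)] = Node(int(index), role, label)
    return nodes


def click(node: Node) -> str:
    """Build a command string. The syntax is hypothetical, not the benchmark's."""
    return f"click [{node.index}]"


SAMPLE = """[1] button "Compose"
[2] navigation "Inbox (23)"
[3] navigation "Starred"
[4] navigation "Sent"
[5] row "Dr. Sarah Chen · Re: Lab meeting rescheduled — Hi team, given the..."
···"""

nodes = parse_tree(SAMPLE)
compose = next(n for n in nodes.values() if n.label == "Compose")
print(click(compose))
```

Addressing elements by index rather than by screen coordinates is what lets the agent act on structure instead of pixels.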

Taxonomy

Twelve cognitive primitives

Tasks are designed to isolate specific capabilities. Each task targets 2–3 primary primitives, exposing where agents succeed and where they break down.

memory · planning · attention · exploration · backtracking · adversarial · patience · verification · arithmetic · comprehension · composition · resilience
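
One way to turn per-task outcomes into a per-primitive picture is to credit each task's result to every primitive it targets and then compute a success rate for each. The sketch below assumes that scheme. The task names, primitive assignments, and pass/fail values are invented placeholders for illustration, not benchmark results.

```python
from collections import defaultdict

# Hypothetical records: (task name, primary primitives, passed?).
# These are illustrative placeholders, not real tasks or measured outcomes.
RUNS = [
    ("task_a", ("memory", "planning"), True),
    ("task_b", ("memory", "backtracking"), False),
    ("task_c", ("planning", "verification", "arithmetic"), True),
]


def success_by_primitive(runs):
    """Return the fraction of passed tasks for each targeted primitive."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for _name, primitives, ok in runs:
        for primitive in primitives:
            total[primitive] += 1
            passed[primitive] += int(ok)
    return {p: passed[p] / total[p] for p in total}


rates = success_by_primitive(RUNS)
for primitive, rate in sorted(rates.items()):
    print(f"{primitive}: {rate:.2f}")
```

Because each task targets more than one primitive, a single failure lowers the rate of every primitive it touches, so these rates indicate where to look rather than isolating a cause.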