Research
Benchmarking web agents on cognitive primitives
70 tasks that test whether AI agents can navigate complex web interfaces — requiring memory, planning, backtracking, and adversarial robustness, not just clicking buttons.
70 tasks
12 cognitive primitives
5 difficulty tiers
Environment
Agents see structure, not pixels
Each task presents a fully interactive Gmail simulation. The agent perceives the interface as an indexed accessibility tree and acts through structured commands.
accessibility tree · gmail_thread_detective
[1] button "Compose"
[2] navigation "Inbox (23)"
[3] navigation "Starred"
[4] navigation "Sent"
[5] row "Dr. Sarah Chen · Re: Lab meeting rescheduled — Hi team, given the..."
···
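The observe-then-act loop above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual interface: the `Node`, `render_tree`, and `pick_action` names, and the `click(index)` / `scroll(down)` command format, are all hypothetical stand-ins for whatever structured action schema the environment defines.

```python
# Sketch of how an agent might consume an indexed accessibility tree
# and emit a structured command. All names here are illustrative.

from dataclasses import dataclass


@dataclass
class Node:
    index: int  # the bracketed index the agent references, e.g. [5]
    role: str   # accessibility role: "button", "navigation", "row", ...
    name: str   # visible label, e.g. "Compose"


def render_tree(nodes: list[Node]) -> str:
    """Serialize the tree the way the agent perceives it."""
    return "\n".join(f'[{n.index}] {n.role} "{n.name}"' for n in nodes)


def pick_action(nodes: list[Node], target: str) -> str:
    """Return a structured command (e.g. click(5)) for the first node
    whose label mentions the target string; otherwise keep exploring."""
    for n in nodes:
        if target.lower() in n.name.lower():
            return f"click({n.index})"
    return "scroll(down)"


tree = [
    Node(1, "button", "Compose"),
    Node(2, "navigation", "Inbox (23)"),
    Node(5, "row", "Dr. Sarah Chen · Re: Lab meeting rescheduled"),
]
print(render_tree(tree))
print(pick_action(tree, "lab meeting"))  # click(5)
```

Because the agent acts on indices rather than pixels, an action is valid only against the tree it was chosen from; a real loop would re-render the tree after every command.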
Taxonomy
Twelve cognitive primitives
Tasks are designed to isolate specific capabilities. Each task targets 2–3 primary primitives, exposing where agents succeed and where they break down.
memory · planning · attention · exploration · backtracking · adversarial · patience · verification · arithmetic · comprehension · composition · resilience
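A task spec that enforces the "2–3 primary primitives" rule might look like the sketch below. The `Task` dataclass and its field names are hypothetical; only the twelve primitive names and the five-tier range come from the benchmark description above.

```python
# Hypothetical task specification: each task declares a difficulty tier
# (1-5) and 2-3 primary primitives drawn from the fixed taxonomy.

from dataclasses import dataclass

PRIMITIVES = frozenset({
    "memory", "planning", "attention", "exploration", "backtracking",
    "adversarial", "patience", "verification", "arithmetic",
    "comprehension", "composition", "resilience",
})


@dataclass(frozen=True)
class Task:
    task_id: str
    tier: int                 # 1 (easiest) through 5 (hardest)
    primitives: frozenset     # primary primitives this task isolates

    def __post_init__(self):
        assert self.primitives <= PRIMITIVES, "unknown primitive"
        assert 2 <= len(self.primitives) <= 3, "tasks target 2-3 primitives"
        assert 1 <= self.tier <= 5, "tier out of range"


task = Task(
    task_id="gmail_thread_detective",
    tier=3,
    primitives=frozenset({"memory", "attention", "verification"}),
)
print(task.task_id, sorted(task.primitives))
```

Keeping the primitive set small and fixed is what makes failure analysis tractable: a failed task implicates at most three capabilities rather than "web browsing" as a whole.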