Cognitive Primitives
WebAgentBench organizes web agent capabilities into 12 cognitive primitives — the atomic skills required to complete realistic browser tasks. Each benchmark page is designed to stress one or more of these primitives, enabling precise diagnosis of where and why agents fail.
Taxonomy
Maintaining and retrieving relevant context across many steps, pages, or interaction phases. Example: finding a name in one email, then referencing it when composing a reply several steps later.
Decomposing a complex task into sub-goals and coordinating constraints across steps. Example: applying interdependent filters in the correct order when the sequence matters.
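The ordering constraint in the filter example can be pictured as a dependency graph. The following is a minimal sketch under stated assumptions: the filter names and the `depends_on` mapping are hypothetical and not taken from WebAgentBench. It uses Python's standard `graphlib` module to derive an order in which every filter is applied after the filters it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical filters: each key maps to the filters that must be applied first.
depends_on = {
    "category": set(),
    "brand": {"category"},        # brand options appear only after a category is chosen
    "price_range": {"category"},
    "model": {"brand"},
}

def filter_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every filter follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())

order = filter_order(depends_on)
```

Several valid orders may exist; the only guarantee is that prerequisites come first.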
Maintaining goal-directed behavior despite popups, modals, overlays, and distractions. Example: dismissing a newsletter modal, cookie banner, and chat widget to reach the actual content.
Systematically searching through alternatives when the initial approach fails. Example: loading additional search results and expanding collapsed sections to find hidden data.
Detecting that a chosen path is wrong and reverting to a prior decision point. Example: navigating back through a multi-step wizard after discovering a coverage gap.
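Reverting to a prior decision point can be sketched as a stack of decisions, each remembering the options it has not yet tried. This is an illustrative sketch only: the `DecisionPoint` and `Backtracker` classes and the wizard step names are hypothetical, not part of the benchmark or of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DecisionPoint:
    step: str              # hypothetical identifier, e.g. a wizard page
    untried: list[str]     # options not yet attempted at this step

@dataclass
class Backtracker:
    stack: list[DecisionPoint] = field(default_factory=list)

    def choose(self, step: str, options: list[str]) -> str:
        """Record a decision point and take its first option (options must be non-empty)."""
        self.stack.append(DecisionPoint(step, list(options[1:])))
        return options[0]

    def backtrack(self) -> Optional[tuple[str, str]]:
        """Return (step, next_option) for the latest decision that still has alternatives."""
        while self.stack:
            point = self.stack[-1]
            if point.untried:
                return point.step, point.untried.pop(0)
            self.stack.pop()   # exhausted: revert one decision further back
        return None            # nothing left to try

bt = Backtracker()
plan = bt.choose("plan", ["basic", "premium"])
addon = bt.choose("addon", ["none"])
retry = bt.backtrack()   # "addon" has no alternatives, so control returns to "plan"
```

A real agent would also need to restore page state (for example by navigating back) when it reverts; the sketch tracks only the decisions.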
Resisting dark patterns, misleading labels, confirmshaming, and deliberate UI deceptions. Example: recognizing that a prominently placed button starts a subscription rather than completing the purchase.
Waiting for asynchronous content to load and not acting prematurely on incomplete information. Example: waiting for a spinner to finish loading additional results before concluding a search.
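Waiting for asynchronous content usually comes down to polling a readiness condition with a timeout. The sketch below assumes a hypothetical `FakePage` stand-in whose spinner disappears after a few polls. The `wait_until` helper is illustrative and not an API of WebAgentBench or of any browser automation library.

```python
import time
from typing import Callable

def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False       # caller should not treat partial content as final
        time.sleep(interval)

class FakePage:
    """Hypothetical stand-in for a page whose spinner disappears after N polls."""
    def __init__(self, polls_until_ready: int) -> None:
        self.remaining = polls_until_ready

    def spinner_gone(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0

page = FakePage(polls_until_ready=3)
ready = wait_until(page.spinner_gone, timeout=1.0, interval=0.01)
```

Returning `False` on timeout, rather than raising, lets the caller decide whether to retry or to report that the search could not be completed.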
Confirming that an action achieved its intended effect, especially with misleading feedback. Example: checking the sidebar after a success banner to verify settings were actually saved.
Performing numerical calculations correctly within multi-step workflows. Example: summing invoice line items and verifying the total matches across documents.
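The invoice check can be illustrated with exact decimal arithmetic, which avoids the rounding surprises of binary floats. The line items and the stated total below are made-up values for illustration, not benchmark data.

```python
from decimal import Decimal

def invoice_total(line_items: list[tuple[int, str]]) -> Decimal:
    """Sum quantity * unit price, with prices given as decimal strings."""
    return sum((qty * Decimal(price) for qty, price in line_items), Decimal("0"))

# Hypothetical line items: (quantity, unit price)
items = [(2, "19.99"), (1, "5.01"), (3, "0.10")]
stated_total = Decimal("45.29")   # total printed on a second document

total = invoice_total(items)
matches = total == stated_total

# The same kind of comparison with floats can fail: 0.1 + 0.2 is not exactly 0.3.
float_pitfall = (0.1 + 0.2 == 0.3)
```

Parsing prices from strings rather than floats keeps the values exact from the moment they are read off the page.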
Extracting meaning from complex, multi-part text and following nuanced instructions. Example: parsing a long email thread to identify which of several proposed times is conflict-free.
Combining multiple primitives within a single task to achieve a complex goal. Example: a task requiring exploration to find data, memory to retain it, and planning to act on it.
Handling failures gracefully: interpreting errors, retrying with modifications, and preserving progress. Example: persisting through form submission errors and saving a draft before a session-clearing failure.
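The three parts of graceful failure handling (interpret the error, retry with a modification, preserve progress) can be sketched as a small retry loop. Everything named here is hypothetical: `SubmissionError`, `submit_with_recovery`, and the fake form functions are illustrative stand-ins, not part of WebAgentBench.

```python
import time
from typing import Callable

class SubmissionError(Exception):
    """Hypothetical error raised when a form submission is rejected."""

def submit_with_recovery(
    submit: Callable[[dict], str],
    save_draft: Callable[[dict], None],
    form: dict,
    fix: Callable[[dict, SubmissionError], dict],
    max_attempts: int = 3,
    delay: float = 0.0,
) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    save_draft(form)                      # preserve progress before the risky action
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(form)
        except SubmissionError as err:
            if attempt == max_attempts:
                raise                     # out of attempts: surface the last error
            form = fix(form, err)         # retry with a modification, not the same input
            save_draft(form)
            time.sleep(delay)
    raise AssertionError("unreachable")

drafts: list[dict] = []

def fake_submit(form: dict) -> str:
    if "-" in form["phone"]:
        raise SubmissionError("phone must contain digits only")
    return "submitted"

def strip_dashes(form: dict, err: SubmissionError) -> dict:
    return {**form, "phone": form["phone"].replace("-", "")}

result = submit_with_recovery(fake_submit, drafts.append,
                              {"phone": "555-0100"}, strip_dashes)
```

Saving a draft after each modification means a later session-clearing failure loses at most the in-flight attempt.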
Research Basis
The primitives taxonomy is grounded in published findings on web agent failure modes:
- Agents achieve only 30–61% on realistic web tasks (Online-Mind2Web, COLM 2025)
- Injecting realistic network errors causes 70–95% performance drops (WAREX, 2025)
- Vision-language agents click adversarial pop-ups 86–100% of the time (PopupAttack, ACL 2025)
- A single dark pattern compromises agent intent in 41% of runs (Ersoy et al., IEEE S&P 2026)
- Explicit backtracking improves success by ~7.6% on GUI benchmarks (BacktrackAgent, EMNLP 2025)
- Separating planning from execution raises the WebArena-Lite success rate to 57.58% (Plan-and-Act, ICML 2025)