# Benchmark
Task structure, the Gmail environment, difficulty tiers, version history, and how to run WebAgentBench.
## Task Structure
Each task is defined as a YAML document with four top-level sections. Together they specify what the agent must do, what the environment looks like, and how success is measured.
The first section identifies the task: its id, page, difficulty, and the list of primitives the task exercises. The primitives list ties each task to the cognitive taxonomy and is used by the results dashboard to break performance down by skill area.
The second section is a natural-language instruction template shown to the agent. Placeholders of the form {{target.field}} are resolved against the seed's target object at runtime, which keeps the instruction human-readable in the YAML while remaining parametric across fixture variations. For example:
```
Find the most recent email from {{target.sender}} and reply
with the meeting time {{target.time}}.
```

The third section drives the fixture generator. It contains three sub-sections:
| Key | Purpose |
|---|---|
| actors | Named people with roles (e.g. target sender, distractor sender). Resolved to random names and addresses at generation time. |
| steps | Ordered list of seeder operations that build the initial inbox state — compose email, label, archive, etc. |
| distractors | Additional emails or UI elements injected to test attention and filtering. Controlled by count and similarity parameters. |
The fourth section defines the scoring criteria as two lists. Positive checks are assertions that must hold for the agent to receive credit; each contributes 1 / total to the base score. Negative checks are guard-rail assertions (behaviours the agent must avoid), each carrying an explicit penalty subtracted from the base score. For example, a task with four positive checks of which the agent passes three, plus one tripped negative check with a 0.25 penalty, scores 3/4 − 0.25 = 0.5.
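Putting the four sections together, a complete task definition might look like the following sketch. The top-level key names (meta, instruction, seed, checks) and the individual check names are illustrative assumptions, not the benchmark's actual schema:

```yaml
# Hypothetical task document; key and check names are assumptions,
# not the real WebAgentBench schema.
meta:
  id: gmail_reply_meeting_time
  page: gmail
  difficulty: medium
  primitives: [memory, attention]

instruction: >
  Find the most recent email from {{target.sender}} and reply
  with the meeting time {{target.time}}.

seed:
  actors:
    - role: target_sender
    - role: distractor_sender
  steps:
    - compose: { from: target_sender, subject: "Meeting time" }
    - label: { thread: 0, name: "Work" }
  distractors:
    count: 5
    similarity: high

checks:
  positive:
    - reply_sent_to: target_sender
    - reply_contains: "{{target.time}}"
  negative:
    - { check: archived_wrong_thread, penalty: 0.5 }
```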
## Gmail Environment
The Gmail page is a fully interactive React simulation of Gmail. It supports composing, replying, forwarding, labelling, archiving, starring, and searching — enough surface area to host tasks that span multiple cognitive primitives simultaneously.
Unlike a simple mock, the Gmail simulation maintains a live in-memory state so the agent's actions have observable consequences: a sent email appears in Sent, an archived thread leaves the inbox, and a label persists across navigation.
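The live-state behaviour can be pictured with a minimal sketch. The class and method names below are illustrative assumptions, not the simulation's actual API:

```python
# Minimal sketch of a stateful inbox where actions have observable
# consequences. Names are illustrative, not the benchmark's real API.

class Inbox:
    def __init__(self):
        self.inbox = []   # threads currently visible in the inbox
        self.sent = []    # emails the agent has sent
        self.labels = {}  # thread id -> set of label names

    def receive(self, thread_id, subject):
        self.inbox.append({"id": thread_id, "subject": subject})

    def send(self, to, subject):
        # A sent email appears in Sent.
        self.sent.append({"to": to, "subject": subject})

    def archive(self, thread_id):
        # An archived thread leaves the inbox but keeps its labels.
        self.inbox = [t for t in self.inbox if t["id"] != thread_id]

    def label(self, thread_id, name):
        # A label persists across navigation.
        self.labels.setdefault(thread_id, set()).add(name)


inbox = Inbox()
inbox.receive(1, "Meeting time")
inbox.label(1, "Work")
inbox.send("alice@example.com", "Re: Meeting time")
inbox.archive(1)
assert inbox.inbox == []          # archived thread left the inbox
assert len(inbox.sent) == 1       # sent email shows up in Sent
assert "Work" in inbox.labels[1]  # label survived the archive
```

Because the state is held in memory rather than stubbed per-screen, eval checks can assert on the consequences of actions instead of on which buttons were clicked.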
### Fixture Generation Pipeline
| Stage | What happens |
|---|---|
| 1. Actor generation | Seed actor roles are resolved to random names, email addresses, and avatar colours. The same role always maps to the same identity within a fixture so evals are deterministic. |
| 2. Seeder steps | Each step in the seed is executed against an empty inbox: compose emails with templated bodies, apply labels, set read/unread state, archive threads, etc. |
| 3. Distractor injection | Distractor emails are generated from the distractor spec and inserted at randomised positions in the inbox to avoid positional bias. |
| 4. Target resolution | All {{target.*}} placeholders in the instruction template are resolved against the finalised fixture, producing the exact string shown to the agent. |
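Stage 4 can be sketched with a small regex-based resolver. The resolve helper and the shape of the target dict are assumptions for illustration, not the pipeline's real implementation:

```python
import re

def resolve(template, target):
    """Replace {{target.field}} placeholders with values from the
    fixture's target object. Hypothetical sketch of stage 4."""
    def sub(match):
        field = match.group(1)
        return str(target[field])
    return re.sub(r"\{\{target\.(\w+)\}\}", sub, template)

template = ("Find the most recent email from {{target.sender}} and reply "
            "with the meeting time {{target.time}}.")
target = {"sender": "Dana Ortiz", "time": "3:30 PM"}
print(resolve(template, target))
# Find the most recent email from Dana Ortiz and reply with the meeting time 3:30 PM.
```

Because resolution happens after the fixture is finalised, the string the agent sees is guaranteed to match the emails actually seeded into the inbox.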
## Difficulty Tiers
Tasks are grouped into four difficulty tiers based on expected step count and the number of cognitive primitives exercised simultaneously. The tiers are calibrated against human baselines, not model performance.
| Tier | Steps | Primitives | Description |
|---|---|---|---|
| Easy | 5 – 10 | Single-primitive | Isolated skill test. Straightforward instruction, minimal distractors, short action chain. |
| Medium | 10 – 20 | 2 – 3 primitives | Requires combining skills such as memory with attention, or patience with backtracking. |
| Hard | 20 – 35 | Complex reasoning | Multi-step plans, conditional logic, and adversarial distractors designed to mislead. |
| Expert | 30 – 50 | All primitives | Full-coverage gauntlet. Requires near-perfect planning, resistance to distraction, and recovery from dead ends. |
## Version History
WebAgentBench has gone through ten versioned releases. Major milestones are listed below; see webagentbench/CHANGELOG.md for the full per-release notes.
| Version | Pages | Summary |
|---|---|---|
| v1 | 10 | Initial release. Core page set with basic task definitions and programmatic scoring. |
| v2 – v4 | 10 | Iterative scoring refinement: negative checks added in v2, trajectory modifier in v3, calibrated weights in v4. |
| v5 | 12 | Major redesign. Two new pages, unified YAML task format, and the actor/seed/distractor fixture pipeline. |
| v6 – v8 | 15 | Frontier pages targeting LLM weak spots: adversarial checkout, deep wizard form, and the Gmail environment introduced in v6. |
| v9 | 15 | Hardening release. Tightened eval criteria, distractor density increased, seed stability test suite added. |
| v10 | 15 | Shared-runtime release. Unified indexed accessibility-tree format across simulator and real browser; all pages migrated to the shared adapter. |
## Running the Benchmark
WebAgentBench evaluations are driven by webagentbench/agent_eval.py. It spins up a local FastAPI server, initialises each page with a seeded fixture, then runs the agent against the live DOM via Playwright.
```
# Evaluate on all 15 pages
python -m webagentbench.agent_eval --model gpt-4o --provider openai

# Specific pages only
python -m webagentbench.agent_eval --model gpt-4o --provider openai \
    --pages dark_checkout wizard_form gmail

# With visible browser (useful for debugging)
python -m webagentbench.agent_eval --model gpt-4o --provider openai --no-headless
```

Results are written to results/webagentbench/results.json and can be visualised with python -m webagentbench.visualize results/webagentbench/results.json.