
Benchmark

Task structure, the Gmail environment, difficulty tiers, version history, and how to run WebAgentBench.

Task Structure

Each task is defined as a YAML document with four top-level sections. Together they specify what the agent must do, what the environment looks like, and how success is measured.
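To make the shape concrete, here is an illustrative task file. Only the four top-level section names and the fields named elsewhere on this page (id, page, difficulty, primitives; actors, steps, distractors; positive and negative checks with penalties) are taken from the documentation; the exact values and key syntax are hypothetical:

```yaml
metadata:
  id: gmail_reply_meeting      # hypothetical task id
  page: gmail
  difficulty: medium
  primitives: [memory, attention]

instruction: >
  Find the most recent email from {{target.sender}} and reply
  with the meeting time {{target.time}}.

seed:
  actors:
    - role: target_sender
    - role: distractor_sender
  steps:
    - compose: { from: target_sender, subject: "Meeting" }
  distractors:
    count: 3
    similarity: high

eval:
  positive:
    - reply_sent_to: target_sender
  negative:
    - check: thread_deleted    # guard-rail: behaviour to avoid
      penalty: 0.5
```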

metadata

Identifies the task: id, page, difficulty, and the list of primitives the task exercises. The primitives list links each task to the cognitive taxonomy and is used by the results dashboard to break down performance by skill area.

instruction

A natural-language template shown to the agent. Placeholders of the form {{target.field}} are resolved against the seed's target object at runtime. This keeps the instruction human-readable in the YAML while remaining parametric across fixture variations.

```
Find the most recent email from {{target.sender}} and reply
with the meeting time {{target.time}}.
```
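The placeholder resolution described above can be sketched in a few lines; the function name is hypothetical, but the behaviour (substituting {{target.field}} from the seed's target object) matches the description:

```python
import re

def resolve_instruction(template: str, target: dict) -> str:
    """Replace {{target.field}} placeholders with values from the seed's target object."""
    def substitute(match: re.Match) -> str:
        return str(target[match.group(1)])
    return re.sub(r"\{\{target\.(\w+)\}\}", substitute, template)

template = ("Find the most recent email from {{target.sender}} and reply "
            "with the meeting time {{target.time}}.")
print(resolve_instruction(template, {"sender": "dana@example.com", "time": "3:30 PM"}))
```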
seed

Drives the fixture generator. Contains three sub-sections:

| Key | Purpose |
| --- | --- |
| actors | Named people with roles (e.g. target sender, distractor sender). Resolved to random names and addresses at generation time. |
| steps | Ordered list of seeder operations that build the initial inbox state — compose email, label, archive, etc. |
| distractors | Additional emails or UI elements injected to test attention and filtering. Controlled by count and similarity parameters. |
eval

Defines the scoring criteria as two lists. Positive checks are assertions that must be true for the agent to receive credit — each contributes 1 / total to the base score. Negative checks are guard-rail assertions — behaviours the agent must avoid — each carrying an explicit penalty subtracted from the base score.

Gmail Environment

The Gmail page is a fully interactive React simulation of Gmail. It supports composing, replying, forwarding, labelling, archiving, starring, and searching — enough surface area to host tasks that span multiple cognitive primitives simultaneously.

Unlike a simple mock, the Gmail simulation maintains a live in-memory state so the agent's actions have observable consequences: a sent email appears in Sent, an archived thread leaves the inbox, and a label persists across navigation.
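A toy model of that live state illustrates the idea (class and method names are invented for illustration, not taken from the simulation's code): archiving mutates shared state, so the thread disappears from the inbox view without being deleted.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    subject: str
    archived: bool = False
    labels: set[str] = field(default_factory=set)

class Inbox:
    """Toy in-memory inbox: actions have observable consequences."""
    def __init__(self) -> None:
        self.threads: list[Thread] = []

    def archive(self, thread: Thread) -> None:
        thread.archived = True  # leaves the inbox view, but is not deleted

    def inbox_view(self) -> list[Thread]:
        return [t for t in self.threads if not t.archived]

box = Inbox()
t = Thread("Quarterly report")
box.threads.append(t)
box.archive(t)
print(len(box.inbox_view()))  # 0: the archived thread no longer appears
```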

Fixture Generation Pipeline

| Stage | What happens |
| --- | --- |
| 1. Actor generation | Seed actor roles are resolved to random names, email addresses, and avatar colours. The same role always maps to the same identity within a fixture so evals are deterministic. |
| 2. Seeder steps | Each step in the seed is executed against an empty inbox: compose emails with templated bodies, apply labels, set read/unread state, archive threads, etc. |
| 3. Distractor injection | Distractor emails are generated from the distractor spec and inserted at randomised positions in the inbox to avoid positional bias. |
| 4. Target resolution | All {{target.*}} placeholders in the instruction template are resolved against the finalised fixture, producing the exact string shown to the agent. |
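One way to get the determinism that stage 1 requires is to derive each identity from the fixture seed and the role name, so the same (seed, role) pair always yields the same person. This sketch is an assumption about the mechanism, not the pipeline's actual code:

```python
import hashlib
import random

FIRST_NAMES = ["Avery", "Dana", "Jordan", "Morgan", "Riley"]

def resolve_actor(fixture_seed: int, role: str) -> dict:
    """Deterministically map an actor role to an identity: hashing the
    fixture seed with the role seeds a local RNG, so repeated calls agree."""
    digest = hashlib.sha256(f"{fixture_seed}:{role}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    name = rng.choice(FIRST_NAMES)
    return {"role": role, "name": name, "email": f"{name.lower()}@example.com"}

a = resolve_actor(42, "target_sender")
b = resolve_actor(42, "target_sender")
assert a == b  # same role + seed -> same identity, so evals are deterministic
```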

Difficulty Tiers

Tasks are grouped into four difficulty tiers based on expected step count and the number of cognitive primitives exercised simultaneously. The tiers are calibrated against human baselines, not model performance.

| Tier | Steps | Primitives | Description |
| --- | --- | --- | --- |
| Easy | 5–10 | Single primitive | Isolated skill test. Straightforward instruction, minimal distractors, short action chain. |
| Medium | 10–20 | 2–3 primitives | Requires combining skills such as memory with attention, or patience with backtracking. |
| Hard | 20–35 | Complex reasoning | Multi-step plans, conditional logic, and adversarial distractors designed to mislead. |
| Expert | 30–50 | All primitives | Full-coverage gauntlet. Requires near-perfect planning, resistance to distraction, and recovery from dead ends. |

Version History

WebAgentBench has gone through ten versioned releases. Major milestones are listed below; see webagentbench/CHANGELOG.md for the full per-release notes.

| Version | Pages | Summary |
| --- | --- | --- |
| v1 | 10 | Initial release. Core page set with basic task definitions and programmatic scoring. |
| v2–v4 | 10 | Iterative scoring refinement: negative checks added in v2, trajectory modifier in v3, calibrated weights in v4. |
| v5 | 12 | Major redesign. Two new pages, unified YAML task format, and the actor/seed/distractor fixture pipeline. |
| v6–v8 | 15 | Frontier pages targeting LLM weak spots: adversarial checkout, deep wizard form, and the Gmail environment introduced in v6. |
| v9 | 15 | Hardening release. Tightened eval criteria, distractor density increased, seed stability test suite added. |
| v10 | 15 | Shared-runtime release. Unified indexed accessibility-tree format across simulator and real browser; all pages migrated to the shared adapter. |

Running the Benchmark

WebAgentBench evaluations are driven by webagentbench/agent_eval.py. It spins up a local FastAPI server, initialises each page with a seeded fixture, then runs the agent against the live DOM via Playwright.

```shell
# Evaluate on all 15 pages
python -m webagentbench.agent_eval --model gpt-4o --provider openai

# Specific pages only
python -m webagentbench.agent_eval --model gpt-4o --provider openai \
    --pages dark_checkout wizard_form gmail

# With visible browser (useful for debugging)
python -m webagentbench.agent_eval --model gpt-4o --provider openai --no-headless
```

Results are written to results/webagentbench/results.json and can be visualised with python -m webagentbench.visualize results/webagentbench/results.json.
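If you want to post-process results.json yourself rather than use the visualizer, a small aggregation helper is enough. The record shape below ({"page": ..., "score": ...}) is an assumption about the file format, so adapt the keys to what your results.json actually contains:

```python
import json
from collections import defaultdict

def summarize(results: list[dict]) -> dict[str, float]:
    """Mean score per page, assuming records shaped {"page": ..., "score": ...}."""
    by_page: dict[str, list[float]] = defaultdict(list)
    for record in results:
        by_page[record["page"]].append(record["score"])
    return {page: sum(scores) / len(scores) for page, scores in by_page.items()}

# with open("results/webagentbench/results.json") as f:
#     print(summarize(json.load(f)))
print(summarize([{"page": "gmail", "score": 0.5},
                 {"page": "gmail", "score": 1.0}]))  # {'gmail': 0.75}
```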