Scoring

How WebAgentBench turns a trajectory into a single number between 0 and 1.

The Score Formula

Every task produces a final score in the range [0, 1]. The formula combines a base score derived from positive criteria, any penalties from negative checks, and a small trajectory efficiency modifier:

final_score = clamp(0-1, base_score - penalties + trajectory_mod)

The result is always clamped to [0, 1], so penalties can never push the score below zero and the trajectory modifier can never push it above one.
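As a rough illustration, the formula could be sketched in Python as follows. The function names `clamp` and `final_score` are assumptions made for this example and are not taken from the WebAgentBench codebase.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def final_score(base_score: float, penalties: float, trajectory_mod: float) -> float:
    """Combine the three components and clamp the result to [0, 1]."""
    return clamp(base_score - penalties + trajectory_mod)


# A perfect run with an efficiency bonus cannot exceed 1.0.
print(final_score(1.0, 0.0, 0.03))   # 1.0
# Heavy penalties cannot drive the score below 0.0.
print(final_score(0.2, 0.5, -0.05))  # 0.0
```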

Base Score

The base score is the fraction of positive criteria the agent satisfied:

base_score = passed_criteria / total_criteria

Each criterion carries an implied weight of 1 / total_criteria. Criteria are defined per-task in the benchmark manifest and cover observable outcomes such as sent emails, filled form fields, clicked buttons, and navigated states.
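A minimal sketch of this calculation, assuming each criterion's outcome is available as a boolean. The function name `base_score` and the decision to reject an empty criteria list are assumptions for illustration only.

```python
def base_score(criteria_results: list[bool]) -> float:
    """Fraction of positive criteria that passed; each one weighs 1 / total."""
    if not criteria_results:
        # Assumption: a task always defines at least one positive criterion.
        raise ValueError("a task needs at least one positive criterion")
    return sum(criteria_results) / len(criteria_results)


print(base_score([True, True, True, False]))  # 0.75
```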

Negative Checks (Penalties)

In addition to positive criteria, tasks may define negative checks: guard-rail assertions that test for behaviours the agent must avoid. Each failed negative check subtracts an explicit penalty value from the base score.

Negative checks model real-world constraints such as privacy requirements, data integrity rules, and interaction boundaries. An agent that completes the task but violates a guard-rail should score lower than one that stops cleanly.

For a full list of negative checks used across the benchmark, see the negative checks reference.
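The penalty rule described in this section might be sketched as below. The `NegativeCheck` type, the `total_penalties` function, and the penalty value of 0.25 are hypothetical; real penalty values are defined per check by the benchmark and are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegativeCheck:
    name: str
    penalty: float  # hypothetical value; real penalties come from the task definition
    violated: bool  # True when the agent did the thing it must avoid


def total_penalties(checks: list[NegativeCheck]) -> float:
    """Sum the penalties of the negative checks the agent violated."""
    return sum((check.penalty for check in checks if check.violated), 0.0)


checks = [
    NegativeCheck("Not Reply All", penalty=0.25, violated=True),
    NegativeCheck("No conflicting times mentioned", penalty=0.25, violated=False),
]
print(total_penalties(checks))  # 0.25
```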

Trajectory Modifier

A small modifier rewards agents that complete tasks efficiently and penalises those that wander. It is determined by the ratio of steps taken to the task's reference step count:

Category	Condition	Modifier
Efficient	steps ≤ 70% of reference	+0.03
Normal	70% < steps ≤ 180% of reference	0.00
Excessive	steps > 180% of reference	−0.05

The modifier is clamped to [−0.10, +0.10] before being added to the score, so no single efficiency signal can dominate the final result.
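The table above can be expressed as a small lookup. This is a sketch under the assumption that step counts are whole numbers; the function name `trajectory_modifier` is illustrative and not part of the benchmark's API.

```python
def trajectory_modifier(steps: int, reference_steps: int) -> float:
    """Map the step count to the efficiency modifier from the table above."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    # Integer comparisons avoid floating-point surprises at the 70% / 180% boundaries.
    if steps * 10 <= reference_steps * 7:
        modifier = 0.03   # Efficient
    elif steps * 10 <= reference_steps * 18:
        modifier = 0.0    # Normal
    else:
        modifier = -0.05  # Excessive
    # The documented clamp; with the three values above it never changes the result.
    return max(-0.10, min(0.10, modifier))


print(trajectory_modifier(7, 10))   # 0.03
print(trajectory_modifier(12, 10))  # 0.0
print(trajectory_modifier(19, 10))  # -0.05
```

Note that both boundaries are inclusive on the lower category: exactly 70% of the reference counts as Efficient, and exactly 180% still counts as Normal.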

Worked Example

Consider the Thread Detective task, which has 5 positive criteria and 2 negative checks. The agent passed 4 of 5 positive criteria and all negative checks, completing the task in the normal step range.

Check	Type	Result	Impact
Exactly one email was sent	Positive	✓	+0.20
Reply sent to correct sender	Positive	✓	+0.20
Contains correct time	Positive	✓	+0.20
Reply is threaded	Positive	✓	+0.20
Targets most recent thread	Positive	✗	+0.00
No conflicting times mentioned	Negative	✓	0.00
Not Reply All	Negative	✓	0.00

base_score     = 4 / 5 = 0.80
penalties      = 0.00
trajectory_mod = 0.00

final_score    = clamp(0-1, 0.80 - 0.00 + 0.00) = 0.80

Pass / Fail

A task is considered passed only when the agent satisfies all positive criteria and all negative checks. Partial credit is reflected in the numeric score, but the binary pass/fail label requires a perfect run.
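The binary label can be sketched as a simple conjunction over both kinds of check. The function name `task_passed` and the use of `True` to mean "check passed" are assumptions for this example.

```python
def task_passed(positive_results: list[bool], negative_results: list[bool]) -> bool:
    """True only when every positive criterion and every negative check passed."""
    return all(positive_results) and all(negative_results)


# Thread Detective run from the worked example: one positive criterion failed.
print(task_passed([True, True, True, True, False], [True, True]))  # False
```

The same run earns a numeric score of 0.80, which shows how a high partial score can still carry a fail label.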

This strict definition ensures that leaderboard pass rates measure complete, safe task completion rather than partial progress, which makes them a more reliable signal for comparing agents.