Scoring
How WebAgentBench turns a trajectory into a single number between 0 and 1.
The Score Formula
Every task produces a final score in the range [0, 1]. The formula combines a base score derived from positive criteria, any penalties from negative checks, and a small trajectory efficiency modifier:
final_score = clamp(0-1, base_score - penalties + trajectory_mod)
The result is always clamped to [0, 1], so penalties can never push the score below zero and the trajectory modifier can never push it above one.
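As a minimal sketch of the formula above, assuming each component has already been computed as a plain float, the combination and clamp could look like this. The function name and signature are illustrative and not taken from the WebAgentBench codebase.

```python
def final_score(base_score: float, penalties: float, trajectory_mod: float) -> float:
    """Combine the three components, then clamp the result to [0, 1].

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    raw = base_score - penalties + trajectory_mod
    # Clamp: never below 0.0, never above 1.0.
    return max(0.0, min(1.0, raw))
```

With this sketch, a perfect base score plus an efficiency bonus still yields 1.0, and a penalty larger than the base score yields 0.0 rather than a negative number.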
Base Score
The base score is the fraction of positive criteria the agent satisfied:
base_score = passed_criteria / total_criteria
Each criterion carries an implied weight of 1 / total_criteria. Criteria are defined per-task in the benchmark manifest and cover observable outcomes such as sent emails, filled form fields, clicked buttons, and navigated states.
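The base score can be sketched in a few lines. The input validation below is an assumption, since the text does not say how a task with zero criteria is handled, and the function name is illustrative.

```python
def base_score(passed_criteria: int, total_criteria: int) -> float:
    """Fraction of positive criteria satisfied (hypothetical helper)."""
    # Assumption: a task always defines at least one positive criterion.
    if total_criteria <= 0:
        raise ValueError("total_criteria must be positive")
    if not 0 <= passed_criteria <= total_criteria:
        raise ValueError("passed_criteria must be between 0 and total_criteria")
    return passed_criteria / total_criteria
```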
Negative Checks (Penalties)
In addition to positive criteria, tasks may define negative checks — guard-rail assertions that test for behaviours the agent must avoid. Each failing negative check subtracts an explicit penalty value from the base score.
Negative checks model real-world constraints such as privacy requirements, data integrity rules, and interaction boundaries. An agent that completes the task but violates a guard-rail should score lower than one that stops cleanly.
For a full list of negative checks used across the benchmark, see the negative checks reference.
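One way to picture the penalty term is as the sum of the penalty values attached to the negative checks that failed. The sketch below assumes a hypothetical `NegativeCheck` record. Its field names are invented for illustration and are not taken from the benchmark manifest, and no real penalty values are implied.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NegativeCheck:
    """Hypothetical record for one guard-rail assertion."""
    name: str
    penalty: float   # explicit penalty value subtracted when the check fails
    violated: bool   # True if the agent did what it was meant to avoid


def total_penalty(checks: Iterable[NegativeCheck]) -> float:
    """Sum the penalties of the failing negative checks only."""
    return sum((c.penalty for c in checks if c.violated), 0.0)
```

Checks the agent did not violate contribute nothing, so a clean run has a total penalty of 0.0.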
Trajectory Modifier
A small modifier rewards agents that complete tasks efficiently and penalises those that wander. Its value depends on the ratio of steps taken to the task's reference step count:
| Category | Condition | Modifier |
|---|---|---|
| Efficient | steps ≤ 70% of reference | +0.03 |
| Normal | 70% < steps ≤ 180% of reference | 0.00 |
| Excessive | steps > 180% of reference | −0.05 |
The modifier is clamped to [−0.10, +0.10] before being added to the score, so no single efficiency signal can dominate the final result.
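The table and clamp above can be sketched as a small function. The function name is an assumption. The thresholds are compared with integer arithmetic so that the boundary cases (exactly 70% and exactly 180% of the reference) are not affected by floating-point rounding.

```python
def trajectory_modifier(steps: int, reference_steps: int) -> float:
    """Map the step count to an efficiency modifier (hypothetical helper)."""
    # Assumption: every task defines a positive reference step count.
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")

    if steps * 10 <= 7 * reference_steps:       # steps <= 70% of reference
        modifier = 0.03
    elif steps * 10 <= 18 * reference_steps:    # 70% < steps <= 180%
        modifier = 0.0
    else:                                       # steps > 180% of reference
        modifier = -0.05

    # Clamp to [-0.10, +0.10] before the value is added to the score.
    return max(-0.10, min(0.10, modifier))
```

With the three values in the table the clamp never changes the result. It is kept in the sketch only to mirror the description.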
Worked Example
Consider the Thread Detective task, which has 5 positive criteria and 2 negative checks. The agent passed 4 of 5 positive criteria and all negative checks, completing the task in the normal step range.
| Check | Type | Result | Impact |
|---|---|---|---|
| Exactly one email was sent | Positive | ✓ | +0.20 |
| Reply sent to correct sender | Positive | ✓ | +0.20 |
| Contains correct time | Positive | ✓ | +0.20 |
| Reply is threaded | Positive | ✓ | +0.20 |
| Targets most recent thread | Positive | ✗ | +0.00 |
| No conflicting times mentioned | Negative | ✓ | 0.00 |
| Not Reply All | Negative | ✓ | 0.00 |
base_score = 4 / 5 = 0.80
penalties = 0.00
trajectory_mod = 0.00
final_score = clamp(0-1, 0.80 - 0.00 + 0.00) = 0.80
Pass / Fail
A task is considered passed only when the agent satisfies all positive criteria and all negative checks. Partial credit is reflected in the numeric score, but the binary pass/fail label requires a perfect run.
This strict definition ensures that leaderboard pass-rates measure complete, safe task completion rather than partial progress, making them a more reliable signal for comparing agents.
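To show how the numeric score and the binary label can diverge, the sketch below replays the Thread Detective example. The dictionary layout and variable names are assumptions for illustration. The per-check penalty values for this task are not given in the text, so the penalty term is simply 0.0 because no negative check failed.

```python
# True means the check passed. The layout is hypothetical.
positive_results = {
    "Exactly one email was sent": True,
    "Reply sent to correct sender": True,
    "Contains correct time": True,
    "Reply is threaded": True,
    "Targets most recent thread": False,
}
negative_results = {
    "No conflicting times mentioned": True,
    "Not Reply All": True,
}

base = sum(positive_results.values()) / len(positive_results)
penalties = 0.0        # no negative check failed in this run
trajectory_mod = 0.0   # the run fell in the normal step range

score = max(0.0, min(1.0, base - penalties + trajectory_mod))

# The pass label requires every positive criterion and every negative check.
passed = all(positive_results.values()) and all(negative_results.values())

print(f"score={score:.2f} passed={passed}")
```

The run earns partial credit of 0.80 but is still labelled as failed, because one positive criterion was not met.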