Architecture
LLMOS is built around a clean separation between deterministic Python logic and LLM predictions. This page covers the core architectural decisions that make the simulator and benchmark interoperable.
Sandwich Architecture
Python handles all deterministic operations — input validation, state mutation, patching — while LLMs are responsible only for predictions. This “sandwich” pattern keeps the system auditable and prevents LLMs from corrupting internal state directly.
| Layer | Responsibility | Handler |
|---|---|---|
| Input validation | Parse and validate agent actions before they reach the simulator | Python |
| Prediction | Predict how the UI state changes in response to an action | LLM |
| State mutation | Apply the predicted patch to the canonical state object | Python |
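The table above can be sketched in code. This is an illustrative outline only, assuming hypothetical function names and a dict-based state; the real validation, prediction, and patching APIs may differ.

```python
import copy

def validate_action(action: dict) -> dict:
    """Python layer: reject malformed actions before they reach the simulator."""
    if "action" not in action or "index" not in action:
        raise ValueError(f"malformed action: {action}")
    return action

def predict_patch(state: dict, action: dict) -> dict:
    """LLM layer (stubbed here): predict how the UI state changes."""
    # In the real system this is an LLM call that returns a state patch.
    return {"last_action": action["action"]}

def apply_patch(state: dict, patch: dict) -> dict:
    """Python layer: apply the predicted patch to a copy of the canonical state."""
    new_state = copy.deepcopy(state)
    new_state.update(patch)
    return new_state

state = {"url": "https://example.com"}
action = validate_action({"action": "click", "index": 2})
state = apply_patch(state, predict_patch(state, action))
```

Because the LLM only produces the patch and never touches `state` itself, every mutation passes through auditable Python code.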
Unified Agent Format
Both the LLMOS simulator and WebAgentBench use the same indexed accessibility tree format for observations. This means an agent trained in simulation sees identical input structure when evaluated on a real browser.
Observations are rendered as indented trees with numeric reference indices:

```
[2] textbox "Search"
[3] option "Option A"
```
Actions are JSON objects that reference elements by their index:
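A minimal example of this shape (the field names here are assumed for illustration; consult the actual action schema):

```json
{"action": "click", "index": 3}
```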
Adapters
Two adapter modules in `shared/` bridge the unified format to their respective backends:

`llmos_adapter.py`
Converts between LLMOS internal state (bid-based element identifiers) and the unified indexed accessibility tree format. Used during simulator episodes so the agent always sees the standard observation structure.
`playwright_adapter.py`
Converts between Playwright's aria_snapshot format and the unified indexed tree, and also executes unified actions on real browser pages. Used by WebAgentBench during live evaluation.
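The indexing step both adapters share can be sketched as follows. This is a simplified illustration, not the real adapter API: it assumes nodes arrive as (role, name) pairs in document order and only shows how numeric reference indices are assigned.

```python
def index_tree(nodes: list[tuple[str, str]]) -> str:
    """Render (role, name) pairs as an indexed accessibility tree.

    Each node gets a numeric reference index that actions can use
    to target it, matching the unified observation format.
    """
    lines = []
    for i, (role, name) in enumerate(nodes, start=1):
        lines.append(f'[{i}] {role} "{name}"')
    return "\n".join(lines)

tree = index_tree([("textbox", "Search"), ("option", "Option A")])
```

The real adapters additionally track the mapping from index back to backend identifiers (LLMOS bids or Playwright locators) so that actions referencing an index can be resolved.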
State Visibility Rules
Different actors in the system see different slices of the world state. Hidden state exists to simulate real-world unknowns the agent must discover through interaction.
| Actor | Visible state |
|---|---|
| Simulator | Full state including hidden_state |
| Agent | Filtered observation only — no hidden information |
| Judge | Full state plus complete episode history |
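A minimal sketch of these visibility rules, assuming a dict-based state with a `hidden_state` key (the actual field names and filtering logic may differ):

```python
def observation_for(actor: str, state: dict, history: list) -> dict:
    """Return the slice of world state visible to a given actor."""
    if actor == "simulator":
        return state                                    # full state, hidden_state included
    if actor == "agent":
        return {k: v for k, v in state.items()
                if k != "hidden_state"}                 # hidden information stripped
    if actor == "judge":
        return {**state, "history": history}            # full state plus episode history
    raise ValueError(f"unknown actor: {actor}")
```

The agent's filtered view is what forces it to discover hidden state through interaction rather than reading it directly.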
Episode Loop
Each episode follows a fixed four-phase loop. The agent never interacts with the simulator directly — all state transitions go through the validated Python layer.
1. Reset: initialize state
2. Agent.act: obs → action
3. Simulator.step: action → patch
4. Judge.evaluate: score episode
The loop repeats Agent.act → Simulator.step until the episode terminates, then Judge.evaluate scores the final state.
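The loop can be sketched as below. Class and method names mirror the phases above, but the real signatures (return values, termination flags) are assumptions.

```python
def run_episode(simulator, agent, judge, max_steps: int = 50):
    """Run one episode: reset, act/step until done, then judge."""
    obs = simulator.reset()                  # Phase 1: initialize state
    for _ in range(max_steps):
        action = agent.act(obs)              # Phase 2: obs -> action
        obs, done = simulator.step(action)   # Phase 3: validate, predict patch, apply
        if done:
            break
    return judge.evaluate()                  # Phase 4: score the final state
```

Note that the agent only ever receives observations and emits actions; it never holds a reference to the simulator's state.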
Multi-Provider LLM Support
Both the agent and simulator support three LLM providers, selectable independently via CLI flags. The vLLM provider targets an OpenAI-compatible endpoint; Tinker inference uses the same interface.
OpenAI: GPT-4o, GPT-4o-mini, and other OpenAI models via the standard API.
Gemini: Gemini 2.5 Pro, Gemini Flash, and other Google models via the Gemini API.
vLLM: Self-hosted or Tinker-hosted models via an OpenAI-compatible endpoint. Used for finetuned Qwen inference.
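A sketch of the provider wiring (endpoint URLs and the mapping below are illustrative, not taken from the LLMOS CLI). Because vLLM and Tinker both speak the OpenAI-compatible chat-completions protocol, only the base URL needs to change for that path:

```python
# provider name -> base URL for OpenAI-compatible providers;
# None means the provider uses its own client (here, the Gemini API).
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "gemini": None,
    "vllm": "http://localhost:8000/v1",  # self-hosted or Tinker endpoint (URL illustrative)
}

def resolve_base_url(provider: str):
    """Look up the endpoint for a provider chosen via a CLI flag."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider]
```

Keeping the agent's and simulator's provider choices independent means a finetuned Qwen simulator can be evaluated against an off-the-shelf GPT-4o agent, or vice versa.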