Architecture
LLMOS is built around a clean separation between deterministic Python logic and LLM predictions. This page covers the core architectural decisions that make the simulator and benchmark interoperable.
Sandwich Architecture
Python handles all deterministic operations — input validation, state mutation, patching — while LLMs are responsible only for predictions. This “sandwich” pattern keeps the system auditable and prevents LLMs from corrupting internal state directly.
| Layer | Responsibility | Handler |
|---|---|---|
| Input validation | Parse and validate agent actions before they reach the simulator | Python |
| Prediction | Predict how the UI state changes in response to an action | LLM |
| State mutation | Apply the predicted patch to the canonical state object | Python |
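The table above can be sketched in code. This is an illustrative outline only, assuming hypothetical function names and a dict-based state; the real validation, prediction, and patching APIs may differ.

```python
import copy

def validate_action(action: dict) -> dict:
    """Python layer: reject malformed actions before they reach the simulator."""
    if "action" not in action or "index" not in action:
        raise ValueError(f"malformed action: {action}")
    return action

def predict_patch(state: dict, action: dict) -> dict:
    """LLM layer (stubbed here): predict how the UI state changes."""
    # In the real system this is an LLM call that returns a state patch.
    return {"last_action": action["action"]}

def apply_patch(state: dict, patch: dict) -> dict:
    """Python layer: apply the predicted patch to a copy of the canonical state."""
    new_state = copy.deepcopy(state)
    new_state.update(patch)
    return new_state

state = {"url": "https://example.com"}
action = validate_action({"action": "click", "index": 2})
state = apply_patch(state, predict_patch(state, action))
```

Because the LLM only produces the patch and never touches `state` itself, every mutation passes through auditable Python code.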
Unified Agent Format
Both the LLMOS simulator and WebAgentBench use the same indexed accessibility tree format for observations. This means an agent trained in simulation sees identical input structure when evaluated on a real browser.
Observations are rendered as indented trees with numeric reference indices:

```
[2] textbox "Search"
[3] option "Option A"
```
Actions are JSON objects that reference elements by their index:
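A minimal example of this shape (the field names here are assumed for illustration; consult the actual action schema):

```json
{"action": "click", "index": 3}
```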
Adapters
Two adapter modules in `shared/` bridge the unified format to their respective backends:

`llmos_adapter.py`
Converts between LLMOS internal state (bid-based element identifiers) and the unified indexed accessibility tree format. Used during simulator episodes so the agent always sees the standard observation structure.
`playwright_adapter.py`
Converts between Playwright's aria_snapshot format and the unified indexed tree, and also executes unified actions on real browser pages. Used by WebAgentBench during live evaluation.
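The indexing step both adapters share can be sketched as follows. This is a simplified illustration, not the real adapter API: it assumes nodes arrive as (role, name) pairs in document order and only shows how numeric reference indices are assigned.

```python
def index_tree(nodes: list[tuple[str, str]]) -> str:
    """Render (role, name) pairs as an indexed accessibility tree.

    Each node gets a numeric reference index that actions can use
    to target it, matching the unified observation format.
    """
    lines = []
    for i, (role, name) in enumerate(nodes, start=1):
        lines.append(f'[{i}] {role} "{name}"')
    return "\n".join(lines)

tree = index_tree([("textbox", "Search"), ("option", "Option A")])
```

The real adapters additionally track the mapping from index back to backend identifiers (LLMOS bids or Playwright locators) so that actions referencing an index can be resolved.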
State Visibility Rules
Different actors in the system see different slices of the world state. Hidden state exists to simulate real-world unknowns the agent must discover through interaction.
| Actor | Visible state |
|---|---|
| Simulator | Full state including hidden_state |
| Agent | Filtered observation only — no hidden information |
| Judge | Full state plus complete episode history |
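A minimal sketch of these visibility rules, assuming a dict-based state with a `hidden_state` key (the actual field names and filtering logic may differ):

```python
def observation_for(actor: str, state: dict, history: list) -> dict:
    """Return the slice of world state visible to a given actor."""
    if actor == "simulator":
        return state                                    # full state, hidden_state included
    if actor == "agent":
        return {k: v for k, v in state.items()
                if k != "hidden_state"}                 # hidden information stripped
    if actor == "judge":
        return {**state, "history": history}            # full state plus episode history
    raise ValueError(f"unknown actor: {actor}")
```

The agent's filtered view is what forces it to discover hidden state through interaction rather than reading it directly.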
Episode Loop
Each episode follows a fixed four-phase loop. The agent never interacts with the simulator directly — all state transitions go through the validated Python layer.
1. Reset: initialize state
2. Agent.act: obs → action
3. Simulator.step: action → patch
4. Judge.evaluate: score episode
The loop repeats Agent.act → Simulator.step until the episode terminates, then Judge.evaluate scores the final state.
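The loop can be sketched as below. Class and method names mirror the phases above, but the real signatures (return values, termination flags) are assumptions.

```python
def run_episode(simulator, agent, judge, max_steps: int = 50):
    """Run one episode: reset, act/step until done, then judge."""
    obs = simulator.reset()                  # Phase 1: initialize state
    for _ in range(max_steps):
        action = agent.act(obs)              # Phase 2: obs -> action
        obs, done = simulator.step(action)   # Phase 3: validate, predict patch, apply
        if done:
            break
    return judge.evaluate()                  # Phase 4: score the final state
```

Note that the agent only ever receives observations and emits actions; it never holds a reference to the simulator's state.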
Multi-Provider LLM Support
Both the agent and simulator support three LLM providers, selectable independently via CLI flags. The vLLM provider targets an OpenAI-compatible endpoint; Tinker inference uses the same interface.
OpenAI: GPT-4o, GPT-4o-mini, and other OpenAI models via the standard API.
Gemini: Gemini 2.5 Pro, Gemini Flash, and other Google models via the Gemini API.
vLLM: Self-hosted or Tinker-hosted models via an OpenAI-compatible endpoint. Used for finetuned Qwen inference.
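A sketch of the provider wiring (endpoint URLs and the mapping below are illustrative, not taken from the LLMOS CLI). Because vLLM and Tinker both speak the OpenAI-compatible chat-completions protocol, only the base URL needs to change for that path:

```python
# provider name -> base URL for OpenAI-compatible providers;
# None means the provider uses its own client (here, the Gemini API).
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "gemini": None,
    "vllm": "http://localhost:8000/v1",  # self-hosted or Tinker endpoint (URL illustrative)
}

def resolve_base_url(provider: str):
    """Look up the endpoint for a provider chosen via a CLI flag."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    return PROVIDERS[provider]
```

Keeping the agent's and simulator's provider choices independent means a finetuned Qwen simulator can be evaluated against an off-the-shelf GPT-4o agent, or vice versa.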