Training Pipeline
How LLMOS fine-tunes web agents using trajectories collected from the simulator and the real benchmark browser — from raw episodes to a deployed model.
Overview
The pipeline runs in three sequential stages, each building on the outputs of the last:
- Data Collection — gather and filter trajectories from the LLMOS simulator and live WebAgentBench browser runs, then export them as conversation-format JSONL.
- SFT (Supervised Fine-Tuning) — train the base model on successful trajectories using next-token prediction on the assistant turns only.
- DPO (Direct Preference Optimisation) — fine-tune further using paired trajectories (good vs. bad) to push the model toward higher-scoring behaviours without a separate reward model.
Data Collection
Training data comes from two complementary sources:
- LLMOS Simulator — the LLM-based UI simulator generates large volumes of synthetic trajectories cheaply. The collector analyses WebAgentBench failure reports to target the exact task types where the agent currently struggles.
- WebAgentBench browser runs — real Playwright episodes against the 15-page benchmark produce ground-truth trajectories with accurate scores, providing a higher-fidelity signal for the most important tasks.
# Collect from simulator (analyzes WAB failures)
python -m llmos collect --wab-results results/webagentbench/baseline.json \
--episodes 20 --output training/data/raw_episodes.jsonl
# Prepare training data
python training/prepare_data.py --llmos-dir llmos/runs/ \
--min-score 0.0 --output training/data/train.jsonl
SFT Stage
Supervised fine-tuning trains the model to imitate successful trajectories. Each episode is serialised as a multi-turn conversation in OpenAI message format: system prompt, then alternating user (observation) and assistant (action) turns. Loss is computed on the assistant turns only, so the model learns to produce well-formed actions given the accessibility-tree observation.
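The serialisation described above can be sketched as follows. The episode layout (`system` prompt plus a list of observation/action `steps`) and the `loss_mask` field are illustrative assumptions, not the exact schema used by prepare_data.py:

```python
def episode_to_example(episode):
    """Serialise one trajectory as an OpenAI-style conversation.

    Assumed episode layout (hypothetical):
      {"system": ..., "steps": [{"observation": ..., "action": ...}, ...]}
    """
    messages = [{"role": "system", "content": episode["system"]}]
    for step in episode["steps"]:
        messages.append({"role": "user", "content": step["observation"]})
        messages.append({"role": "assistant", "content": step["action"]})
    # Loss is computed on assistant turns only: mask out system/user tokens
    mask = [1 if m["role"] == "assistant" else 0 for m in messages]
    return {"messages": messages, "loss_mask": mask}
```

One JSON object per episode, written line by line, yields the conversation-format JSONL consumed by the SFT trainer.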
Training runs on Qwen models via the Tinker cloud GPU API, which allocates on-demand A100 capacity and exposes an OpenAI-compatible endpoint for inference once the job completes.
# SFT on Qwen via Tinker API
python training/train_sft.py --data training/data/train.jsonl \
--model Qwen/Qwen3-30B-A3B
DPO Stage
Direct Preference Optimisation refines the SFT checkpoint using pairs of trajectories on the same task — one higher-scoring (chosen) and one lower-scoring (rejected). The model learns to prefer the chosen trajectory without needing a separate reward model, making DPO significantly cheaper than RLHF while still improving alignment with the benchmark scoring signal.
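Concretely, the DPO objective scores each pair by how much more the policy prefers the chosen trajectory than the frozen SFT reference does. A minimal sketch of the per-pair loss, taking summed trajectory log-probabilities as inputs (the `beta` default is illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) trajectory pair.

    Inputs are summed log-probabilities of each trajectory under the
    current policy (pi_*) and the frozen SFT reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favours the
    # chosen trajectory than the reference model does
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the reference model stands in for an explicit reward model, the only extra cost over SFT is a second forward pass per trajectory.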
Pairs are generated automatically by prepare_dpo.py, which groups episodes by task, sorts by score, and produces (high, low) pairs from the collected runs.
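The grouping-and-pairing logic can be sketched as below. The episode fields (`task_id`, `score`) and the `min_gap` threshold are assumptions for illustration, not the exact interface of prepare_dpo.py:

```python
from collections import defaultdict

def make_dpo_pairs(episodes, min_gap=0.1):
    """Group episodes by task, sort by score, and emit (chosen, rejected)
    pairs from the best- and worst-scoring run of each task."""
    by_task = defaultdict(list)
    for ep in episodes:
        by_task[ep["task_id"]].append(ep)

    pairs = []
    for eps in by_task.values():
        eps.sort(key=lambda e: e["score"], reverse=True)
        # Require at least two runs and a meaningful score gap,
        # otherwise the preference signal is noise
        if len(eps) >= 2 and eps[0]["score"] - eps[-1]["score"] >= min_gap:
            pairs.append({"chosen": eps[0], "rejected": eps[-1]})
    return pairs
```

Tasks with only one run, or with near-identical scores, produce no pair and are silently dropped.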
# Prepare DPO pairs
python training/prepare_dpo.py ...
# DPO training
python training/train_dpo.py ...
Tinker API
All GPU-intensive training and inference runs through the Tinker cloud GPU API. Tinker provides three capabilities used by the pipeline:
- On-demand GPU allocation — A100 nodes are spun up only for the duration of a training job, keeping costs proportional to compute actually used.
- OpenAI-compatible inference — after a training job completes, the resulting checkpoint is automatically deployed behind an endpoint that speaks the OpenAI Chat Completions API, so the existing agent code requires no changes to call the fine-tuned model.
- Model registry — trained checkpoints are versioned in Tinker's model registry, making it easy to roll back to a previous checkpoint or run A/B evaluations between versions.
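Because the deployed endpoint speaks the standard Chat Completions contract, calling a fine-tuned checkpoint is just an HTTP POST. A minimal request-building sketch; the base URL and model name are placeholders filled in from the completed Tinker job, not real values:

```python
def build_chat_request(base_url, model, messages, api_key):
    """Build the URL, headers, and JSON body for an OpenAI-compatible
    Chat Completions call to a Tinker-hosted checkpoint.

    base_url and model are placeholders; any OpenAI-compatible client
    (or plain HTTP) can send the resulting request unchanged.
    """
    url = base_url.rstrip("/") + "/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages}
    return url, headers, body
```

Since the request shape is identical to the one the agent already sends, switching to the fine-tuned model only changes the base URL and model name.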
Model Targets
The pipeline currently targets two Qwen model families:
- Qwen/Qwen3-30B-A3B — a Mixture-of-Experts architecture that activates 3B parameters per forward pass while retaining a 30B parameter pool. This gives strong reasoning capability at inference cost closer to a dense 3B model, making it the primary training target.
- Qwen/Qwen2.5-72B-Instruct — a dense 72B instruction-tuned model used as a high-capacity baseline. Benchmarking against this model shows how much headroom remains for the smaller MoE checkpoint after fine-tuning.