The DOM Pipeline: From Webpage to LLM and Back

How PageAgent converts a live DOM into text the LLM can reason about, then translates actions back into real clicks

1. Extract
// Live DOM
<form id="login">
  <input type="email" placeholder="Email"
         class="field-3x..." data-testid="email"
         aria-label="Email" />
  <input type="password" placeholder="Password" />
  <button type="submit">Log In</button>
</form>
Full DOM with all attributes, classes, and nesting
2. Dehydrate
// Simplified HTML for the LLM
// (text-based, no screenshots)
[14]<input Email />
[15]<input Password />
[16]<button>Log In</button>
// Each interactive element gets a numeric index.
// Non-interactive nodes are stripped away.
Stripped to indexed interactive elements only
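Dehydration amounts to a tree walk that keeps only interactive elements and hands each one a numeric index. Here is a minimal sketch using plain objects in place of real DOM nodes; the node shape, the `dehydrate` helper, and the set of "interactive" tags are illustrative assumptions, not PageAgent's actual API:

```javascript
// Hypothetical node shape: { tag, attrs, text, children }.
const INTERACTIVE = new Set(["input", "button", "a", "select", "textarea"]);

// Walk the tree, collecting interactive elements with ascending indices.
// The counter starts at 14 to match the example; on a real page, earlier
// elements would have consumed indices 0-13.
function dehydrate(node, out = [], counter = { next: 14 }) {
  if (INTERACTIVE.has(node.tag)) {
    const label =
      node.attrs?.["aria-label"] || node.attrs?.placeholder || node.text || "";
    out.push({ index: counter.next++, tag: node.tag, label });
  }
  for (const child of node.children ?? []) dehydrate(child, out, counter);
  return out;
}

// The login form from step 1, as a plain-object tree:
const form = {
  tag: "form",
  children: [
    { tag: "input", attrs: { placeholder: "Email", "aria-label": "Email" } },
    { tag: "input", attrs: { placeholder: "Password" } },
    { tag: "button", text: "Log In" },
  ],
};

const elements = dehydrate(form);
const rendered = elements
  .map((e) => `[${e.index}]<${e.tag}${e.label ? " " + e.label : ""} />`)
  .join("\n");
console.log(rendered);
// [14]<input Email />
// [15]<input Password />
// [16]<button Log In />
```

The rendered string is the entire "view" the model receives: a few dozen tokens instead of a screenshot.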
3. LLM Thinks
// LLM response (MacroToolInput)
{
  "reflection": "I see a login form. I need to click Log In.",
  "action": {
    "name": "click_element_by_index",
    "args": { "index": 16 }
  }
}
Reflects on state, picks a tool and target index
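Because the response is plain JSON, it can be validated before anything touches the page. A hedged sketch of that guard step; the field names follow the MacroToolInput above, but the `parseAction` helper is hypothetical:

```javascript
// Parse and validate an LLM response shaped like MacroToolInput.
// parseAction is a hypothetical helper, not PageAgent's actual API.
function parseAction(raw) {
  const msg = JSON.parse(raw);
  const { name, args } = msg.action ?? {};
  if (name !== "click_element_by_index" || !Number.isInteger(args?.index)) {
    throw new Error(`Unsupported or malformed action: ${name}`);
  }
  return { name, index: args.index, reflection: msg.reflection };
}

// The response from the example above, as a raw JSON string:
const raw = JSON.stringify({
  reflection: "I see a login form. I need to click Log In.",
  action: { name: "click_element_by_index", args: { index: 16 } },
});

const parsed = parseAction(raw);
console.log(parsed.name, parsed.index); // click_element_by_index 16
```

Rejecting malformed actions here, rather than during execution, keeps a hallucinated tool name or index from ever reaching the page.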
4. Execute
// PageController executes
pageController.clickElement(16)
// Resolves index 16 back to the real DOM node
// and fires a click event.
// SimulatorMask shows a visual highlight on the
// clicked element so the user sees what happened.
Index maps back to real DOM node, click fires
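Execution is then a lookup in the index-to-node map built during dehydration, followed by a dispatched event. A sketch with stub nodes standing in for real DOM elements; only `pageController.clickElement(16)` appears in the text above, so the controller's internals here are assumptions:

```javascript
// Stub nodes stand in for real DOM elements; each records its clicks.
function makeNode(tag) {
  return { tag, clicks: 0, click() { this.clicks++; } };
}

// Hypothetical controller holding the index -> node map from step 2.
class PageController {
  constructor(indexToNode) {
    this.indexToNode = indexToNode;
  }
  clickElement(index) {
    const node = this.indexToNode.get(index);
    if (!node) throw new Error(`No element at index ${index}`);
    // On a live page this would be something like:
    //   node.dispatchEvent(new MouseEvent("click", { bubbles: true }))
    node.click();
    return node;
  }
}

const indexToNode = new Map([
  [14, makeNode("input")],
  [15, makeNode("input")],
  [16, makeNode("button")],
]);

const pageController = new PageController(indexToNode);
const clicked = pageController.clickElement(16);
console.log(clicked.tag, clicked.clicks); // button 1
```

The map is rebuilt on every step, so indices stay valid even as the page mutates between actions.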
Key insight: No screenshots. No multi-modal models. PageAgent converts the DOM to text, letting any cheap text-only LLM drive the interface. This keeps costs low and latency under a second per step.
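The four steps above compose into a single loop. A compact end-to-end sketch with a stub standing in for the real LLM call, and a log standing in for the live page; every name here is illustrative:

```javascript
// Step 2 output: the indexed elements the model will see.
const elements = [
  { index: 14, tag: "input", label: "Email" },
  { index: 15, tag: "input", label: "Password" },
  { index: 16, tag: "button", label: "Log In" },
];
const prompt = elements
  .map((e) => `[${e.index}]<${e.tag} ${e.label} />`)
  .join("\n");

// Step 3: stubbed text-only "LLM" that clicks the last button it sees.
function stubLLM(promptText) {
  const button = elements.filter((e) => e.tag === "button").at(-1);
  return {
    reflection: "I see a login form. I need to click Log In.",
    action: { name: "click_element_by_index", args: { index: button.index } },
  };
}

// Step 4: execute against a log instead of a real page.
const log = [];
const response = stubLLM(prompt);
if (response.action.name === "click_element_by_index") {
  log.push(`click ${response.action.args.index}`);
}
console.log(log[0]); // click 16
```

Swapping the stub for a real model call and the log for a PageController gives the text-only loop the insight describes: no pixels anywhere in the path.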