PageAgent Puts an AI Copilot Inside Every Web Page

Alibaba's open-source JavaScript library lets any LLM control a web interface through natural language. No server, no Python, no headless browser. Just a script tag.

alibaba/page-agent · 10 min read

A translucent AI spirit hand reaches from a chat bubble and gently presses a button on a web form. The web page floats in mid-air with indexed elements highlighted. Small numbered tags appear on interactive elements.
An AI agent reaches into the page and clicks for you. That is the entire pitch.

The Problem With Web Automation in 2026

The AI browser automation space has exploded. Browser Use crossed 50,000 stars. Stagehand v3 rewrote itself from scratch. Vercel shipped Agent Browser. Every month brings a new Python framework that spins up a headless Chromium, takes screenshots, and feeds pixels to a vision model.

These tools are powerful. They are also heavy. They run on your server, consume GPU cycles for screenshot interpretation, and require an entirely separate runtime from the web application they control.

For a SaaS company that wants to ship an AI copilot inside its own product, this architecture is backwards. You already have the DOM. You already have the user's browser. Why send screenshots to a server when the structured data is sitting right there in the page?

Split composition. Left side shows a convoluted path from a web page through a server with screenshots and a vision model and back. Right side shows a direct short path from a chat bubble straight into the same web page. The contrast between the two approaches is stark.
Server-side screenshot automation versus client-side DOM manipulation. Two fundamentally different philosophies.

One Script Tag, Zero Infrastructure

PageAgent's pitch is almost absurdly simple. Add a single script tag to your HTML and you get a fully functional AI agent living inside the page.

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.5.9/dist/iife/page-agent.demo.js" crossorigin="true"></script>

That one line loads the agent, renders a chat panel, and connects to a free demo LLM endpoint so you can try it immediately. For production use, you bring your own model:

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
    model: 'qwen3.5-plus',
    baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    apiKey: 'YOUR_API_KEY',
    language: 'en-US',
})

await agent.execute('Click the login button')

No Python. No headless browser. No Playwright. No server round-trip. The agent runs in the user's browser, reads the DOM directly, and sends text to any OpenAI-compatible LLM endpoint. It works with Qwen, GPT-4, Claude, Gemini, or a local Ollama instance.
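As a sketch of that flexibility, the same constructor shape shown above could point at a local Ollama instance, which exposes an OpenAI-compatible endpoint at /v1 by default. The model name here is illustrative, and Ollama ignores the API key:

```javascript
import { PageAgent } from 'page-agent'

// Hypothetical local-model configuration (same options as the example above;
// the model name is illustrative -- use whatever model you have pulled).
const agent = new PageAgent({
    model: 'llama3',                        // any local model Ollama serves
    baseURL: 'http://localhost:11434/v1',   // Ollama's OpenAI-compatible endpoint
    apiKey: 'ollama',                       // required by the client, ignored by Ollama
    language: 'en-US',
})

await agent.execute('Click the login button')
```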

"No need for browser extension / python / headless browser. Just in-page JavaScript. Everything happens in your web page."

-- PageAgent README

Text, Not Pixels

Most AI browser agents take screenshots and feed them to multi-modal vision models. This approach is expensive. Vision tokens cost more. Latency increases. And you need a model that supports image input.

PageAgent takes a completely different path. It reads the live DOM, extracts interactive elements, strips away non-essential markup, and sends a simplified text representation to the LLM. Each interactive element gets a numeric index. The LLM reasons over text and returns an action referencing that index.

This means any text-only LLM can drive the interface. A cheap Qwen 3.5 model works fine. You do not need GPT-4 Vision or Claude with image support. The cost per agent step drops dramatically.
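To make the idea concrete, here is a sketch of the kind of indexed text a text-only LLM might receive. The exact serialization is internal to @page-agent/page-controller; the format and helper below are illustrative only:

```javascript
// Illustrative only: a dehydrated page keeps just the interactive elements,
// each tagged with a numeric index the LLM can reference in its actions.
// This is NOT PageAgent's actual serialization format.
const elements = [
  { tag: 'input', attrs: { type: 'email', placeholder: 'Email' } },
  { tag: 'input', attrs: { type: 'password', placeholder: 'Password' } },
  { tag: 'button', text: 'Log in' },
];

// Serialize to the compact text a text-only LLM can reason over.
function dehydrate(elements) {
  return elements
    .map((el, i) => {
      const attrs = Object.entries(el.attrs ?? {})
        .map(([k, v]) => `${k}="${v}"`)
        .join(' ');
      return `[${i}]<${el.tag}${attrs ? ' ' + attrs : ''}>${el.text ?? ''}</${el.tag}>`;
    })
    .join('\n');
}

console.log(dehydrate(elements));
// [0]<input type="email" placeholder="Email"></input>
// [1]<input type="password" placeholder="Password"></input>
// [2]<button>Log in</button>
```

A page serialized this way costs a few dozen tokens instead of the thousands a screenshot would consume as image input.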

The ReAct Loop Under the Hood

PageAgent implements a classic ReAct (Reason + Act) agent loop. Each step follows a strict sequence: observe the current page state, think by calling the LLM with reflection, then act by executing the chosen tool. The loop repeats until the task completes or hits the maximum step limit (default: 40).

The "think" phase is interesting. The LLM does not just pick an action. It first generates a reflection: an evaluation of what happened in the previous step, what the current state looks like, and what the next goal should be. This reflection-before-action pattern, baked into the @page-agent/llms package, produces more reliable multi-step behavior than raw tool calling.

{
  "reflection": "The form is now visible. I see email and password fields. I need to fill in the email first.",
  "action": {
    "name": "input_text",
    "args": { "index": 14, "text": "user@example.com" }
  }
}

The available tools are straightforward: click_element_by_index, input_text, select_dropdown_option, scroll, scroll_horizontally, execute_javascript, wait, ask_user, and done. No exotic APIs. Just the primitives you need to operate a web interface.
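The control flow is small enough to sketch in full. This is not PageAgent's actual implementation, just the shape of a ReAct loop, demonstrated with a scripted "LLM" and a mocked click tool:

```javascript
// Minimal ReAct loop sketch (illustrative, not PageAgent's real code).
// Each iteration: observe the page, ask the LLM for a reflection + action,
// then execute the chosen tool -- until `done` or the step budget runs out.
const MAX_STEPS = 40; // PageAgent's documented default step limit

async function runAgent({ observe, think, tools }) {
  const history = [];
  for (let step = 0; step < MAX_STEPS; step++) {
    const state = observe();                                    // 1. observe
    const { reflection, action } = await think(state, history); // 2. think
    history.push({ reflection, action });
    if (action.name === 'done') return history;                 // task finished
    await tools[action.name](action.args);                      // 3. act
  }
  throw new Error('step limit reached');
}

// Demo: a scripted "LLM" that clicks element [2], then reports done.
const clicks = [];
const script = [
  { reflection: 'Login button is at index 2.',
    action: { name: 'click_element_by_index', args: { index: 2 } } },
  { reflection: 'Click succeeded, task complete.',
    action: { name: 'done', args: {} } },
];
let turn = 0;

runAgent({
  observe: () => '[2]<button>Log in</button>',
  think: async () => script[turn++],
  tools: { click_element_by_index: async ({ index }) => clicks.push(index) },
}).then(history => console.log(history.length, clicks)); // 2 [ 2 ]
```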

A circular diagram showing three stages connected by arrows. Observe shows an eye scanning a web page. Think shows a brain with a reflection speech bubble. Act shows a hand clicking a button. The arrows form a continuous loop.
Observe, think, act, repeat. The ReAct loop that powers every PageAgent task.

A Clean Monorepo Architecture

The codebase is a well-structured monorepo with eight packages. Understanding the boundaries matters because they reveal the project's real ambitions.

page-agent is the main npm package. It extends the headless core with a built-in UI panel and ships as both an ESM module and an IIFE bundle for CDN usage.

@page-agent/core contains PageAgentCore, the headless agent class. It owns the ReAct loop, tool definitions, system prompt, and history management. No UI dependency.

@page-agent/llms wraps LLM communication. It uses the OpenAI-compatible API format, handles retry logic, tracks token usage, and implements the reflection-before-action pattern. Any provider that speaks the OpenAI protocol works.

@page-agent/page-controller is the DOM engine. It extracts the live DOM into a FlatDomTree, dehydrates it into simplified text, and exposes async methods like clickElement(), inputText(), and scroll(). An optional SimulatorMask provides visual feedback by highlighting elements as the agent interacts with them.

@page-agent/ui renders the chat panel with i18n support. It is decoupled from the agent via a PanelAgentAdapter interface, which means you could swap in a completely different UI without touching the core logic.

Two more packages round out the system: a Chrome extension for multi-page workflows across tabs and an MCP server for integrating with external tools via the Model Context Protocol.

The DOM Pipeline in Detail

The DOM extraction pipeline, adapted from the browser-use project, is arguably the most technically interesting piece. It works in four stages.

First, extraction. The PageController walks the live DOM and builds a FlatDomTree. This is a flat array representation of the DOM hierarchy that preserves parent-child relationships through index references rather than nesting.

Second, dehydration. The tree is stripped down to only interactive elements: inputs, buttons, links, selects, and anything else a user might click or type into. Each gets a numeric index like [14]. Non-interactive wrapper divs, styling spans, and decorative elements are removed. The output is a compact text string.

Third, LLM processing. The dehydrated DOM text goes into the LLM prompt alongside the system prompt, task description, and agent history. The LLM returns a structured response with both a reflection and an action.

Fourth, execution. The PageController resolves the numeric index back to the original DOM node and fires the appropriate browser event: a real click, a keyboard input, a scroll. The SimulatorMask overlays a visual highlight so the user can see exactly what the agent is doing.

This pipeline is why PageAgent does not need multi-modal models. The DOM already contains all the semantic information a language model needs. Converting it to text is cheaper than converting a screenshot to tokens.
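The pipeline can be sketched end to end on mock data. Field names such as parentIndex and interactive are illustrative, not PageAgent's actual FlatDomTree schema:

```javascript
// Illustrative pipeline sketch on mock data (not PageAgent's real schema).

// Stage 1: flatten a nested DOM-like tree into a flat array; children point
// to their parents by index instead of nesting.
function flatten(node, out = [], parentIndex = -1) {
  const index = out.length;
  out.push({ index, parentIndex, tag: node.tag, interactive: !!node.interactive });
  for (const child of node.children ?? []) flatten(child, out, index);
  return out;
}

// Stage 2: keep only interactive nodes, re-indexed for the LLM.
function keepInteractive(flat) {
  return flat.filter(n => n.interactive)
             .map((n, llmIndex) => ({ llmIndex, node: n }));
}

// Stage 3 (the LLM call) is omitted; assume the model returned index 1.

// Stage 4: resolve the LLM's numeric index back to the original node.
function resolve(llmIndex, interactive) {
  return interactive[llmIndex].node;
}

const page = {
  tag: 'div',
  children: [
    { tag: 'span' }, // decorative, dropped in stage 2
    { tag: 'form', children: [
      { tag: 'input', interactive: true },
      { tag: 'button', interactive: true },
    ]},
  ],
};

const flat = flatten(page);
const interactive = keepInteractive(flat);
console.log(flat.length);                 // 5 nodes in the flat tree
console.log(interactive.length);          // 2 interactive elements survive
console.log(resolve(1, interactive).tag); // button
```

In the real library, stage 4 ends with the PageController dispatching an actual browser event on the resolved node rather than just returning it.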

A horizontal assembly line showing four stations. Raw DOM enters from the left as a tangled tree. It passes through a stripping machine that removes non-interactive elements. Clean indexed text enters an LLM brain. An action instruction exits and a hand executes it on the original page.
From tangled DOM to indexed text to LLM reasoning to real browser actions. The full pipeline.

Human in the Loop by Default

PageAgent ships with a chat panel that makes agent actions visible. The user types a natural language command, watches the agent think through each step, sees elements highlight as they are clicked, and gets a final summary when the task completes.

The ask_user tool lets the agent pause and request clarification. If the agent is unsure which dropdown option to select, it can ask. This is not just a safety feature. It makes the agent dramatically more reliable for ambiguous tasks.

The SimulatorMask overlay blocks direct user interaction during automation, preventing conflicts between human and agent clicks. Once the task finishes, control returns to the user seamlessly.

"Pretty UI with human-in-the-loop. Traceability and predictability is more important than success rate."

-- PageAgent AGENTS.md design principle

Use Cases That Make Sense

SaaS AI Copilot. The most obvious use case. If you build a CRM, ERP, or admin panel, you can add PageAgent and immediately give users natural language control over your interface. No backend rewrite. No new API endpoints. The agent operates the existing UI.

Smart Form Filling. Enterprise software is full of 20-field forms across multiple tabs. "Fill in the purchase order for Acme Corp with standard terms" becomes a single sentence instead of minutes of clicking. Perfect for repetitive data entry in back-office systems.

Accessibility. Natural language control turns any web app into an accessible interface. Users who struggle with complex UIs can describe what they want in plain words. Voice command integration becomes trivial when the agent already understands natural language.

Multi-page Workflows. With the optional Chrome extension, the agent can work across browser tabs. Open a spreadsheet in one tab, extract data, switch to a CRM in another tab, and fill in the records. This is where the extension package earns its keep.

Four vignettes arranged in a grid. Top-left shows a SaaS dashboard with a chat bubble. Top-right shows a long form being auto-filled. Bottom-left shows a person speaking to a web page through a speech bubble. Bottom-right shows two browser tabs connected by a bridge with data flowing between them.
Four deployment patterns for PageAgent. The SaaS copilot use case is the killer app.

The Competitive Landscape

PageAgent occupies a unique position. It is not competing directly with Browser Use, Stagehand, or Playwright. Those tools run server-side or as test frameworks. PageAgent runs client-side, inside the page it controls.

| Aspect          | PageAgent                | Browser Use          | Stagehand v3             | Playwright               |
|-----------------|--------------------------|----------------------|--------------------------|--------------------------|
| Runtime         | Client-side (in-page JS) | Server-side (Python) | Server-side (TypeScript) | Server-side (multi-lang) |
| DOM Access      | Text-based extraction    | Screenshot + vision  | CDP + AI helpers         | Deterministic selectors  |
| LLM Requirement | Any text LLM             | Multi-modal preferred| Text or multi-modal      | None (no AI)             |
| Server Needed   | No                       | Yes                  | Yes                      | Yes                      |
| Integration     | One script tag           | pip install + config | npm install + config     | npm install + test suite |
| Primary Use     | In-product AI copilot    | Agent automation     | AI-assisted testing      | E2E testing              |
| GitHub Stars    | 10.7k (6 months)         | 50k+ (18 months)     | 15k+                     | 70k+ (4 years)           |

The closest conceptual competitor is not a tool but a pattern: the "AI copilot" features shipping inside products like Notion, Linear, and Figma. Those are custom-built for each product. PageAgent offers a generic version that any web app can adopt without writing custom AI integration code.

Lineage and Credit

The project explicitly acknowledges its debt to browser-use. The DOM processing components and prompt design are adapted from that project, which pioneered many of the patterns PageAgent uses.

The key difference is deployment target. Browser-use runs server-side with a headless browser. PageAgent took those same DOM extraction techniques and moved them into the client. It is a case of an idea being transplanted into a fundamentally different environment where it gains new properties: zero infrastructure, direct DOM access, and instant deployment via CDN.

The creator, Simon (gaomeng1900) at Alibaba, has been actively developing the project since September 2025. The Hacker News launch in early March 2026 generated 77 points and 37 comments, with discussion focusing on security implications of in-browser agents and the practicality of text-based DOM manipulation.

A family tree showing browser-use at the top as a server-side ancestor. An arrow labeled adapted points down to PageAgent which sits inside a browser window shape. The browser-use side has server racks and Python logos while the PageAgent side has only a browser tab and a script tag.
From server-side browser-use to client-side PageAgent. Same DOM patterns, fundamentally different deployment.

What to Watch

Several features are marked as TODO in the codebase: send_keys for keyboard shortcuts, upload_file for file inputs, go_back for navigation, and extract_structured_data for pulling information from tables. These additions will expand what the agent can handle.

The MCP server package signals ambition beyond simple page automation. Model Context Protocol integration would let external agents delegate UI tasks to PageAgent, turning it into a browser-action service within a larger agent ecosystem.

Security is the elephant in the room. Running an AI agent with JavaScript execution capabilities inside a user's browser creates real risks. The execute_javascript tool, while powerful, could be exploited through prompt injection if the agent processes untrusted content. The project's own AGENTS.md states that "traceability and predictability is more important than success rate," which is the right principle, but the implementation will need to harden significantly for enterprise adoption.

The Chrome extension for multi-page workflows is still marked as work-in-progress. When it matures, it will bridge the gap between PageAgent's single-page strength and the cross-site automation that server-side tools handle today.

A road stretching into the distance. Milestones along the road are labeled with upcoming features. The road starts from a browser window in the foreground and leads toward a distant horizon where multiple browser tabs are connected together.
The roadmap runs from single-page copilot toward a full multi-page agent platform.

Bottom Line

PageAgent bets that the future of AI web automation is not another heavyweight server-side framework. It bets that the DOM itself, converted to indexed text, is all an LLM needs to control an interface. And it bets that the fastest path to shipping an AI copilot is a single script tag, not a Python backend.

At 10,700 stars in six months with an 800-line core that cleanly separates concerns, the bet is working. If you build a web application and want to add natural language control, PageAgent is currently the shortest path from zero to working copilot.

The question is whether text-based DOM manipulation scales to truly complex interfaces, or whether the screenshot-based approaches will prove necessary for visual reasoning. For now, the answer seems to be that text works surprisingly well for the 80% of tasks that involve forms, buttons, and standard controls. The other 20% remains an open challenge.