Control any web interface with natural language — no browser extension, no headless browser, just a JavaScript script tag.
Page Agent is a TypeScript library that turns natural language instructions into live DOM interactions directly inside your running web page. Instead of spinning up a separate browser automation process, you embed a lightweight script and the agent reasons over the live DOM to click buttons, fill forms, scroll, and navigate — all driven by an LLM of your choosing.
The library is structured as a monorepo of focused packages: a core agent loop, a headless DOM controller, an OpenAI-compatible LLM client, a React UI panel, an MCP server, and a Chrome extension for multi-tab workflows. You can use just the headless core or the full-featured bundle depending on your integration needs.
Page Agent is designed explicitly for client-side web enhancement rather than server-side automation. This means it runs in the user’s own browser context, inheriting their session cookies, authentication state, and real-time page updates — which is particularly powerful for building AI copilots in SaaS products, automating ERP and CRM workflows, or making complex admin interfaces accessible through voice or typed commands.
The project is MIT-licensed, actively maintained by Alibaba and community contributors, and supports any OpenAI-compatible LLM backend including Alibaba Cloud DashScope, local models, and commercial APIs. Integration requires nothing more than an npm install or a CDN script tag.
Architecture
Page Agent is a layered, event-driven monorepo where each concern lives in a dedicated package with no upward dependencies. At the centre sits the ReAct agent loop in @page-agent/core: each step calls observe (DOM snapshot), think (LLM invocation via a single structured MacroTool), and act (tool execution), with the loop terminating on a done action or step-count overflow. The PageController package is completely decoupled from the LLM — it owns all DOM mutation and observation, emitting beforeUpdate and afterUpdate lifecycle events so any consumer can react to DOM changes. The UI panel, Chrome extension, and MCP server each depend on core and page-controller but never on each other, making it possible to use any subset of the stack without pulling in unused dependencies.
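The observe, think, act cycle described above can be sketched in a few lines. This is a minimal illustration, not the code in @page-agent/core: the `Controller`, `Think`, `Action`, and `runLoop` names are hypothetical, and the DOM snapshot is reduced to a plain string.

```typescript
// Hypothetical types for illustration; none of these names come from the Page Agent source.
type Action =
  | { kind: "click"; index: number }
  | { kind: "done"; success: boolean; text: string };

interface Controller {
  observe(): Promise<string>; // returns a textual DOM snapshot
  act(action: Action): Promise<string>; // executes the action, returns a short result note
}

// Stand-in for the LLM call: maps the snapshot plus history to the next action.
type Think = (snapshot: string, history: string[]) => Promise<Action>;

interface RunResult {
  success: boolean;
  steps: number;
  history: string[];
}

async function runLoop(
  controller: Controller,
  think: Think,
  maxSteps = 10,
): Promise<RunResult> {
  const history: string[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const snapshot = await controller.observe(); // observe
    const action = await think(snapshot, history); // think
    if (action.kind === "done") {
      // A done action terminates the loop.
      history.push(`done: ${action.text}`);
      return { success: action.success, steps: step, history };
    }
    history.push(await controller.act(action)); // act
  }
  // Step-count overflow: give up and report failure.
  return { success: false, steps: maxSteps, history };
}
```

Because the controller and the think function are passed in, the loop itself never touches the DOM or the network, which mirrors the decoupling between the agent loop and PageController described above.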
Tech Stack
The implementation is TypeScript 6 throughout, built with Vite 8 and distributed as both ESM and IIFE bundles. LLM communication goes through a hand-rolled OpenAI-compatible client (no SDK dependency) that converts Zod v4 schemas to OpenAI tool definitions and handles the full response-validation lifecycle. DOM analysis uses a custom flat-tree representation of the live document rather than screenshots, making it compatible with any text-capable LLM. The Chrome extension is built with WXT and React, the UI panel uses Tailwind CSS v4, and runtime validation throughout uses Zod v4 schemas. Vitest powers the test suite across packages, and Husky with commitlint enforces conventional commits on every push.
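To make the flat-tree idea concrete, here is a rough sketch of flattening a nested element tree into an indexed text listing that a text-only LLM could read. It is an assumption about the general technique, not Page Agent's actual format: `SketchNode`, `FlatEntry`, `flatten`, and `render` are hypothetical, and a plain object tree stands in for the live DOM.

```typescript
// Hypothetical stand-in for a DOM element; the real library walks the live document.
interface SketchNode {
  tag: string;
  text?: string;
  interactive?: boolean; // assumed flag: true for elements the agent may act on
  children?: SketchNode[];
}

interface FlatEntry {
  index: number; // stable handle the model can refer to in an action
  tag: string;
  text: string;
  depth: number;
}

// Depth-first walk that keeps only interactive nodes, numbered in document order.
function flatten(root: SketchNode): FlatEntry[] {
  const out: FlatEntry[] = [];
  const visit = (node: SketchNode, depth: number): void => {
    if (node.interactive) {
      out.push({ index: out.length, tag: node.tag, text: node.text ?? "", depth });
    }
    for (const child of node.children ?? []) visit(child, depth + 1);
  };
  visit(root, 0);
  return out;
}

// One line per entry, e.g. "[0]<input>Email</input>".
function render(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}]<${e.tag}>${e.text}</${e.tag}>`).join("\n");
}
```

The numeric index is what lets an action such as "click element 1" be resolved back to a concrete element without the model ever seeing pixels.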
Code Quality
The codebase has meaningful test coverage across the three core packages — the agent loop, the DOM controller, and the LLM client each have dedicated test files using Vitest with mocked fetch and vi.fn()-based PageController stubs. Error handling is typed and explicit: the LLM client defines a closed InvokeErrorTypes enum (auth, rate-limit, server, context-length, content-filter, tool-execution, etc.) and every error path throws a structured InvokeError with a raw response attached. AbortSignal is threaded from the top-level execute() call through the LLM fetch and into every tool context, enabling cooperative cancellation at any point. TypeScript strict mode is on, ESLint and Prettier are enforced via lint-staged, and CI runs typecheck, lint, and tests.
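The combination of a closed error taxonomy and cooperative cancellation might look roughly like the sketch below. The names `ErrorKind`, `LlmInvokeError`, `classifyStatus`, and `invoke` are hypothetical, and the HTTP status mapping is an assumption; the library's own InvokeErrorTypes and InvokeError may be shaped differently.

```typescript
// Illustrative only: member names and HTTP mappings are assumptions.
const ErrorKind = {
  Auth: "auth",
  RateLimit: "rate-limit",
  Server: "server",
  Unknown: "unknown",
} as const;
type ErrorKind = (typeof ErrorKind)[keyof typeof ErrorKind];

class LlmInvokeError extends Error {
  readonly kind: ErrorKind;
  readonly raw: unknown; // raw response body kept for debugging
  constructor(kind: ErrorKind, message: string, raw?: unknown) {
    super(message);
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "LlmInvokeError";
    this.kind = kind;
    this.raw = raw;
  }
}

function classifyStatus(status: number): ErrorKind {
  if (status === 401 || status === 403) return ErrorKind.Auth;
  if (status === 429) return ErrorKind.RateLimit;
  if (status >= 500) return ErrorKind.Server;
  return ErrorKind.Unknown;
}

// The fetch function is injected so the sketch can run without a network.
type FetchLike = (
  url: string,
  init: { signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function invoke(url: string, fetchFn: FetchLike, signal?: AbortSignal): Promise<string> {
  // Cooperative cancellation: bail out before doing any work if already aborted.
  if (signal?.aborted) throw new Error("aborted before request");
  const res = await fetchFn(url, { signal }); // the same signal is handed to fetch
  const body = await res.text();
  if (!res.ok) {
    throw new LlmInvokeError(classifyStatus(res.status), `HTTP ${res.status}`, body);
  }
  return body;
}
```

A caller can then branch on `error.kind` (for example, retrying only on `rate-limit` or `server`) instead of parsing message strings.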
What Makes It Unique
Unlike Playwright, Puppeteer, or browser-use (the project it builds on), Page Agent operates entirely inside the already-running page context using the live DOM — no separate process, no CDP connection, no screenshot pipeline. The MacroTool pattern is an architecturally distinctive choice: all available agent tools are merged into a single structured JSON schema at runtime, forcing the LLM to produce reflection fields (evaluation, memory, next goal) alongside its chosen action in one atomic tool call. This reduces round-trips and ensures the agent’s reasoning is captured in structured history that both the UI and lifecycle hooks can consume. The optional llms.txt integration and per-URL instruction system let page owners declare agent-friendly hints in a standardised format, enabling a kind of cooperative automation where the web app can guide the agent’s behaviour.
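As a rough illustration of the MacroTool idea, the sketch below merges several tool definitions into one JSON Schema whose top level also requires the reflection fields. It is not the library's implementation: the source says the real schemas are built with Zod v4, whereas this version uses plain JSON Schema objects to stay dependency-free, and `ToolSpec`, `buildMacroTool`, `pickAction`, and the exact field names are hypothetical.

```typescript
// Hypothetical shapes; plain JSON Schema is used here instead of Zod.
type JsonSchema = Record<string, unknown>;

interface ToolSpec {
  name: string;
  description: string;
  parameters: JsonSchema; // JSON Schema for this tool's input
}

function buildMacroTool(tools: ToolSpec[]): JsonSchema {
  // Each tool becomes one branch: an object with exactly one key, the tool name.
  const actionBranches = tools.map((t) => ({
    type: "object",
    description: t.description,
    properties: { [t.name]: t.parameters },
    required: [t.name],
    additionalProperties: false,
  }));
  return {
    type: "object",
    properties: {
      // Reflection fields the model must fill in alongside its action.
      evaluation: { type: "string", description: "Did the previous step succeed?" },
      memory: { type: "string", description: "Facts worth carrying forward" },
      next_goal: { type: "string", description: "What to attempt next" },
      action: { anyOf: actionBranches },
    },
    required: ["evaluation", "memory", "next_goal", "action"],
    additionalProperties: false,
  };
}

// Resolve the single tool named in a model reply.
function pickAction(
  reply: { action: Record<string, unknown> },
  tools: ToolSpec[],
): { tool: string; input: unknown } {
  const keys = Object.keys(reply.action);
  const tool = keys.length === 1 ? keys[0] : undefined;
  if (tool === undefined) throw new Error("expected exactly one action");
  if (!tools.some((t) => t.name === tool)) throw new Error(`unknown tool: ${tool}`);
  return { tool, input: reply.action[tool] };
}
```

Since the reflection fields and the action arrive in the same tool call, one reply carries both the reasoning record and the next step, which is the round-trip saving the paragraph above describes.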
Page Agent is released under the MIT License, which is one of the most permissive open-source licences available. You may use it commercially, modify it freely, distribute it, and embed it in proprietary products without any copyleft obligation to open-source your own code. The only requirement is that the copyright notice and licence text are preserved. There are no enterprise tiers, commercial licences, or usage restrictions in the source code — the repository contains no ee/, pro/, or enterprise/ directories, and there are no runtime licence checks.
Running Page Agent yourself means shipping a JavaScript bundle to your users’ browsers, not operating a server. The operational burden is therefore minimal compared to most self-hosted tools: there is no database to maintain, no background process to keep alive, and no infrastructure to provision beyond a CDN or npm registry. However, you are responsible for securing your LLM API key — the library accepts it as a client-side config option, so you will need a backend proxy or a scoped, rate-limited key strategy to avoid exposing credentials in browser bundles. Keeping pace with upstream releases is straightforward given the project’s active release cadence (roughly weekly), but breaking changes between minor versions should be expected as the API matures.
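One way to keep the key out of the browser is a thin backend proxy that receives the chat request, validates it, and attaches the credential server-side. The sketch below shows only the request-rewriting step as a pure function. Everything in it is an assumption for illustration: `ProxyConfig`, `buildUpstreamRequest`, the model allow-list, and the placeholder upstream URL are not part of Page Agent.

```typescript
// Hypothetical proxy helper; not part of Page Agent.
interface ProxyConfig {
  upstreamUrl: string; // OpenAI-compatible endpoint, placeholder value in the example
  apiKey: string; // lives only on the server
  allowedModels: string[]; // simple guard against arbitrary model use
}

interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Takes only the client's body: client headers are never forwarded, so a
// client-sent Authorization header cannot reach the upstream provider.
function buildUpstreamRequest(clientBody: string, config: ProxyConfig): UpstreamRequest {
  let parsed: unknown;
  try {
    parsed = JSON.parse(clientBody);
  } catch {
    throw new Error("body must be JSON");
  }
  const model =
    typeof parsed === "object" && parsed !== null
      ? (parsed as { model?: unknown }).model
      : undefined;
  if (typeof model !== "string" || !config.allowedModels.includes(model)) {
    throw new Error("model not allowed");
  }
  return {
    url: config.upstreamUrl,
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${config.apiKey}`, // key injected server-side
    },
    body: JSON.stringify(parsed),
  };
}
```

The browser-side agent would then be pointed at the proxy's URL instead of the provider's, with no real key in its config. A production proxy would also need user authentication and rate limiting, which are omitted here.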
There is no official hosted or managed version of Page Agent itself — the project does not offer a SaaS tier, cloud dashboard, or support SLA. What you give up compared to commercial alternatives (such as Browser Use Cloud, Lindy, or similar AI automation platforms) is managed uptime, guaranteed model routing, enterprise support contracts, and pre-built connectors. You gain full control over which LLM provider you use, zero data leaving your infrastructure if you run a local model, and the ability to fork and customise the agent loop, tools, and DOM parser to fit your specific application’s interaction patterns.