page-agent

Control any web interface with natural language — no browser extension, no headless browser, just a JavaScript script tag.

23.3K stars

2K forks

MIT License

TypeScript

Page Agent is a TypeScript library that turns natural language instructions into live DOM interactions directly inside your running web page. Instead of spinning up a separate browser automation process, you embed a lightweight script and the agent reasons over the live DOM to click buttons, fill forms, scroll, and navigate — all driven by an LLM of your choosing.

The library is structured as a monorepo of focused packages: a core agent loop, a headless DOM controller, an OpenAI-compatible LLM client, a React UI panel, an MCP server, and a Chrome extension for multi-tab workflows. You can use just the headless core or the full-featured bundle depending on your integration needs.

Page Agent is designed explicitly for client-side web enhancement rather than server-side automation. This means it runs in the user’s own browser context, inheriting their session cookies, authentication state, and real-time page updates — which is particularly powerful for building AI copilots in SaaS products, automating ERP and CRM workflows, or making complex admin interfaces accessible through voice or typed commands.

The project is MIT-licensed, actively maintained by Alibaba and community contributors, and supports any OpenAI-compatible LLM backend including Alibaba Cloud DashScope, local models, and commercial APIs. Integration requires nothing more than an npm install or a CDN script tag.

What You Get

A TypeScript-first ReAct agent loop (observe → think → act) with configurable max steps, step delay, and lifecycle hooks (onBeforeTask, onAfterTask, onBeforeStep, onAfterStep)
A DOM controller that builds a flattened, indexed tree of interactive elements and serialises it as simplified HTML for LLM consumption — no screenshots, no multimodal models required
An OpenAI-compatible LLM client with typed error categories (auth, rate-limit, server, context-length, content-filter), automatic retry with exponential back-off, and tool-call normalisation
A built-in React UI panel (floating widget) you can drop into any SPA with zero configuration, showing real-time agent thinking, execution steps, and history
An MCP server package that exposes Page Agent’s browser-control capabilities to external agent clients (e.g. Claude Desktop, Cursor)
A Chrome extension for cross-tab multi-page automation, enabling the agent to coordinate actions across different browser tabs via a remote PageController protocol
Extensible tool system: add custom tools or null-out built-in ones (click, input, scroll, select, wait, ask_user, execute_javascript) through the customTools config map
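
The last item above describes adding custom tools and nulling out built-in ones through a config map. The sketch below shows one way such a merge could behave. The `Tool` shape, `mergeTools`, and the `lookup_order` tool are hypothetical names used for illustration, not the library's actual types or implementation.

```typescript
// Hypothetical tool shape; the library's real tool type may differ.
type Tool = { description: string; run: (input: string) => string };

// A null entry removes a built-in tool; a Tool entry adds or overrides one.
type ToolOverrides = Record<string, Tool | null>;

function mergeTools(
  builtIn: Record<string, Tool>,
  overrides: ToolOverrides,
): Record<string, Tool> {
  const merged: Record<string, Tool> = { ...builtIn };
  for (const [name, tool] of Object.entries(overrides)) {
    if (tool === null) {
      delete merged[name]; // nulled-out built-in
    } else {
      merged[name] = tool; // custom or replacement tool
    }
  }
  return merged;
}

// Illustrative built-ins, named after two of the tools listed above.
const builtInTools: Record<string, Tool> = {
  click: { description: "Click an element by index", run: (i) => `clicked ${i}` },
  execute_javascript: { description: "Run a script", run: (src) => `ran ${src}` },
};

const activeTools = mergeTools(builtInTools, {
  execute_javascript: null, // disabled, for example on security grounds
  lookup_order: { description: "Look up an order", run: (id) => `order ${id}` },
});

console.log(Object.keys(activeTools)); // [ 'click', 'lookup_order' ]
```

Treating null as "remove" keeps the config map declarative: a single object states both what is added and what is taken away.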

Common Use Cases

SaaS AI copilot — embed Page Agent in a B2B product so end-users can describe what they want in plain English and the agent navigates the UI on their behalf, reducing onboarding friction
ERP and CRM automation — replace 20-click workflows with a single sentence; the agent fills multi-step forms, selects dropdowns, and submits data in complex enterprise admin interfaces
Accessibility layer — make any legacy web app voice-navigable or screen-reader-friendly by accepting natural language commands and translating them to precise DOM interactions
QA and exploratory testing — drive user journeys in a real browser session with live auth state, avoiding the credential-management overhead of headless test setups
Multi-tab research workflows — use the Chrome extension to let an agent open, read, and interact with multiple pages autonomously, aggregating results back to the originating page
Developer tooling via MCP — connect an AI coding assistant to Page Agent’s MCP server to control the browser directly from the editor or agent pipeline

Under The Hood

Architecture

Page Agent is a layered, event-driven monorepo where each concern lives in a dedicated package with no upward dependencies. At the centre sits the ReAct agent loop in @page-agent/core: each step calls observe (DOM snapshot), think (LLM invocation via a single structured MacroTool), and act (tool execution), with the loop terminating on a done action or step-count overflow. The PageController package is completely decoupled from the LLM: it owns all DOM mutation and observation, emitting beforeUpdate and afterUpdate lifecycle events so any consumer can react to DOM changes. The UI panel, Chrome extension, and MCP server each depend on core and page-controller but never on each other, making it possible to use any subset of the stack without pulling in unused dependencies.
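
The observe, think, act cycle and its two termination conditions can be sketched in a few lines. This is a minimal illustration under stated assumptions: `LoopDeps`, `Decision`, and `runLoop` are hypothetical names, and the real loop in @page-agent/core will differ in detail.

```typescript
// All names here are illustrative, not the library's actual API.
type Action = { tool: string; input: string };
type Decision = { nextGoal: string; action: Action };

interface LoopDeps {
  observe: () => string; // DOM snapshot as text
  think: (snapshot: string, history: string[]) => Promise<Decision>; // LLM call
  act: (action: Action) => Promise<string>; // tool execution
}

interface LoopResult {
  success: boolean;
  steps: number;
  history: string[];
}

async function runLoop(deps: LoopDeps, maxSteps: number): Promise<LoopResult> {
  const history: string[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const snapshot = deps.observe();
    const decision = await deps.think(snapshot, history);
    if (decision.action.tool === "done") {
      // The model signalled completion.
      return { success: true, steps: step, history };
    }
    const outcome = await deps.act(decision.action);
    history.push(`${decision.action.tool}: ${outcome}`);
  }
  // Step budget exhausted without a done action.
  return { success: false, steps: maxSteps, history };
}
```

Passing observe, think, and act in as dependencies mirrors the decoupling described above: the loop knows nothing about how the DOM is read or which LLM is called.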

Tech Stack

The implementation is TypeScript 6 throughout, built with Vite 8 and distributed as both ESM and IIFE bundles. LLM communication goes through a hand-rolled OpenAI-compatible client (no SDK dependency) that converts Zod v4 schemas to OpenAI tool definitions and handles the full response-validation lifecycle. DOM analysis uses a custom flat-tree representation of the live document rather than screenshots, making it compatible with any text-capable LLM. The Chrome extension is built with WXT and React, the UI panel uses Tailwind CSS v4, and runtime validation throughout uses Zod v4 schemas. Vitest powers the test suite across packages, and Husky with commitlint enforces conventional commits on every push.
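
To make the flat-tree idea concrete, the sketch below walks a small tree, collects interactive elements into an indexed list, and serialises them as simplified HTML. `SimpleNode` is a stand-in for a DOM node so the example runs outside a browser, and the `[index]<tag>` output format is an assumption for illustration, not the library's actual serialisation.

```typescript
// Minimal stand-in for a DOM node so the sketch runs outside a browser.
interface SimpleNode {
  tag: string;
  text?: string;
  interactive?: boolean;
  children?: SimpleNode[];
}

interface IndexedElement {
  index: number;
  tag: string;
  text: string;
}

// Depth-first walk that collects interactive elements into a flat, indexed list.
function flatten(root: SimpleNode): IndexedElement[] {
  const out: IndexedElement[] = [];
  const visit = (node: SimpleNode): void => {
    if (node.interactive) {
      out.push({ index: out.length, tag: node.tag, text: node.text ?? "" });
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return out;
}

// Serialise as simplified HTML, one element per line, prefixed with its index.
function serialise(elements: IndexedElement[]): string {
  return elements.map((e) => `[${e.index}]<${e.tag}>${e.text}</${e.tag}>`).join("\n");
}

const page: SimpleNode = {
  tag: "form",
  children: [
    { tag: "label", text: "Email" },
    { tag: "input", text: "", interactive: true },
    { tag: "div", children: [{ tag: "button", text: "Submit", interactive: true }] },
  ],
};

console.log(serialise(flatten(page)));
// [0]<input></input>
// [1]<button>Submit</button>
```

The index gives the LLM a short, stable handle for each element, so an action such as "click 1" can be resolved back to a concrete node without the model ever seeing a selector or a screenshot.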

Code Quality

The codebase has meaningful test coverage across the three core packages: the agent loop, the DOM controller, and the LLM client each have dedicated test files using Vitest with mocked fetch and vi.fn()-based PageController stubs. Error handling is typed and explicit: the LLM client defines a closed InvokeErrorTypes enum (auth, rate-limit, server, context-length, content-filter, tool-execution, etc.) and every error path throws a structured InvokeError with a raw response attached. AbortSignal is threaded from the top-level execute() call through the LLM fetch and into every tool context, enabling cooperative cancellation at any point. TypeScript strict mode is on, ESLint and Prettier are enforced via lint-staged, and CI runs typecheck, lint, and tests.
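
Typed error categories pair naturally with retry logic and cooperative cancellation. The sketch below is a simplified illustration, not the library's InvokeError or its client: `CategorisedError`, `withRetry`, and the choice of which categories count as retryable are all assumptions. The `signal` option is typed structurally so that a standard AbortSignal can be passed in.

```typescript
// Illustrative error categories; the library's actual enum may differ.
type ErrorKind = "auth" | "rate-limit" | "server" | "context-length";

class CategorisedError extends Error {
  readonly kind: ErrorKind;
  constructor(kind: ErrorKind, message: string) {
    super(message);
    this.kind = kind;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, CategorisedError.prototype);
  }
}

// Assumption: only transient categories are worth retrying.
const RETRYABLE: ReadonlySet<ErrorKind> = new Set<ErrorKind>(["rate-limit", "server"]);

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  signal?: { readonly aborted: boolean }; // an AbortSignal satisfies this shape
}

async function withRetry<T>(call: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    // Cooperative cancellation: check before every attempt.
    if (opts.signal?.aborted) throw new Error("aborted");
    try {
      return await call();
    } catch (err) {
      const retryable = err instanceof CategorisedError && RETRYABLE.has(err.kind);
      if (!retryable || attempt >= opts.maxAttempts) throw err;
      // Exponential back-off: base, 2x base, 4x base, ...
      await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

A closed set of categories is what makes the retry decision a simple set lookup: an auth failure is rethrown immediately, while a rate-limit or server error waits and tries again.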

What Makes It Unique

Unlike Playwright, Puppeteer, or browser-use (the project it builds on), Page Agent operates entirely inside the already-running page context using the live DOM, with no separate process, no CDP connection, and no screenshot pipeline. The MacroTool pattern is an architecturally distinctive choice: all available agent tools are merged into a single structured JSON schema at runtime, forcing the LLM to produce reflection fields (evaluation, memory, next goal) alongside its chosen action in one atomic tool call. This reduces round-trips and ensures the agent’s reasoning is captured in structured history that both the UI and lifecycle hooks can consume. The optional llms.txt integration and per-URL instruction system let page owners declare agent-friendly hints in a standardised format, enabling a kind of cooperative automation where the web app can guide the agent’s behaviour.
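
One way to picture the MacroTool pattern is as a function that folds every tool's input schema into a single JSON Schema with required reflection fields. The sketch below is an approximation under stated assumptions: the field names (`evaluation`, `memory`, `next_goal`, `action`), the `oneOf` layout, and the helper names are illustrative and may not match the library's actual schema.

```typescript
type JsonSchema = Record<string, unknown>;

interface ToolSpec {
  name: string;
  inputSchema: JsonSchema;
}

// Build one schema whose `action` field accepts exactly one tool call,
// alongside reflection fields the model must always fill in.
function buildMacroSchema(tools: ToolSpec[]): JsonSchema {
  return {
    type: "object",
    properties: {
      evaluation: { type: "string" },
      memory: { type: "string" },
      next_goal: { type: "string" },
      action: {
        oneOf: tools.map((t) => ({
          type: "object",
          properties: { [t.name]: t.inputSchema },
          required: [t.name],
          additionalProperties: false,
        })),
      },
    },
    required: ["evaluation", "memory", "next_goal", "action"],
  };
}

// Work out which tool a model response selected.
function selectedTool(response: { action: Record<string, unknown> }): string {
  const names = Object.keys(response.action);
  const [name] = names;
  if (name === undefined || names.length !== 1) {
    throw new Error("expected exactly one action");
  }
  return name;
}

const macroSchema = buildMacroSchema([
  { name: "click", inputSchema: { type: "object", properties: { index: { type: "number" } } } },
  { name: "done", inputSchema: { type: "object", properties: { text: { type: "string" } } } },
]);

console.log(JSON.stringify(macroSchema, null, 2));
```

Because the reflection fields sit in the same object as the action, a single tool call yields both the reasoning record and the next step, which is what allows one round-trip per loop iteration.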

Self-Hosting

Page Agent is released under the MIT License, which is one of the most permissive open-source licences available. You may use it commercially, modify it freely, distribute it, and embed it in proprietary products without any copyleft obligation to open-source your own code. The only requirement is that the copyright notice and licence text are preserved. There are no enterprise tiers, commercial licences, or usage restrictions in the source code — the repository contains no ee/, pro/, or enterprise/ directories, and there are no runtime licence checks.

Running Page Agent yourself means shipping a JavaScript bundle to your users’ browsers, not operating a server. The operational burden is therefore minimal compared to most self-hosted tools: there is no database to maintain, no background process to keep alive, and no infrastructure to provision beyond a CDN or npm registry. However, you are responsible for securing your LLM API key — the library accepts it as a client-side config option, so you will need a backend proxy or a scoped, rate-limited key strategy to avoid exposing credentials in browser bundles. Keeping pace with upstream releases is straightforward given the project’s active release cadence (roughly weekly), but breaking changes between minor versions should be expected as the API matures.
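
The backend-proxy strategy mentioned above amounts to keeping the key on the server and attaching it only after the browser's request has been checked. The sketch below shows that core step as a pure function. The origin, model name, and upstream URL are placeholders, and `toUpstream` is a hypothetical helper, not part of Page Agent.

```typescript
// Shape of what the browser sends to your proxy (no API key included).
interface ProxyRequest {
  origin: string;
  body: { model: string; messages: unknown[] };
}

// Shape of what the proxy forwards to the LLM provider.
interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Placeholder allow-lists; replace with your own values.
const ALLOWED_ORIGINS = new Set(["https://app.example.com"]);
const ALLOWED_MODELS = new Set(["example-model"]);

// Validate the browser request, then attach the server-side key.
function toUpstream(req: ProxyRequest, apiKey: string): UpstreamRequest {
  if (!ALLOWED_ORIGINS.has(req.origin)) throw new Error("origin not allowed");
  if (!ALLOWED_MODELS.has(req.body.model)) throw new Error("model not allowed");
  return {
    url: "https://llm.example.com/v1/chat/completions", // placeholder endpoint
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // the key never reaches the browser
    },
    body: JSON.stringify(req.body),
  };
}
```

In a real deployment this function would sit inside an HTTP handler that also applies per-user rate limits, and the browser-side agent would point its base URL at the proxy rather than at the LLM provider.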

There is no official hosted or managed version of Page Agent itself — the project does not offer a SaaS tier, cloud dashboard, or support SLA. What you give up compared to commercial alternatives (such as Browser Use Cloud, Lindy, or similar AI automation platforms) is managed uptime, guaranteed model routing, enterprise support contracts, and pre-built connectors. You gain full control over which LLM provider you use, zero data leaving your infrastructure if you run a local model, and the ability to fork and customise the agent loop, tools, and DOM parser to fit your specific application’s interaction patterns.

Repository Health

Pre-computed score based on development activity, maintenance, community, maturity, and trend momentum.

Overall: 83/100 (Excellent)

Development Activity: 96
Maintenance: 100
Community: 64
Maturity: 32
Momentum: 40

Growing community support
Very active development
Well-maintained with consistent updates
Rapidly growing project

Technical Analysis

Overall: 83/100 (Excellent)

Architecture: 88
Code Quality: 82
Innovation: 85
Learning Curve: 75

Repository Stats

Total Commits: 1,085
Repo Age: 9 months
Last Commit: 3 days ago

Built With

TypeScript: 82.5%
JavaScript: 11.8%

Recent Releases

34 total

~3.6 releases/month

Topics

agent ai ai-agents browser-automation javascript mcp typescript web
