understudy

Rating: 5 (445 reviews)

Open-source local AI agent that operates your entire computer — GUI, browser, shell, and messaging — from a single instruction, using your own models.

445 stars

33 forks

MIT License

TypeScript


Understudy is a local-first, open-source AI agent runtime for macOS that gives you a single instruction-driven interface for controlling every surface of your computer. Rather than forcing you to live inside a web app or a proprietary SaaS product, Understudy runs on your machine and dispatches tasks through GUI automation, managed Playwright browser sessions, a bash shell, semantic web search, and eight built-in messaging channel adapters — all from the same agent loop.

What distinguishes Understudy from comparable projects is its teach-by-demonstration pipeline. You show it a task once, and the system extracts the intent rather than the pixel coordinates, producing a generalized skill that survives UI redesigns, window resizing, and application switches. That skill can subsequently be invoked with plain natural language and will automatically upgrade deterministic steps (file downloads, browser navigation) to faster execution routes while keeping genuinely complex steps agentic.

For longer pipelines, Understudy introduces a three-artifact composition model. Skills are agentic subtasks that make their own decisions within quality gates. Workers are deterministic, scripted subtasks that follow a fixed sequence and emit structured output. Playbooks orchestrate workers and skills as independent child sessions, each with its own context window. A single pipeline can therefore combine scripted reliability with genuine autonomy without one stage's context spilling into another's.
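The composition model can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not Understudy's actual API: the names Stage, ChildSession, and runPlaybook are hypothetical, and the "agentic" skill is stubbed with a plain function so the example stays self-contained.

```typescript
// Hypothetical types; none of these names come from the Understudy codebase.
type StageKind = "skill" | "worker";

interface ChildSession {
  // Each stage gets its own message history, standing in for a context window.
  readonly context: string[];
}

interface Stage {
  name: string;
  kind: StageKind;
  // Receives the previous stage's structured output plus a fresh session.
  run(input: Record<string, unknown>, session: ChildSession): Record<string, unknown>;
}

interface StageResult {
  name: string;
  kind: StageKind;
  output: Record<string, unknown>;
  contextSize: number;
}

// A playbook runs stages in order, giving each one an independent session.
function runPlaybook(stages: Stage[], initial: Record<string, unknown>): StageResult[] {
  const results: StageResult[] = [];
  let input = initial;
  for (const stage of stages) {
    const session: ChildSession = { context: [] }; // fresh context per stage
    const output = stage.run(input, session);
    results.push({ name: stage.name, kind: stage.kind, output, contextSize: session.context.length });
    input = output; // only structured output crosses the stage boundary
  }
  return results;
}

// Worker: fixed sequence, structured output.
const listFiles: Stage = {
  name: "list-files",
  kind: "worker",
  run(_input, session) {
    session.context.push("ran fixed listing step");
    return { files: ["a.png", "b.png"] };
  },
};

// Skill: stubbed here; a real one would let a model decide how to proceed.
const pickCover: Stage = {
  name: "pick-cover",
  kind: "skill",
  run(input, session) {
    const files = Array.isArray(input.files) ? (input.files as string[]) : [];
    session.context.push(`considering ${files.length} candidates`);
    session.context.push("chose first candidate");
    return { cover: files.length > 0 ? files[0] : null };
  },
};

const playbookResults = runPlaybook([listFiles, pickCover], {});
```

In this sketch the second stage sees only the worker's structured output, never its session history, which is the property the text attributes to playbooks.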

Understudy is model-agnostic by design. It ships with adapters for OpenAI, Anthropic Claude, Google Gemini, and other providers, and requires only your own API keys: no subscription, no Understudy-hosted cloud, and no vendor lock-in. Model requests still go to whichever provider you configure. The runtime is published as an npm package and controlled from the command line.

What You Get

Full desktop automation — GUI automation with screenshot grounding and native macOS input injection across any desktop application
Dual browser modes — managed Playwright sessions for clean automation and a Chrome extension relay for attaching to your real browser profile with existing sessions
Teach by demonstration — record a task once, have the agent extract the intent and generate a reusable skill that generalizes across UI changes and different apps
8 messaging channel adapters — built-in integrations for Telegram, Slack, Discord, WhatsApp, Signal, LINE, iMessage, and Web for remote dispatch and notifications
Three-artifact composition system — Playbooks, Workers, and Skills compose into multi-stage pipelines where scripted determinism and agentic autonomy operate side by side
Policy pipeline and trust engine — configurable allow/deny/require-approval policies per tool call, with rate-limiting and read-only auto-approval
Bring-your-own-model — works with OpenAI, Anthropic, Gemini, and other providers; no vendor-managed cloud required
Workflow crystallization — the runtime accumulates execution history and can harden successful paths into reusable skills automatically
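The policy pipeline item above (allow, deny, or require approval per tool call, with rate limiting and read-only auto-approval) could look roughly like the sketch below. The shapes of PolicyDecision, ToolCall, and PolicyConfig, the tool names, and the rule ordering are all assumptions made for illustration; the real types and precedence may differ.

```typescript
// Assumed shapes for illustration; Understudy's real policy types may differ.
type PolicyDecision =
  | { kind: "allow"; reason: string }
  | { kind: "deny"; reason: string }
  | { kind: "require-approval"; reason: string };

interface ToolCall {
  tool: string;
  readOnly: boolean;
}

interface PolicyConfig {
  deny: string[];            // tools that are never allowed
  requireApproval: string[]; // tools that need a human to confirm
  maxCallsPerWindow: number; // simple per-tool rate limit
}

class PolicyPipeline {
  private readonly config: PolicyConfig;
  private readonly counts = new Map<string, number>();

  constructor(config: PolicyConfig) {
    this.config = config;
  }

  evaluate(call: ToolCall): PolicyDecision {
    if (this.config.deny.includes(call.tool)) {
      return { kind: "deny", reason: "tool is on the deny list" };
    }
    const used = (this.counts.get(call.tool) ?? 0) + 1;
    this.counts.set(call.tool, used);
    if (used > this.config.maxCallsPerWindow) {
      return { kind: "deny", reason: "rate limit exceeded" };
    }
    if (call.readOnly) {
      return { kind: "allow", reason: "read-only calls are auto-approved" };
    }
    if (this.config.requireApproval.includes(call.tool)) {
      return { kind: "require-approval", reason: "tool requires explicit approval" };
    }
    return { kind: "allow", reason: "no rule matched" };
  }

  // Call when the rate-limit window rolls over.
  resetWindow(): void {
    this.counts.clear();
  }
}

// Hypothetical tool names, chosen only for the example.
const policy = new PolicyPipeline({
  deny: ["shell.rm"],
  requireApproval: ["message.send"],
  maxCallsPerWindow: 2,
});
```

One design choice worth noting: this sketch checks the deny list and the rate limit before the read-only shortcut, so even harmless calls cannot bypass either. Whether Understudy orders its checks this way is not stated in the source.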

Common Use Cases

Cross-app desktop automation — send a file from a desktop application, convert it, and deliver it via a messaging app, all triggered by a single phone message
Recurring research workflows — instruct the agent once to monitor a website, extract structured information, and post a summary to a Slack channel on a schedule
Teach-and-replay personal workflows — demonstrate a repetitive image editing and export sequence once, then replay it across dozens of files with natural language variations
Multi-stage content pipelines — compose a playbook that browses an app store, installs on a real device via iPhone Mirroring, explores autonomously, edits a video, and uploads to YouTube with zero human intervention
Remote computer control from mobile — dispatch tasks to your Mac from a phone through Telegram or WhatsApp and receive results without needing a VPN or remote desktop client
Local automation for sensitive work — run document processing or personal workflows on your own machine with no Understudy-hosted service in the path; prompts still go to whichever model provider your API keys belong to

Under The Hood

Architecture

Understudy is organized as a pnpm monorepo with a clearly layered package structure. A core package owns the session orchestrator, tool registry, trust engine, system prompt builder, and workflow crystallization pipeline. A gui package encapsulates the native macOS GUI runtime with screenshot grounding. A gateway package exposes a WebSocket and REST API so external clients and messaging channels can drive sessions. A channels package contains per-messenger adapter modules.

At runtime, the orchestrator assembles tools from the registry, wraps them in a policy pipeline and watchdog, builds a system prompt from declarative parameter sections, and delegates execution to a swappable runtime adapter: either an embedded in-process adapter or an ACP protocol adapter for external runtimes. The three-artifact composition model (Skill / Worker / Playbook) is a genuine architectural boundary: playbooks spawn each stage as an independent child session with its own context window, preventing context bleed between long pipeline stages while preserving structured output contracts between them.
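The runtime flow described above (assemble tools, wrap them in a guard, build a prompt from sections, hand off to a swappable adapter) can be illustrated with a small sketch. Every name here (Tool, RuntimeAdapter, withGuard, buildSystemPrompt, createSession) is hypothetical, and the toy adapter calls each tool once instead of consulting a model.

```typescript
// Hypothetical names throughout; this mirrors the described flow, not the real code.
interface Tool {
  name: string;
  execute(args: string): string;
}

interface RuntimeAdapter {
  id: string;
  // Runs one instruction with the given prompt and tools, returns a transcript.
  run(systemPrompt: string, tools: Tool[], instruction: string): string[];
}

// Wraps a tool so every call passes a guard first (a stand-in for the policy pipeline).
function withGuard(tool: Tool, isAllowed: (name: string) => boolean): Tool {
  return {
    name: tool.name,
    execute(args) {
      if (!isAllowed(tool.name)) return `blocked: ${tool.name}`;
      return tool.execute(args);
    },
  };
}

// Builds a prompt from declarative sections, in insertion order.
function buildSystemPrompt(sections: Record<string, string>): string {
  return Object.entries(sections)
    .map(([title, body]) => `## ${title}\n${body}`)
    .join("\n\n");
}

// Toy "embedded" adapter: calls every tool once instead of asking a model.
const embeddedAdapter: RuntimeAdapter = {
  id: "embedded",
  run(_systemPrompt, tools, instruction) {
    return tools.map((t) => `${t.name} -> ${t.execute(instruction)}`);
  },
};

function createSession(adapter: RuntimeAdapter, registry: Tool[], denied: Set<string>) {
  const tools = registry.map((t) => withGuard(t, (name) => !denied.has(name)));
  const prompt = buildSystemPrompt({
    Role: "local agent",
    Tools: tools.map((t) => t.name).join(", "),
  });
  return { prompt, run: (instruction: string) => adapter.run(prompt, tools, instruction) };
}

const registry: Tool[] = [
  { name: "shell", execute: (args) => `would run: ${args}` },
  { name: "browser", execute: (args) => `would open: ${args}` },
];
const session = createSession(embeddedAdapter, registry, new Set(["shell"]));
const transcript = session.run("status");
```

Because the session only depends on the RuntimeAdapter interface, an adapter for an external runtime could be swapped in without touching the tool wrapping or prompt building.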

Tech Stack

The entire codebase is TypeScript on Node.js 20+ using ESM throughout, bundled with esbuild and type-checked with project references. The agent intelligence layer is provided by the @mariozechner/pi-agent-core and @mariozechner/pi-ai packages, which supply the provider-agnostic model abstraction and multi-provider API adapters. GUI automation uses native macOS input tooling supplemented by screenshot-based grounding. Browser automation is built on Playwright with an additional Chrome extension relay for attaching to live browser sessions. Messaging adapters use GrammY for Telegram, discord.js for Discord, Baileys for WhatsApp, and the Slack Bolt SDK; all are listed as optional dependencies so the core installs lean. Configuration and skill definitions use YAML frontmatter parsed at runtime. Testing runs on Vitest with V8 coverage.
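To make the frontmatter convention concrete, here is a minimal sketch of splitting a skill file into metadata and body. It handles only flat key: value pairs and is not a YAML parser; the function name parseFrontmatter and the fields name and description are assumptions for the example, not Understudy's documented schema.

```typescript
// Minimal frontmatter splitter: flat `key: value` pairs only, not full YAML.
interface ParsedDoc {
  meta: Record<string, string>;
  body: string;
}

function parseFrontmatter(source: string): ParsedDoc {
  // Expect an opening `---` line, a metadata block, and a closing `---` line.
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) return { meta: {}, body: source };
  const meta: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines that are not key/value pairs
    meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: match[2] };
}

// Hypothetical skill file; the field names are illustrative only.
const skillFile = [
  "---",
  "name: export-thumbnails",
  "description: Resize images and export them as PNG",
  "---",
  "Open the selected images and export each one at 512px wide.",
].join("\n");

const parsedSkill = parseFrontmatter(skillFile);
```

A production runtime would more likely rely on a full YAML library, since nested values, lists, and quoting are outside what this sketch handles.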

Code Quality

The repository has extensive test coverage across unit, integration, and end-to-end layers: over 140 test files spanning CLI commands, gateway workflows, GUI grounding, scheduling, browser automation, and crystallization pipelines, with separate synthetic and live E2E modes. TypeScript strict mode is enforced via tsconfig project references. Linting uses oxlint. CI runs on GitHub Actions. Error handling is explicit throughout: policy evaluation returns typed PolicyDecision unions; the trust engine logs rate-limit violations rather than swallowing them silently; tool results pass through a context guard that detects and recovers from context overflow. Inline documentation is moderate: key abstractions like TrustEngine, RuntimePolicyPipeline, and WorkflowCrystallization have JSDoc-level comments, while lower-level utility modules are largely self-documenting through clear naming.
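The context guard mentioned above is not described in detail, so the following is only one plausible shape for it: estimate how much room is left and truncate an oversized tool result so the session can continue. The function names, the marker text, and the characters-divided-by-four token estimate are all assumptions.

```typescript
// Hypothetical sketch: token counts are approximated as characters / 4.
interface GuardResult {
  text: string;
  truncated: boolean;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function guardToolResult(result: string, usedTokens: number, maxTokens: number): GuardResult {
  const remaining = Math.max(0, maxTokens - usedTokens);
  if (estimateTokens(result) <= remaining) {
    return { text: result, truncated: false };
  }
  const marker = "\n[output truncated by context guard]";
  const maxChars = remaining * 4;
  // Keep as much of the result as fits, leaving room for the marker.
  const kept = result.slice(0, Math.max(0, maxChars - marker.length));
  // The final slice guarantees the budget holds even when the marker alone is too long.
  return { text: (kept + marker).slice(0, maxChars), truncated: true };
}

const smallResult = guardToolResult("abcd".repeat(100), 0, 1000);
const largeResult = guardToolResult("x".repeat(1000), 900, 1000);
```

Truncation is the simplest recovery strategy; summarizing the result or dropping older messages would be alternatives, and the source does not say which one Understudy uses.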

What Makes It Unique

The most technically novel aspect of Understudy is its intent-extraction teach pipeline combined with route-optimization crystallization. When a user demonstrates a workflow, the system does not record pixel coordinates or a replay script. Instead, it uses the agent to extract the semantic intent of each step and annotate the execution route (GUI, browser automation, shell). On replay, steps are re-evaluated: if a GUI drag-and-drop can be replaced by a shell command or API call, the agent substitutes the faster route automatically. Combined with the three-artifact composition model, where a playbook can mix strictly scripted workers with genuinely agentic skills in the same pipeline, this gives teams a path to progressively harden automation without rewriting workflows from scratch. The eight-channel messaging dispatch system, which treats messaging apps as a first-class trigger and notification surface rather than a bolt-on integration, is also uncommon in open-source local agent runtimes.
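Route substitution on replay can be pictured with a small sketch. It assumes each taught step stores its intent, the route used during the demonstration, and any known alternatives, and that routes have a fixed preference order. The type names, field names, and cost ordering are hypothetical.

```typescript
// Hypothetical model of a taught step; field names are assumptions.
type ExecRoute = "shell" | "browser" | "gui";

interface TaughtStep {
  intent: string;             // what the step is for, not where the user clicked
  demonstratedVia: ExecRoute; // how the user performed it while teaching
  alternatives: ExecRoute[];  // other routes known to achieve the same intent
}

// Lower number means a faster, more deterministic route (an assumed ordering).
const routeCost: Record<ExecRoute, number> = { shell: 1, browser: 2, gui: 3 };

function chooseRoute(step: TaughtStep, available: ReadonlySet<ExecRoute>): ExecRoute {
  const candidates = [step.demonstratedVia, ...step.alternatives].filter((r) => available.has(r));
  if (candidates.length === 0) {
    throw new Error(`no available route for intent: ${step.intent}`);
  }
  // Pick the cheapest candidate; ties keep the earlier one.
  return candidates.reduce((best, r) => (routeCost[r] < routeCost[best] ? r : best));
}

const moveFile: TaughtStep = {
  intent: "move the exported image into the archive folder",
  demonstratedVia: "gui",
  alternatives: ["shell"],
};

const withShell = chooseRoute(moveFile, new Set<ExecRoute>(["gui", "shell"]));
const guiOnly = chooseRoute(moveFile, new Set<ExecRoute>(["gui"]));
```

Here the step demonstrated through the GUI is replayed through the shell when that route is available and falls back to the GUI otherwise. In the real system the agent re-evaluates each step, so the decision is presumably less mechanical than a static cost table.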

Self-Hosting

Understudy is released under the MIT License, one of the most permissive open-source licenses available. You are free to use, modify, distribute, and incorporate it into commercial products, with no copyleft requirements that would affect your own code. The only obligation is to retain the copyright and license notice in any distribution. There are no dual-licensing tiers, no license keys, and no enforcement mechanisms in the codebase.

Running Understudy yourself means taking on the full operational responsibility for the host macOS machine. The agent has broad system access by design — it can control any desktop application, run arbitrary shell commands, send messages through your personal messaging apps, and manage browser sessions. You are responsible for configuring the policy pipeline (allow/deny/require-approval rules) to match your risk tolerance, keeping Node.js and optional dependencies patched, and securing the API keys for whichever LLM provider you use. There is no persistent server process to manage beyond keeping the CLI available; sessions are ephemeral by default, though the workflow crystallization database accumulates on disk over time and is your responsibility to back up.

Because Understudy has no hosted cloud tier, there is no managed upgrade path, no SLA, and no vendor support channel beyond the community Discord and GitHub Issues. You do not get uptime monitoring, automated backups of your crystallized skills, or an operations team on call. The project is early — Layer 5 (proactive autonomy) is not yet implemented, and Layers 3 and 4 are partially complete — so expect API churn between versions. The trade-off is complete data sovereignty: no usage telemetry is sent to the Understudy project, and your conversation history and trained skills remain entirely on your local machine.

Languages: TypeScript (91%), Other