Arize Phoenix
Open-source AI observability platform for tracing, evaluating, and debugging LLM applications with built-in intelligence and MCP support.
Arize Phoenix is an open-source AI observability platform that gives AI engineers complete visibility into LLM applications from trace collection through evaluation and optimization. Built on OpenTelemetry and the OpenInference semantic conventions, it auto-instruments popular frameworks and LLM providers so you get detailed traces without rewriting your code.
Phoenix combines runtime observability with structured experimentation: you can capture traces in production, slice them into versioned datasets, run prompt or model swap experiments against those datasets, and score every result with LLM-based evaluators—all in one platform. The Prompt Playground lets you replay real traced calls with modified parameters before committing changes, shortening the feedback loop from days to minutes.
As of mid-2026, Phoenix ships with Phoenix Intelligence (PXI), an AI engineering agent embedded directly in the UI that can debug traces, suggest prompt improvements, and navigate the product on your behalf. It also exposes an MCP server (@arizeai/phoenix-mcp) so external tools like Cursor and Claude Desktop can query your traces, prompts, datasets, and experiments through the Model Context Protocol.
Deployment is flexible: install as a Python package (pip install arize-phoenix), pull the Docker image from Docker Hub, deploy via Helm on Kubernetes, or use Arize’s managed cloud at app.phoenix.arize.com. Sub-packages (arize-phoenix-otel, arize-phoenix-client, arize-phoenix-evals) let lightweight agents ship telemetry without pulling in the full platform.
What You Get
- OpenTelemetry-based tracing - Capture end-to-end traces of LLM calls, RAG retrievals, tool invocations, and agent steps using OTLP with auto-instrumentation for 20+ frameworks including LangChain, LlamaIndex, OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, CrewAI, and DSPy.
- LLM-powered evaluation - Automatically score outputs for relevance, correctness, hallucination, toxicity, and custom criteria using built-in evaluators or your own, with integrations for Ragas, Deepeval, and Cleanlab.
- Prompt management - Version, tag, and deploy prompts with rollback support; test prompt changes systematically in the Playground before pushing to production.
- Datasets and experiments - Create versioned datasets from production traces or CSV uploads, then run reproducible experiments to compare prompt variants, model upgrades, or retrieval strategies using consistent inputs and automated scoring.
- Phoenix Intelligence (PXI) - An AI engineering agent embedded in the Phoenix UI that can autonomously debug failing traces, iterate on prompt revisions, and navigate the platform. The agent's own activity is traced by Phoenix’s instrumentation.
- MCP server - The @arizeai/phoenix-mcp package exposes Phoenix’s traces, spans, prompts, datasets, and experiments through the Model Context Protocol so Cursor, Claude Desktop, and other MCP clients can query your observability data directly.
- Sessions and multi-turn tracking - Group related traces into sessions with the Sessions API to observe multi-turn conversation flows, annotate at the session level, and analyze conversation-level metrics.
- Span Replay and Prompt Playground - Re-run individual traced spans with modified inputs or parameters to isolate regressions; compare multiple models or prompt variants side-by-side using real production examples.
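As a rough illustration of the MCP server item above, the sketch below builds the kind of JSON entry that MCP clients such as Cursor or Claude Desktop typically read to launch a server through npx. The `mcpServers` layout follows the common MCP client convention; the `--baseUrl` and `--apiKey` flags, the `@latest` tag, and the `http://localhost:6006` address are assumptions to check against the package README before use. `build_mcp_config` is a hypothetical helper, not part of Phoenix.

```python
import json
from typing import Dict, Optional


def build_mcp_config(base_url: str, api_key: Optional[str] = None) -> Dict:
    """Build an MCP client config entry that launches the Phoenix MCP server via npx."""
    # Flag names below are assumptions; confirm them in the @arizeai/phoenix-mcp README.
    args = ["-y", "@arizeai/phoenix-mcp@latest", "--baseUrl", base_url]
    if api_key:
        # Only pass a key when the Phoenix instance has authentication enabled.
        args += ["--apiKey", api_key]
    return {"mcpServers": {"phoenix": {"command": "npx", "args": args}}}


# Assumed local Phoenix address; replace with your own deployment URL.
config = build_mcp_config("http://localhost:6006")
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON can be pasted into the MCP settings file of the client you use; where that file lives depends on the client.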
Common Use Cases
- Diagnosing a silent RAG failure - An AI engineer notices user satisfaction dropping; Phoenix traces reveal the retrieval step is returning stale embeddings before the LLM generates a confident but incorrect answer, pinpointing the exact span to fix.
- Validating a model upgrade before rollout - An MLOps team loads 500 production traces into a versioned dataset, runs an experiment comparing GPT-4o to Claude Sonnet on the same inputs, and uses LLM evaluators to confirm the new model improves answer relevance before switching traffic.
- Iterating on prompt quality with PXI - A developer uses the embedded Phoenix Intelligence agent to automatically suggest and test revised system prompts on failing trace examples, cutting iteration time from hours to minutes.
- Querying traces from an IDE via MCP - A team configures the Phoenix MCP server in Cursor so engineers can ask natural-language questions about recent traces, pull failing span details, and inspect prompt versions without switching contexts.
- Tracking multi-turn conversation quality - A customer support AI team uses the Sessions API to group traces by conversation, then applies session-level evaluators to catch cases where the assistant loses user context across turns.
- Standardizing prompts across teams - A platform organization uses Phoenix Prompt Management to version and tag approved prompt templates, ensuring all downstream services reference the same canonical prompt versions rather than drifting independently.
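The model-upgrade use case above follows a simple pattern: hold the dataset fixed, run each variant over it, and score every output with the same evaluator. The sketch below shows that pattern in plain Python. It does not use the Phoenix client API; `Example`, `run_experiment`, `exact_match`, and the two canned "models" are hypothetical stand-ins, and a real setup would call actual models and would usually use an LLM-based evaluator instead of string matching.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Example:
    question: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Toy evaluator: 1.0 when the answer matches the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    dataset: List[Example],
    task: Callable[[str], str],
    evaluator: Callable[[str, str], float],
) -> float:
    """Run one variant over every example and return its mean score."""
    return mean([evaluator(task(ex.question), ex.expected) for ex in dataset])


# Fixed dataset shared by both variants, so the scores are comparable.
dataset = [
    Example("Capital of France?", "Paris"),
    Example("2 + 2?", "4"),
    Example("Largest planet?", "Jupiter"),
]

# Canned answers standing in for two models or prompt versions.
BASELINE: Dict[str, str] = {"Capital of France?": "Paris", "2 + 2?": "5", "Largest planet?": "Jupiter"}
CANDIDATE: Dict[str, str] = {"Capital of France?": "Paris", "2 + 2?": "4", "Largest planet?": "Jupiter"}

baseline_score = run_experiment(dataset, BASELINE.__getitem__, exact_match)
candidate_score = run_experiment(dataset, CANDIDATE.__getitem__, exact_match)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

Keeping the dataset and evaluator constant is what makes the comparison meaningful: any difference in score can be attributed to the variant itself.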
Under The Hood
Architecture
Phoenix is structured as a monorepo with a clean Python backend and a TypeScript React frontend, each independently versioned and deployed but tightly integrated through a Strawberry GraphQL API layer. The backend follows a layered design: an async Starlette and FastAPI HTTP tier handles REST and GraphQL endpoints, a service layer encapsulates business logic, and a SQLAlchemy async data layer abstracts over SQLite for local use and PostgreSQL for production deployments. DataLoader patterns batch expensive database queries to prevent N+1 problems in GraphQL resolvers. Phoenix also ships a pydantic-ai powered agent runtime (PXI) built directly into the server, with capability-based composition that wires tool access, web search, documentation lookups, and sub-agent calls through a unified context dependency injection system.
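To make the DataLoader idea concrete, here is a minimal asyncio sketch of the batching technique: keys requested during the same event-loop tick are collected and fetched in one call, which is what prevents one query per resolver. This is not Phoenix's implementation; `TinyDataLoader` and `fetch_spans_by_trace_ids` are hypothetical names, and the batch function returns canned strings instead of querying a database.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set


class TinyDataLoader:
    """Collects keys requested in one event-loop tick and loads them in a single batch."""

    def __init__(self, batch_fn: Callable[[List[int]], Awaitable[List[str]]]) -> None:
        self._batch_fn = batch_fn
        self._pending: Dict[int, asyncio.Future] = {}
        self._scheduled = False
        self._tasks: Set[asyncio.Task] = set()  # keep strong references to running tasks

    def load(self, key: int) -> asyncio.Future:
        loop = asyncio.get_running_loop()
        if key not in self._pending:
            self._pending[key] = loop.create_future()  # duplicate keys share one future
        if not self._scheduled:
            # The dispatch task only starts once the current coroutine yields,
            # so every load() issued before that point lands in the same batch.
            self._scheduled = True
            task = loop.create_task(self._dispatch())
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return self._pending[key]

    async def _dispatch(self) -> None:
        pending, self._pending = self._pending, {}
        self._scheduled = False
        keys = list(pending)
        try:
            values = await self._batch_fn(keys)
        except Exception as exc:
            for future in pending.values():
                future.set_exception(exc)
            return
        for key, value in zip(keys, values):
            pending[key].set_result(value)


calls: List[List[int]] = []  # records each batch so the batching is observable


async def fetch_spans_by_trace_ids(trace_ids: List[int]) -> List[str]:
    """Stand-in for a single `WHERE trace_id IN (...)` query."""
    calls.append(list(trace_ids))
    return [f"spans-for-trace-{i}" for i in trace_ids]


async def main() -> List[str]:
    loader = TinyDataLoader(fetch_spans_by_trace_ids)
    # Three resolver-style lookups, two of them for the same key.
    return await asyncio.gather(loader.load(1), loader.load(2), loader.load(1))


results = asyncio.run(main())
print(results, calls)
```

Three lookups produce a single batch call with two distinct keys. Strawberry provides its own DataLoader class, which is usually the better choice in a real GraphQL server.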
Tech Stack
The backend is Python 3.10 through 3.14 with Starlette, FastAPI, and Strawberry GraphQL, using SQLAlchemy with asyncio support and Alembic for migrations across both SQLite and PostgreSQL 16. Observability is built on OpenTelemetry SDK and OpenInference instrumentation packages. Authentication integrates OAuth2 via Authlib, LDAP3 for enterprise SSO, and JWT token management with joserfc. The PXI agent runtime uses pydantic-ai with direct support for Anthropic, OpenAI, Bedrock, and Google providers. The frontend is a React TypeScript app built with Vite, using Relay for type-safe GraphQL data fetching, Zustand for state management, and CodeMirror for rich editing surfaces. Container images use distroless Debian 13 base images for minimal attack surface, and Helm charts are provided for Kubernetes deployments.
Code Quality
The codebase maintains extensive test coverage across unit, integration, and end-to-end layers with pytest on the backend and Vitest and Playwright on the frontend. Type safety is enforced throughout: Python uses strict typing with Pydantic v2 for data validation and SQLAlchemy mapped column types, while TypeScript runs in strict mode with auto-generated GraphQL types ensuring frontend-backend contract fidelity. CI enforces quality through uv for dependency management, tox for test matrix execution, pre-commit hooks, oxlint for fast JavaScript linting, and automated release versioning via release-please. Error handling is explicit throughout the GraphQL layer with typed error unions and custom exception hierarchies.
What Makes It Unique
Phoenix’s deepest differentiator is collapsing the observability-to-experimentation loop into a single platform: production traces flow directly into versioned datasets, which feed experiments that are automatically scored by the same LLM evaluators used in production monitoring, creating a closed feedback cycle. The embedded PXI agent, itself instrumented by Phoenix’s own tracing, adds a layer of AI-assisted debugging that no other open-source observability tool provides. The MCP server integration turns Phoenix’s data store into an active participant in developer toolchains, letting IDE agents and AI assistants query live observability state rather than exporting data to external systems.
Self-Hosting
Arize Phoenix is released under the Elastic License 2.0 (ELv2). ELv2 is source-available rather than open source in the OSI sense: you can use it freely for personal projects, internal tooling, and building your own AI applications, and you can modify and redistribute the code. The key restriction is that you may not offer Phoenix itself as a hosted or managed service to third parties—meaning you cannot build and sell a Phoenix-as-a-Service product. For the vast majority of self-hosters—teams running Phoenix inside their own infrastructure to observe their own applications—ELv2 imposes no practical restrictions.
Running Phoenix yourself is operationally realistic for small to mid-size teams. For development and evaluation, a single pip install arize-phoenix and phoenix serve is sufficient. For production you will want a PostgreSQL 16 database (SQLite is supported but not recommended under concurrent write loads), a persistent volume for trace storage, and either Docker or a Kubernetes deployment via the provided Helm chart. Phoenix exposes Prometheus metrics natively so it integrates with existing monitoring stacks. The main operational burden is database management: you own backups, schema migrations via Alembic, and scaling the database under high trace ingestion volumes. Arize publishes Docker images on Docker Hub and maintains Helm charts, so upgrades are straightforward, though you bear responsibility for timing them against your own deployment cadence.
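The production guidance above can be turned into a small pre-flight check. The sketch below assembles the environment for a Phoenix server process and refuses a non-PostgreSQL database URL. `PHOENIX_SQL_DATABASE_URL` and `PHOENIX_WORKING_DIR` are believed to be the documented variable names, but treat them as assumptions and confirm them in the Phoenix configuration docs. `production_env`, the hostname, the credentials, and the path are hypothetical placeholders.

```python
from typing import Dict
from urllib.parse import urlsplit


def production_env(database_url: str, working_dir: str) -> Dict[str, str]:
    """Return environment variables for a production Phoenix process.

    Rejects anything that is not a PostgreSQL URL, since SQLite is not
    recommended under concurrent write loads.
    """
    # Accept driver-qualified schemes such as "postgresql+asyncpg".
    scheme = urlsplit(database_url).scheme.split("+")[0]
    if scheme not in ("postgresql", "postgres"):
        raise ValueError(f"expected a PostgreSQL URL for production, got scheme {scheme!r}")
    return {
        "PHOENIX_SQL_DATABASE_URL": database_url,  # assumed variable name
        "PHOENIX_WORKING_DIR": working_dir,  # assumed variable name; point at a persistent volume
    }


env = production_env(
    "postgresql://phoenix:change-me@db.internal:5432/phoenix",  # placeholder credentials
    "/var/lib/phoenix",
)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The resulting mapping could be passed to a container definition or a process supervisor; how it is injected depends on whether you deploy with Docker or the Helm chart.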
Arize AI offers a managed cloud version of Phoenix at app.phoenix.arize.com that removes the infrastructure burden entirely. The cloud tier handles uptime, database management, and automatic upgrades, and it provides enterprise support SLAs, SSO via LDAP or OAuth2, and team collaboration features. Self-hosters get full feature parity on core observability and evaluation capabilities but give up managed backups, guaranteed uptime SLAs, and access to Arize’s support team. For teams whose primary concern is avoiding data egress or meeting strict data residency requirements, self-hosting under ELv2 is a sound choice; for teams that want zero ops overhead, the managed cloud is the better path.