Arize Phoenix is an open-source platform designed for debugging, evaluating, and improving LLM-powered applications. It targets AI engineers, MLOps teams, and developers building production-grade AI systems who need visibility into model behavior, prompt performance, and evaluation metrics. Phoenix solves the challenge of opaque LLM workflows by providing end-to-end observability—from tracing individual LLM calls to measuring output quality with automated evaluations.
Built on OpenTelemetry and OpenInference, Phoenix supports auto-instrumentation for major frameworks like LangChain, LlamaIndex, OpenAI Agents, Anthropic, and Vercel AI SDK. It runs locally, in Jupyter notebooks, via Docker, or in the cloud, with a modular architecture that includes Python and TypeScript SDKs for tracing, evaluation, and client-server communication.
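Because Phoenix traces follow the OpenInference semantic conventions, a captured LLM span is ultimately just a set of OpenTelemetry attributes with well-known keys. A minimal sketch of the payload such a span might carry (the attribute keys below are taken from the OpenInference conventions; the values and the `is_llm_span` helper are illustrative, not Phoenix APIs):

```python
# Sketch of the attributes an OpenInference-instrumented LLM span carries.
# Keys follow the OpenInference semantic conventions; values are made up.
llm_span = {
    "name": "ChatCompletion",
    "attributes": {
        "openinference.span.kind": "LLM",
        "llm.model_name": "gpt-4o-mini",
        "llm.token_count.prompt": 182,
        "llm.token_count.completion": 47,
        "input.value": "What is the refund policy?",
        "output.value": "Refunds are issued within 30 days of purchase.",
    },
}

def is_llm_span(span: dict) -> bool:
    """Hypothetical filter: pick out LLM spans in a trace by their kind."""
    return span["attributes"].get("openinference.span.kind") == "LLM"
```

Retrieval, tool, and chain steps use the same mechanism with a different `openinference.span.kind`, which is what lets one trace show the full pipeline end to end.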
What You Get
- Tracing - Capture detailed, step-by-step traces of LLM calls, retrieval steps, tool usage, and custom logic over the OpenTelemetry protocol (OTLP), with auto-instrumentation for LangChain, LlamaIndex, OpenAI, Anthropic, and Bedrock.
- LLM-Based Evaluation - Automatically score LLM outputs using pre-built or custom evaluators for relevance, correctness, and hallucination detection, with integrations for Ragas, Deepeval, and Cleanlab.
- Prompt Management - Version, store, and deploy prompts with tagging and rollback support, enabling systematic testing and collaboration across teams.
- Prompt Playground - Experiment with prompt variants and LLM models side-by-side, replaying real traces to compare outputs before deployment.
- Datasets & Experiments - Create versioned datasets from production traces or CSV uploads, then run experiments to compare model or prompt changes using consistent inputs and automated evaluation metrics.
- Span Replay - Debug LLM behavior by re-running traced spans with modified inputs or parameters to isolate issues without re-executing the full pipeline.
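The span-replay idea in the last bullet can be sketched in a few lines: keep the recorded inputs of one step, override the parameter under suspicion, and re-run only that step. This is a conceptual sketch, not Phoenix's API; `Span`, `replay_span`, and `llm_step` are hypothetical names.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Span:
    """Minimal stand-in for a recorded trace span."""
    name: str
    inputs: dict
    outputs: dict = field(default_factory=dict)

def replay_span(span: Span, step: Callable[[dict], dict], **overrides) -> Span:
    """Clone a recorded span, apply input overrides, and re-run just that step."""
    new_inputs = {**copy.deepcopy(span.inputs), **overrides}
    return Span(name=span.name, inputs=new_inputs, outputs=step(new_inputs))

# Hypothetical stand-in for the traced LLM call.
def llm_step(inputs: dict) -> dict:
    return {"text": f"[temp={inputs['temperature']}] answer to: {inputs['prompt']}"}

original = Span("llm-call", {"prompt": "Summarize the doc", "temperature": 0.7})
original.outputs = llm_step(original.inputs)

# Replay the same span at temperature 0 to isolate sampling effects,
# leaving the original recording untouched.
replayed = replay_span(original, llm_step, temperature=0.0)
```

The key property is that the original trace stays immutable: replays produce new spans you can diff against the recording, without re-executing retrieval or upstream steps.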
Common Use Cases
- Debugging failed LLM responses in production - An AI engineer uses Phoenix tracing to inspect a failed customer support bot response, identifying that the retrieval step returned irrelevant documents before the LLM generated a hallucinated answer.
- Optimizing prompts for RAG systems - A data scientist compares 5 prompt variants using the Prompt Playground and Dataset Evaluators to find the highest-relevance configuration for a knowledge base QA system.
- Validating model upgrades before deployment - An MLOps team runs an experiment comparing GPT-4-turbo vs. Claude 3 on a production dataset, using automated LLM-based evaluators to confirm improved accuracy before rolling out the new model.
- Scaling prompt testing across teams - A product team uses Phoenix’s Prompt Management to version and share approved prompts across engineering, design, and customer success, ensuring consistency in AI-generated content.
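The prompt-comparison workflow in these use cases reduces to a simple loop: run each variant over the same dataset and average an evaluator's scores. The sketch below uses a toy keyword-overlap judge as a stand-in for an LLM-based relevance evaluator; none of these names are Phoenix APIs.

```python
# Sketch of dataset-driven prompt comparison with an automated evaluator,
# in the spirit of Datasets & Experiments. All names are illustrative.

def judge(question: str, answer: str) -> float:
    """Toy relevance score: fraction of question terms echoed in the answer.
    A real evaluator would ask an LLM to grade the answer instead."""
    q_terms = set(question.lower().split())
    return len(q_terms & set(answer.lower().split())) / len(q_terms)

def run_experiment(template: str, dataset: list, generate) -> float:
    """Generate one answer per example and average the evaluator's scores."""
    scores = [judge(ex["question"], generate(template, ex)) for ex in dataset]
    return sum(scores) / len(scores)

dataset = [{"question": "what is the refund policy"}]

# Stand-in generator: a real one would call an LLM with the filled template.
def generate(template: str, ex: dict) -> str:
    return template.format(question=ex["question"])

grounded = run_experiment("Answer about {question}", dataset, generate)
terse = run_experiment("See the docs page", dataset, generate)
best = max([("grounded", grounded), ("terse", terse)], key=lambda p: p[1])
```

With consistent inputs and a fixed evaluator, the winning variant is reproducible, which is what makes the comparison defensible before a deployment decision.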
Under The Hood
Architecture
- Modular monorepo structure with clear separation between frontend (React/TypeScript/Relay) and backend (Starlette/SQLAlchemy/Strawberry GraphQL), enabling independent development and scaling
- Backend implements data loader patterns for efficient batched database queries, decoupling business logic from data access via dependency injection
- OpenTelemetry is deeply integrated to auto-instrument LLM frameworks and capture end-to-end trace data for observability
- Frontend uses GraphQL fragments with Relay for type-safe data fetching and React context for dynamic state management
- Extensibility is built-in through plugin-style SDK instrumentation and client-side tracing hooks that dynamically wrap third-party LLM libraries
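The data loader pattern mentioned above is worth a concrete sketch: loads issued independently during one event-loop tick are coalesced into a single batched query, so resolvers can stay naive while the database sees one round trip. This is the generic pattern, not Phoenix's implementation.

```python
import asyncio

class DataLoader:
    """Minimal batched loader: per-key loads issued in the same event-loop
    tick are coalesced into one call to batch_fn (the pattern only; not
    Phoenix's actual implementation)."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn
        self._queue = []

    async def load(self, key):
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((key, fut))
        if len(self._queue) == 1:
            # First load of this tick schedules one dispatch for the batch.
            asyncio.get_running_loop().call_soon(self._dispatch)
        return await fut

    def _dispatch(self):
        pending, self._queue = self._queue, []
        asyncio.ensure_future(self._run(pending))

    async def _run(self, pending):
        results = await self._batch_fn([k for k, _ in pending])
        for (_, fut), value in zip(pending, results):
            fut.set_result(value)

calls = []

async def batch_fetch(keys):
    calls.append(list(keys))     # record each batch issued
    return [k * 2 for k in keys]  # stand-in for one batched DB query

async def main():
    loader = DataLoader(batch_fetch)
    # Three independent loads in the same tick coalesce into one batch.
    return list(await asyncio.gather(loader.load(1), loader.load(2), loader.load(3)))

results = asyncio.run(main())
```

Paired with GraphQL resolvers, this turns the classic N+1 query pattern into a single batched fetch per request tick.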
Tech Stack
- Python 3.10–3.14 backend powered by Starlette and FastAPI, with async database access through SQLAlchemy's asyncio support and aiosqlite
- GraphQL API layer built with Strawberry, leveraging incremental execution for performance-critical queries
- OpenTelemetry SDK and instrumentations provide comprehensive tracing for FastAPI, gRPC, SQLAlchemy, and LLM providers
- PostgreSQL 16 with Alembic migrations and Dockerized deployment using distroless images for security and reproducibility
- Frontend built with TypeScript and pnpm on a Vite-style build pipeline, with CodeMirror for in-browser editing and Zustand for state management
- CI/CD pipeline enforces quality with uv, tox, pre-commit hooks, and automated versioning
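For the Dockerized deployment mentioned above, a typical invocation looks like the following. This is a hedged sketch: the image name, ports (6006 for the UI, 4317 for OTLP gRPC), and the `PHOENIX_SQL_DATABASE_URL` variable reflect Phoenix's deployment docs as I understand them, so verify against the current release before relying on them.

```shell
# Run Phoenix against an external PostgreSQL database (sketch; verify
# image tag, ports, and env var names against current Phoenix docs).
docker run \
  -p 6006:6006 \
  -p 4317:4317 \
  -e PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@db:5432/phoenix" \
  arizephoenix/phoenix:latest
```

Omitting the database URL falls back to an embedded SQLite store, which is convenient for local evaluation but not for multi-user production use.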
Code Quality
- Extensive test coverage across unit, integration, and end-to-end layers with Vitest and Playwright for robust validation
- Strong type safety enforced throughout via TypeScript, generated GraphQL types, and environment variable validation with type guards
- Clean, modular code organization with dedicated test suites for each layer and comprehensive error handling
- Automated quality gates with Oxlint and Oxcfmt ensure consistent style and reduce technical debt
- Sophisticated mocking and test isolation techniques enable deterministic testing of async and external service interactions
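The deterministic-async-testing point can be illustrated with the standard library alone: `unittest.mock.AsyncMock` stands in for an external awaitable dependency so the test never touches the network. The `summarize` function and client shape below are illustrative, not Phoenix internals.

```python
import asyncio
from unittest.mock import AsyncMock

# Sketch: deterministic testing of an async external dependency (an LLM
# client here) using unittest.mock.AsyncMock. Names are hypothetical.

async def summarize(client, text: str) -> str:
    """Code under test: awaits an external LLM client."""
    response = await client.complete(prompt=f"Summarize: {text}")
    return response.strip()

# Replace the real client with an AsyncMock so the call is instant,
# offline, and returns a fixed value every run.
client = AsyncMock()
client.complete.return_value = "  a short summary  "

result = asyncio.run(summarize(client, "long document"))

# The mock also records how it was awaited, letting the test pin down
# the exact prompt sent to the external service.
client.complete.assert_awaited_once_with(prompt="Summarize: long document")
```

Because the mock fixes both the response and the latency, tests of retry logic, timeouts, and prompt construction become repeatable in CI.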
What Makes It Unique
- Deep GraphQL-Relay integration enables type-safe, incremental data fetching that eliminates manual data synchronization in complex UIs
- Bidirectional type consistency between Python backend and React frontend through auto-generated schemas and hooks
- Dynamic evaluator input injection via React context allows real-time adaptation of evaluation criteria based on live LLM outputs
- Custom prompt versioning with unified provider bindings ensures auditability and prevents configuration drift
- D3-powered color schemes paired with alphabetic indexing give model outputs an intuitive visual categorization
- Developer experience enhanced by monorepo-grade tooling (Oxlint, Oxc, pre-commit) in a single-project context
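The prompt-versioning-with-rollback idea reduces to an append-only version history plus movable tags. The sketch below shows the concept only; `PromptStore` and its methods are hypothetical, not Phoenix's API.

```python
# Concept sketch of prompt versioning with tags and rollback
# (illustrative; not Phoenix's actual API).

class PromptStore:
    def __init__(self):
        self._versions: list[str] = []  # append-only history
        self._tags: dict[str, int] = {}  # tag name -> version index

    def save(self, template: str) -> int:
        """Store a new immutable version; return its index."""
        self._versions.append(template)
        return len(self._versions) - 1

    def tag(self, name: str, version: int) -> None:
        """Point a tag (e.g. 'production') at a version."""
        self._tags[name] = version

    def get(self, tag: str) -> str:
        """Resolve a tag to its current template."""
        return self._versions[self._tags[tag]]

store = PromptStore()
v0 = store.save("You are a support agent. Answer: {question}")
v1 = store.save("You are a concise support agent. Answer: {question}")
store.tag("production", v1)

# Rollback is just retagging: consumers resolving the tag pick up v0
# immediately, while every version stays in the audit history.
store.tag("production", v0)
```

Because versions are immutable and only tags move, every deployment state is auditable and a rollback never loses the history that produced it.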