Arize Phoenix is an open-source AI observability platform designed to help developers and ML engineers trace, evaluate, and troubleshoot LLM-powered applications. It addresses the black-box nature of large language models by providing end-to-end visibility into LLM workflows, from prompt execution and retrieval to model responses and performance metrics. Built for practicality, Phoenix supports popular frameworks like LangChain, LlamaIndex, and Haystack, and integrates with major LLM providers including OpenAI, Anthropic, Bedrock, and MistralAI. Whether you're debugging a failed RAG pipeline or refining prompts, Phoenix gives you the data-driven insights to iterate confidently.
Phoenix is vendor- and language-agnostic, running locally in Jupyter notebooks, as a Docker container, or in production cloud environments. It combines tracing, evaluation, dataset versioning, and prompt management into a unified interface, eliminating the need for fragmented tooling. The platform is designed for teams building production-grade AI applications who need to monitor performance, validate outputs, and manage changes systematically.
What You Get
- LLM Tracing - Capture and visualize end-to-end traces of LLM calls using OpenTelemetry-based instrumentation with out-of-the-box support for LangChain, LlamaIndex, Haystack, DSPy, and smolagents. Trace prompts, embeddings, retrievals, and model responses with automatic context correlation (see the instrumentation sketch after this list).
- LLM Evaluation - Automate evaluation of LLM outputs using built-in metrics like answer relevance, retrieval precision, and faithfulness. Use LLM-as-a-judge evaluators to benchmark outputs at scale without manual annotation (see the evaluation sketch after this list).
- Versioned Datasets - Create, store, and version datasets of prompts, responses, and embeddings for consistent experimentation and fine-tuning. Track dataset lineage across experiments to ensure reproducibility.
- Experiment Tracking - Compare different prompts, models, or retrieval configurations side-by-side with quantitative metrics. Measure the impact of changes over time and identify regressions before deployment.
- Prompt Engineering Playground - Iteratively test prompts, adjust parameters (temperature, max_tokens), and replay traced LLM calls in an interactive interface. See real-time results and compare variants without re-running code.
- Prompt Management - Systematically manage prompt versions with tagging, branching, and rollback capabilities. Link prompts to specific experiments and datasets for auditability and collaboration.
- OpenTelemetry Integration - Instrument LLM applications with minimal code changes using arize-phoenix-otel. Auto-instrument traces via OpenInference and send data to Phoenix for visualization without modifying application logic.
- Multi-Cloud & Local Deployment - Run Phoenix on your local machine via Docker or Jupyter, or deploy to Kubernetes using Helm charts. Supports cloud-hosted instances at app.phoenix.arize.com for team collaboration.
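The tracing, OpenTelemetry, and local-deployment items above can be wired together in a few lines. The following is a minimal sketch, assuming the arize-phoenix, arize-phoenix-otel, and openinference-instrumentation-langchain packages are installed; the project name is illustrative.

```python
# Sketch: launch Phoenix locally and auto-instrument a LangChain app.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start a local Phoenix instance (UI typically served at http://localhost:6006).
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(project_name="rag-assistant")  # project name is arbitrary

# Auto-instrument LangChain so chains, retrievers, and LLM calls emit spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# Any LangChain pipeline run from here on produces traces viewable in the UI.
print(session.url)
```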
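For the evaluation item, a hedged sketch of LLM-as-a-judge scoring with the phoenix.evals helpers is shown below; the dataframe stands in for spans exported from Phoenix, and the judge model and column names follow the RAG relevancy template's conventions (treat exact names and signatures as assumptions that may vary across Phoenix versions).

```python
# Sketch: score retrieval relevance with an LLM judge via phoenix.evals.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Stand-in for query/document pairs exported from Phoenix traces.
df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "reference": ["To reset your password, open Settings > Security ..."],
    }
)

judge = OpenAIModel(model="gpt-4o-mini")  # requires OPENAI_API_KEY in the environment
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())  # allowed labels, e.g. relevant/unrelated

# Returns a dataframe with one label (and optional explanation) per row.
relevance = llm_classify(
    dataframe=df,
    model=judge,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,
)
print(relevance[["label", "explanation"]])
```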
Common Use Cases
- Building a RAG-based customer support assistant - Track how different retrievers impact answer quality, evaluate hallucinations in responses, and version prompt templates to ensure consistent performance across user queries.
- Optimizing a multi-model LLM pipeline - Compare OpenAI GPT-4, Anthropic Claude 3, and Mistral-7B outputs for the same prompt using Phoenix’s evaluation metrics to select the best model per use case (see the experiment sketch after this list).
- Debugging a failed production LLM workflow - Replay a traced session to identify where the retrieval step returned irrelevant documents, then test new embedding models or chunking strategies in the Playground before redeploying.
- DevOps teams managing LLM deployments - Use Phoenix to monitor model performance drift, track prompt version changes across environments (dev/staging/prod), and automate evaluation gates before deploying new prompts to production.
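As an illustration of the model-comparison workflow above, the sketch below scores one configuration against a small ground-truth dataset using Phoenix datasets and experiments; the dataset columns, the task function, and the toy evaluator are hypothetical, and exact phoenix.experiments signatures may differ across Phoenix versions.

```python
# Sketch: evaluate one model/prompt configuration against a versioned dataset.
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

client = px.Client()  # assumes a running Phoenix instance

# Upload a tiny ground-truth dataset (columns are illustrative).
dataset = client.upload_dataset(
    dataset_name="support-questions",
    dataframe=pd.DataFrame(
        {
            "question": ["How do I reset my password?"],
            "expected_answer": ["Open Settings > Security and choose Reset."],
        }
    ),
    input_keys=["question"],
    output_keys=["expected_answer"],
)

def answer_with_model_a(input):
    """Hypothetical task: configuration A (model + prompt) answers the question.
    A real task would call your LLM pipeline; a canned reply keeps the sketch runnable."""
    return f"Open Settings > Security and choose Reset. (question: {input['question']})"

def contains_expected(output, expected) -> bool:
    """Toy evaluator: does the answer mention the expected text?"""
    return expected["expected_answer"].lower() in output.lower()

run_experiment(
    dataset,
    answer_with_model_a,
    evaluators=[contains_expected],
    experiment_name="model-a-baseline",
)
# Run a second experiment with a different model or prompt and compare
# the two side by side in the Phoenix UI.
```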
Under The Hood
Phoenix is an open-source observability and evaluation platform tailored for large language model (LLM) applications, offering tools to monitor, debug, and improve generative AI systems. It combines a modern full-stack architecture with deep integration into LLM-specific workflows, enabling developers to trace, annotate, and evaluate their AI applications effectively.
Architecture
Phoenix follows a modular, multi-layered architecture that cleanly separates frontend UI, backend services, and data processing components. It leverages GraphQL for API interactions and supports both web and CLI-based workflows.
- Clear separation between frontend (React/TypeScript) and backend (Python/GraphQL) with well-defined data flows
- Modular component structure that enables extensibility and reuse across different AI application types
- Strong emphasis on observability with comprehensive tracing and annotation support for LLMs
Tech Stack
Phoenix pairs a modern React/TypeScript frontend with a Python backend, making extensive use of GraphQL and relational databases.
- Python-based backend using FastAPI and SQLAlchemy for robust data handling and API services
- React/TypeScript frontend with Relay for efficient GraphQL data fetching and state management
- Comprehensive ecosystem including OpenTelemetry integration, Docker support, and extensive testing frameworks
- Strong focus on developer experience with extensive documentation, examples, and contribution guidelines
Code Quality
Phoenix demonstrates solid code quality with consistent patterns, comprehensive testing, and clear architectural boundaries.
- Extensive test coverage including unit, integration, and end-to-end tests across frontend and backend components
- Strong type safety with TypeScript and Pydantic models ensuring data consistency and validation
- Consistent code style and linting practices with pre-commit hooks and automated formatting
- Comprehensive error handling and boundary condition management across all system layers
What Makes It Unique
Phoenix stands out through its specialized focus on LLM observability and evaluation, offering unique capabilities for monitoring and improving generative AI systems.
- Deep integration with LLM evaluation frameworks including custom evaluator creation and semantic scoring
- Comprehensive annotation and labeling system designed specifically for LLM debugging and improvement workflows
- Unique support for multi-modal tracing and dataset versioning tailored for generative AI use cases
- Extensive documentation and cookbook-style examples that bridge the gap between theory and practical implementation