Pezzo
Open-source LLMOps platform for prompt management, AI observability, intelligent caching, and real-time cost tracking across LLM providers.
Pezzo is a fully open-source, developer-first LLMOps platform that centralizes every aspect of working with large language models in production. It solves the fragmentation problem teams face when managing prompts across environments, tracking costs, debugging LLM failures, and collaborating on AI workflows — pulling all of these concerns into a single, cohesive platform.
At its core, Pezzo provides version-controlled prompt management with SHA-based versioning and environment-specific publishing, so prompts can be promoted from development to staging to production the same way you’d manage code. Paired with intelligent Redis-based caching, Pezzo intercepts LLM requests through its proxy layer and returns cached responses — dramatically reducing API costs and latency on repeated or similar requests.
Observability is built into the platform at the infrastructure level rather than bolted on after the fact. Every LLM request, whether made through the Node.js client, Python SDK, or LangChain integration, is automatically traced with full token counts, cost calculations, latency measurements, and metadata. This data flows into ClickHouse for analytical queries, enabling teams to build dashboards and set up alerts without manually instrumenting their code.
Pezzo is deployed as a Docker Compose stack combining NestJS, PostgreSQL, ClickHouse, Redis, and Supertokens. The separation between the main API server and a dedicated proxy service means teams can route LLM traffic through Pezzo with minimal changes to existing code — often just a URL swap.
What You Get
- Prompt Management - Version-controlled prompts with SHA-based versioning, environment-specific publishing (dev/staging/prod), rollback capabilities, and a visual editor for iterating on prompt content and model settings.
- Real-Time Observability - Automatic tracing of every LLM request with full metadata including token counts, cost per execution, latency, environment, provider, and model — stored in ClickHouse for fast analytical querying.
- Intelligent Response Caching - Redis-backed cache layer that intercepts identical or semantically matching requests through the proxy, returning cached responses to reduce OpenAI API costs by up to 90% and cut latency significantly.
- Transparent Proxy Service - A dedicated Express-based proxy service that sits in front of the OpenAI API, requiring only a URL change in existing code to route traffic through Pezzo for automatic reporting and caching.
- Multi-Language Client SDKs - Official Node.js and Python client libraries plus LangChain integration, enabling teams to connect to Pezzo regardless of their AI stack with minimal boilerplate.
- GraphQL API - Fully typed GraphQL API powered by Apollo Server and NestJS, enabling programmatic prompt management, metrics retrieval, credential management, and custom tooling integrations.
- Cost Analytics Dashboard - Aggregated view of token usage and dollar costs broken down by environment, model, provider, and time period — surfacing exactly where AI spend is going.
- Prompt Testing Playground - Built-in testing interface to run prompts against live provider APIs from within the console, compare outputs across versions, and validate changes before deployment.
Common Use Cases
- Reducing LLM API costs in production - A team running a GPT-4-powered support bot routes all requests through Pezzo’s proxy to cache repeated questions, dropping their monthly OpenAI bill by 60-80% without changing application code.
- Debugging AI regressions after prompt changes - An engineering team uses Pezzo’s request history to compare token counts and response quality before and after a prompt update, quickly identifying which version change caused a quality drop.
- Collaborating on prompt iterations across environments - A product team uses Pezzo’s versioning system to test new prompt variants in staging, review diff comparisons, and promote the best-performing version to production with a single click.
- Adding LLM observability to a LangChain pipeline - A data scientist integrates Pezzo’s Python client with their LangChain RAG pipeline to automatically record retrieval context, model responses, and cost per chain execution without custom logging.
- Auditing AI usage for enterprise compliance - A compliance team uses Pezzo’s ClickHouse-backed reports to produce a complete audit trail of every LLM interaction, with metadata about who made the request, from which environment, and at what cost.
- Self-hosting an AI operations platform - A startup deploys the full Pezzo stack via Docker Compose on their own infrastructure, maintaining full data residency and control while avoiding per-seat SaaS pricing.
Under The Hood
Architecture Pezzo follows a service-oriented monorepo architecture managed by Nx, with three independently deployable applications: a NestJS API server, a React/Next.js console frontend, and an Express-based proxy service. The server is organized into NestJS feature modules — prompts, reporting, credentials, caching, identity, metrics, analytics — each with clean boundaries enforced by Nx module boundary rules. The proxy service is intentionally thin, acting as a transparent pass-through that intercepts OpenAI API calls, checks the Redis cache, executes the upstream request, and pipes results back to both the caller and the reporting service. This separation means the proxy can be scaled independently from the main API. The system uses a continuation-local-storage (CLS) module to thread request context through the NestJS request lifecycle without explicit parameter passing, which is an elegant solution for audit logging and telemetry correlation.
Tech Stack The backend runs on NestJS 9 with Apollo Server 4 and GraphQL for the API layer, with Prisma handling PostgreSQL for transactional data (prompts, versions, organizations, credentials) and Knex with a custom ClickHouse dialect handling analytical workloads (request reports, metrics, cost data). Authentication is delegated entirely to Supertokens, which manages sessions, email verification, and social login through its own service. The proxy is a minimal Express application that uses Axios to forward requests to the OpenAI API. Redis 4 serves both as a cache store and for pubsub. The frontend is built with React 18 and Next.js 13, using Radix UI primitives with Tailwind CSS, plus CodeMirror for the prompt editor. The entire workspace is unified under Nx 16 with shared TypeScript libraries for types, client SDK logic, Kafka integration, and UI components.
Code Quality
The codebase demonstrates strong TypeScript discipline with consistent typing across server, client, and shared type libraries. NestJS dependency injection is used correctly throughout, with each module declaring its providers and exports explicitly. The reporting pipeline is particularly well-structured — raw execution data passes through a dedicated buildRequestReport utility that calculates costs, token counts, and durations before storage, keeping the service layer clean. Error handling in the proxy relies on Axios catch blocks that pipe upstream error responses back to callers, which is functional but not deeply typed. Test coverage is limited — no .spec.ts files were found in the main application code, meaning the test suite does not cover the service and resolver layers. The Nx workspace enforces linting boundaries, and Prettier with prettier-plugin-tailwindcss handles formatting consistently across the frontend.
What Makes It Unique
Pezzo’s most distinctive technical decision is the transparent proxy architecture: rather than requiring developers to refactor their LLM calls through a new SDK abstraction, it intercepts standard OpenAI HTTP requests and augments them transparently. This dramatically lowers the adoption barrier — existing code needs only a base URL change. Equally notable is the dual-database strategy: PostgreSQL for mutable operational data and ClickHouse for append-only analytical workloads, which is a production-grade data architecture rarely seen in developer tooling at this scale. The real-time cost calculation system, which uses provider-specific token pricing tables from the @pezzo/llm-toolkit package to compute per-request dollar costs at report ingestion time, gives teams financial observability that most LLM monitoring tools treat as an afterthought.
Self-Hosting
Pezzo is licensed under the Apache License 2.0, one of the most permissive open-source licenses available. You can use it commercially, modify the source freely, and redistribute it without any copyleft requirements — the only obligations are preserving copyright notices and the license text in any derivative distributions. There are no “open core” restrictions, dual-licensing traps, or business source clauses: the entire codebase including the proxy, server, and console is covered under the same Apache 2.0 terms.
Running Pezzo yourself requires orchestrating five infrastructure services: PostgreSQL for transactional storage, ClickHouse for analytics, Redis for caching, Supertokens for authentication, and the Pezzo server and proxy as separate containers. A docker-compose.yaml is provided and handles the full stack, but self-hosters are responsible for production hardening — persistent volume management, database backups, TLS termination, and horizontal scaling. ClickHouse in particular has non-trivial operational overhead compared to standard relational databases. Prisma migration management is manual: schema changes require running prisma migrate deploy explicitly, which needs to be built into your deployment pipeline. The team recommends Node.js 18+ and Docker as prerequisites.
Pezzo Cloud (pezzo.ai) offers a managed version of the platform that eliminates all of this operational burden. Compared to self-hosting, the managed tier adds support for multi-user organizations with role-based access, managed uptime and backups, and ongoing updates without migration management. The PEZZO_CLOUD=true environment variable in the open-source codebase enables a Notifications module that is conditionally excluded from self-hosted builds, suggesting some cloud-specific features exist that are not available to self-hosters. No public pricing page was available at time of writing; teams evaluating Pezzo Cloud should contact the company directly.