Pezzo is an open-source LLMOps platform designed for developers who need to manage, monitor, and optimize AI prompts and LLM operations at scale. It solves the fragmentation problem in AI development by unifying prompt versioning, caching, observability, and collaboration into a single platform. Built for teams using OpenAI's GPT-3/4 models, LangChain, or other LLM providers, Pezzo reduces costs and latency while improving reliability.
The platform is cloud-native and built with TypeScript, NestJS, and PostgreSQL, with ClickHouse for analytics and Redis for caching. It offers client libraries for Node.js and Python, with LangChain integration, and can be self-hosted on any infrastructure via Docker Compose.
What You Get
- Prompt Management - Version-controlled, reusable prompts with rollback and comparison tools for teams collaborating on AI workflows.
- Observability - Real-time monitoring of LLM requests, latency, token usage, and errors with detailed traces and logs.
- Caching - Intelligent response caching to reduce API costs and latency by up to 90% on repeated prompts.
- Multi-Client Support - Official Node.js and Python SDKs, plus LangChain integration for seamless embedding in LLM pipelines (see the client sketch after this list).
- Self-Hosted Deployment - Full control via Docker Compose with PostgreSQL, ClickHouse, Redis, and SuperTokens for authentication.
- GraphQL API - Programmatically manage prompts, retrieve analytics, and integrate with custom tooling using a typed GraphQL schema (illustrated below).
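To make this concrete, here is a minimal sketch of the Node.js client in action, following the shape of the documented `@pezzo/client` API: fetch a managed prompt, execute it through the OpenAI wrapper, and opt into response caching. The prompt name and variable are placeholders, and option names such as `cache` may differ between client versions.

```typescript
import { Pezzo, PezzoOpenAI } from "@pezzo/client";

// Credentials come from your Pezzo project settings.
const pezzo = new Pezzo({
  apiKey: process.env.PEZZO_API_KEY!,
  projectId: process.env.PEZZO_PROJECT_ID!,
  environment: "Production",
});

// Drop-in wrapper around the OpenAI SDK that reports traces to Pezzo.
const openai = new PezzoOpenAI(pezzo);

async function main() {
  // Fetch the currently deployed version of a managed prompt.
  // "RecommendProduct" and its variable are placeholders.
  const prompt = await pezzo.getPrompt("RecommendProduct");

  // Variables are interpolated into the managed prompt; `cache: true`
  // lets Pezzo serve repeated requests from its response cache.
  const response = await openai.chat.completions.create(prompt, {
    variables: { productCategory: "headphones" },
    cache: true,
  });

  console.log(response.choices[0].message.content);
}

main();
```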
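The GraphQL API can likewise be called directly over HTTP. The sketch below is illustrative only: the endpoint path, the `x-api-key` header, and the `prompts` query are hypothetical stand-ins rather than Pezzo's actual schema; introspect your instance (or use the generated types, as the repo does) for the real contract.

```typescript
// Endpoint path, auth header, and the `prompts` query below are
// hypothetical stand-ins for illustration only.
const query = /* GraphQL */ `
  query ListPrompts {
    prompts {
      id
      name
    }
  }
`;

const res = await fetch("http://localhost:3000/graphql", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.PEZZO_API_KEY!, // assumed auth header
  },
  body: JSON.stringify({ query }),
});

const { data } = await res.json();
console.log(data.prompts);
```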
Common Use Cases
- Running production AI chatbots - A startup uses Pezzo to version-control prompts across staging and production, monitor token usage, and cache frequent responses to cut OpenAI costs by 70% (environment handling is sketched after this list).
- Building RAG systems with LangChain - A data scientist integrates Pezzo’s Python client to track retrieval quality, cache embeddings, and debug failed queries in real time.
- Managing enterprise AI workflows - A corporate AI team uses Pezzo’s collaboration features to standardize prompts across departments and audit changes for compliance.
- Optimizing LLM APIs for mobile apps - A developer uses Pezzo’s caching and observability to reduce latency and avoid rate limits in a mobile app powered by GPT-4.
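A minimal sketch of the staging/production pattern from the first use case, assuming both environment names are defined in your Pezzo project: the same prompt name resolves to whichever version is deployed in each environment. `SupportAnswer` is a placeholder.

```typescript
import { Pezzo } from "@pezzo/client";

// One client per environment; environment names must match the ones
// defined in your Pezzo project ("Staging" is assumed here).
function makeClient(environment: "Staging" | "Production") {
  return new Pezzo({
    apiKey: process.env.PEZZO_API_KEY!,
    projectId: process.env.PEZZO_PROJECT_ID!,
    environment,
  });
}

// The same prompt name resolves to whichever version is deployed in
// each environment, so staging can trial a new revision while
// production keeps serving the approved one.
const stagingPrompt = await makeClient("Staging").getPrompt("SupportAnswer");
const productionPrompt = await makeClient("Production").getPrompt("SupportAnswer");
```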
Under The Hood
Architecture
- Monorepo structure powered by Nx with standardized generators and migrations ensuring consistent project organization across applications and libraries
- Clear separation between backend services (Prisma, ClickHouse, Kafka) and frontend components (React contexts, form fields) via independent Docker services and build contexts
- Dependency injection patterns through Prisma and custom service layers that abstract external systems like Kafka and Redis (illustrated after this list)
- Domain-specific modularity with SuperTokens for authentication, local-kms for encryption, and dedicated data pipelines for analytics and event streaming
- React frontend leverages context-based state management to eliminate prop drilling while preserving component modularity
- API layer built on NestJS (with Express under the hood) and Prisma, using versioned endpoints and middleware to decouple routing from business logic, with the surrounding infrastructure defined as code via Docker Compose
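As an illustration of that dependency-injection pattern (not Pezzo's actual source), a NestJS-style provider can wrap Redis behind a small cache interface so business logic never touches the client directly. `RequestCache` and its methods are hypothetical names.

```typescript
import { Injectable } from "@nestjs/common";
import Redis from "ioredis";

// Hypothetical abstraction: consumers inject this service rather than
// Redis itself, so the backing store can be swapped or mocked in tests.
@Injectable()
export class RequestCache {
  private readonly redis = new Redis(process.env.REDIS_URL!);

  async get<T>(key: string): Promise<T | null> {
    const raw = await this.redis.get(key);
    return raw ? (JSON.parse(raw) as T) : null;
  }

  async set(key: string, value: unknown, ttlSeconds: number): Promise<void> {
    await this.redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
  }
}
```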
Tech Stack
- NestJS backend with GraphQL and Apollo Server, integrated with Prisma and Knex for flexible database abstraction across PostgreSQL and ClickHouse
- React 18 and Next.js 13 frontend with Radix UI and Tailwind CSS, fully typed with TypeScript for robust UI development
- Nx monorepo unifying client, server, and UI modules through shared libraries and path aliases for scalable code reuse
- Modern event-driven infrastructure with SuperTokens, Redis, and Kafka for authentication, caching, and real-time event processing
- Dual-database strategy: PostgreSQL for transactional data and ClickHouse for analytical workloads, both Dockerized for reproducible environments (sketched below)
- Comprehensive tooling including GraphQL Codegen, Nx-integrated Jest, ESLint with module boundary enforcement, and Docker Compose for consistent local and production workflows
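A sketch of how the dual-database split might look in application code, assuming a hypothetical `prompt` Prisma model and a hypothetical `request_logs` ClickHouse table: transactional reads go through Prisma/PostgreSQL, while high-volume telemetry is appended to ClickHouse.

```typescript
import { PrismaClient } from "@prisma/client";
import { createClient } from "@clickhouse/client";

const prisma = new PrismaClient();
// Older @clickhouse/client versions use `host` instead of `url`.
const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL });

// Transactional data lives in PostgreSQL behind Prisma
// (`prompt` is a hypothetical model)...
const prompt = await prisma.prompt.findFirstOrThrow({
  where: { name: "SupportAnswer" },
});

// ...while high-volume request telemetry is appended to ClickHouse
// (`request_logs` is a hypothetical table).
await clickhouse.insert({
  table: "request_logs",
  values: [{ prompt_id: prompt.id, latency_ms: 412, total_tokens: 208 }],
  format: "JSONEachRow",
});
```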
Code Quality
- Consistent TypeScript configuration across all modules with shared type definitions and test environments
- Strong type safety enforced through centralized configs and shared type libraries, improving cross-module reliability
- Uniform naming conventions and standardized linting/build tooling via Nx
- Limited structured error handling, relying on reactive patterns rather than proactive error boundaries
- Test files are well organized, but many lack meaningful assertions, reducing confidence in correctness and coverage (see the example below)
- Build and linting pipelines are robust; test quality remains the main gap, since many tests exercise code without validating its output
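To make the testing critique concrete, the gap is between a test that merely executes code and one that pins down expected behavior. The Jest sketch below is generic, with a hypothetical `interpolateVariables` helper, and is not taken from the repository.

```typescript
// Jest provides `test` and `expect` as globals.
// `interpolateVariables` is a hypothetical helper, not repo code.
import { interpolateVariables } from "./prompt-utils";

// Executes the code but asserts nothing -- the pattern criticized above:
test("interpolates variables", () => {
  interpolateVariables("Hello {name}", { name: "Ada" });
});

// Pins down the expected behavior, so a regression actually fails:
test("replaces placeholders with provided values", () => {
  expect(interpolateVariables("Hello {name}", { name: "Ada" })).toBe("Hello Ada");
});
```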
What Makes It Unique
- Native observability layer that automatically traces LLM requests with real-time token costing and cache analytics, requiring no manual instrumentation
- Unified API gateway that abstracts multiple LLM providers, enriching requests with metadata and calculating cost at the proxy level (see the proxy sketch at the end of this section)
- Dynamic model-author branding that renders provider-specific logos and colors based on each model's origin, enhancing transparency
- Real-time request inspection UI that reconstructs conversational context from raw payloads, enabling debugging without client-side access
- Deep integration of caching and cost analytics into the developer workflow, transforming LLM usage into auditable, measurable metrics
- GraphQL console with auto-generated queries and mutations derived from type-safe SDKs, eliminating boilerplate while enforcing data contracts
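A sketch of the gateway pattern referenced above: pointing the stock OpenAI SDK at the Pezzo proxy, which then traces and cost-annotates requests with no SDK-level instrumentation. The proxy URL and the `X-Pezzo-*` header names are assumptions drawn from Pezzo's documentation and should be verified against your version.

```typescript
import OpenAI from "openai";

// Route OpenAI traffic through the Pezzo proxy; requests are then traced
// and cost-annotated server-side. The proxy URL and X-Pezzo-* header
// names are assumptions -- check them against your Pezzo version.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://proxy.pezzo.ai/openai/v1",
  defaultHeaders: {
    "X-Pezzo-Api-Key": process.env.PEZZO_API_KEY!,
    "X-Pezzo-Project-Id": process.env.PEZZO_PROJECT_ID!,
    "X-Pezzo-Environment": "Production",
    "X-Pezzo-Cache-Enabled": "true", // opt in to proxy-level caching
  },
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Say hello" }],
});
console.log(completion.choices[0].message.content);
```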