Agenta is an open-source LLMOps platform designed for engineering and product teams building production-grade LLM applications. It solves the chaos of scattered prompts, ad-hoc testing, and lack of visibility by unifying prompt management, evaluation, and observability into a single interface. Teams can collaborate across roles—engineers, PMs, and SMEs—while maintaining version control and data-driven decision-making.
Built with a Python (FastAPI) backend and a TypeScript (React) frontend and shipped via Docker, Agenta supports 50+ LLM providers, integrates with OpenTelemetry via OpenLLMetry and OpenInference, and offers both cloud-hosted and self-hosted deployment options. Its architecture enables seamless experimentation, traceability, and monitoring of LLM workflows without vendor lock-in.
What You Get
- Interactive LLM Playground - Compare multiple prompts and models side-by-side with real-time outputs against test cases, enabling data-driven prompt iteration.
- Multi-Model Support - Experiment with 50+ LLM providers including OpenAI, Anthropic, and self-hosted models via custom provider integration.
- Prompt Version Control - Track changes to prompts and configurations with branching, environment isolation, and rollback capabilities.
- Flexible Testsets - Create test cases from production logs, playground experiments, or upload CSVs to systematically evaluate LLM performance.
- LLM-as-a-Judge Evaluators - Use pre-built or custom evaluators (including LLM-as-a-judge) to automate quality scoring of outputs without manual review.
- OpenTelemetry Tracing - Monitor LLM calls in production with full traceability using OpenLLMetry and OpenInference standards for debugging and cost analysis.
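To make the tracing feature above concrete, here is a minimal, stdlib-only sketch of the kind of span an OpenLLMetry-style instrumentation records for each LLM call. The decorator, the `SPANS` list, and the `call_llm` stand-in are all illustrative, not Agenta's API; the attribute names loosely follow the OpenTelemetry `gen_ai` semantic conventions, and a real setup would export proper OTel spans to a collector instead of appending dicts to a list.

```python
import functools
import time

SPANS: list[dict] = []  # stand-in for an OTel span exporter


def trace_llm_call(model: str):
    """Record a span-like dict per call (simplified stand-in for
    what OpenLLMetry/OpenInference emit as real OTel spans)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"name": "llm.completion", "gen_ai.request.model": model}
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                span["status"] = "OK"
                return result
            except Exception as exc:
                span["status"] = "ERROR"
                span["error.message"] = str(exc)
                raise
            finally:
                # Latency captured on success and failure alike, which is
                # what makes cost and latency-spike analysis possible.
                span["duration_ms"] = (time.perf_counter() - start) * 1000
                SPANS.append(span)
        return wrapper
    return decorator


@trace_llm_call(model="gpt-4o-mini")
def call_llm(prompt: str) -> str:
    return f"echo: {prompt}"  # placeholder for a real provider call
```

In production the decorator's body would be replaced by the OpenLLMetry auto-instrumentation, which attaches these attributes to spans shipped over OTLP.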
Common Use Cases
- Building production RAG systems - A data scientist uses Agenta to version prompt templates, evaluate retrieval quality across models, and trace latency spikes in real time.
- Scaling AI product development - A product team collaborates with SMEs to refine prompts in a shared playground, then runs automated evaluations before deploying to users.
- Debugging LLM failures in production - An engineer saves a failed user query as a test case, reproduces it in the playground, and traces the exact step where the model hallucinated.
- Managing enterprise LLM compliance - A compliance officer audits prompt versions and evaluation results to ensure all LLM outputs meet regulatory standards before release.
Under The Hood
Architecture
- Clear separation of concerns through dedicated API routes and resource-specific handlers, enforcing modular boundaries between organizations, domains, and providers
- Domain-driven design with explicit entity classes that encapsulate business rules and validation logic
- Service layers abstract business logic from route controllers, with repository-like patterns isolating data access
- Lightweight dependency management via manual service factories rather than heavy frameworks, promoting explicit configuration
- React frontend components use composition and localized state management, avoiding global state bloat
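The "manual service factories" and layered-service bullets above can be sketched as follows. All names here (`Settings`, `PromptRepository`, `PromptService`, `build_prompt_service`) are hypothetical illustrations of the pattern, not Agenta's actual modules:

```python
from dataclasses import dataclass


@dataclass
class Settings:
    database_url: str


class PromptRepository:
    """Repository-like layer that isolates data access."""

    def __init__(self, database_url: str):
        self.database_url = database_url

    def get(self, prompt_id: str) -> dict:
        return {"id": prompt_id}  # stand-in for a real query


class PromptService:
    """Service layer holding business logic, decoupled from routes."""

    def __init__(self, repo: PromptRepository):
        self.repo = repo

    def fetch_prompt(self, prompt_id: str) -> dict:
        return self.repo.get(prompt_id)


def build_prompt_service(settings: Settings) -> PromptService:
    # Explicit wiring instead of a DI framework: every dependency is
    # constructed and passed by hand, so configuration stays visible.
    return PromptService(PromptRepository(settings.database_url))
```

A route handler would then receive the service built by the factory rather than constructing repositories itself, which keeps the modular boundaries the architecture bullets describe.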
Tech Stack
- Python backend powered by FastAPI with Pydantic for validation and SQLAlchemy with Alembic for ORM and migrations
- PostgreSQL as the primary database, enhanced with SQLAlchemy-Utils for advanced schema capabilities
- TaskIQ with Redis for asynchronous task processing and distributed workflow orchestration
- React frontend integrated with a custom UI module for agent-specific interfaces
- Environment configuration managed via python-dotenv, with standardized routing and dependency injection in FastAPI
- Docker and Kubernetes enable containerized deployment with automated CI/CD pipelines
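As a rough illustration of the TaskIQ-with-Redis pattern from the stack above, here is a self-contained asyncio sketch: producers enqueue task payloads on a broker queue and a worker consumes them asynchronously. The in-memory `asyncio.Queue` stands in for the Redis broker, and none of these names are TaskIQ's API:

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Consume queued jobs until a shutdown sentinel arrives."""
    while True:
        task = await queue.get()
        if task is None:  # shutdown sentinel
            queue.task_done()
            break
        # e.g. run an evaluation job against a testset
        results.append({"task": task, "status": "done"})
        queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    w = asyncio.create_task(worker(queue, results))
    for job in ("evaluate:testset-1", "evaluate:testset-2"):
        await queue.put(job)  # "enqueue" = publish to the broker
    await queue.put(None)
    await queue.join()
    await w
    return results
```

With a real broker the producer and worker run in separate processes, which is what gives the distributed orchestration the bullet describes.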
Code Quality
- Extensive test suite covering unit, integration, and E2E scenarios with clear ARRANGE-ACT-ASSERT structure and environment-aware categorization
- Robust error handling with consistent HTTP status codes and descriptive messages across APIs
- Strong type safety enforced via Pydantic models and instrumentation decorators that enable tracing and observability
- Consistent, domain-aligned naming conventions across code and tests
- Limited use of static analysis or linting tooling, with no visible CI-enforced quality gates
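The ARRANGE-ACT-ASSERT structure mentioned above looks like this in practice. Both the `slugify` function under test and the test itself are toy illustrations (in the real suite this would be a pytest test), not code from Agenta:

```python
def slugify(name: str) -> str:
    """Toy function under test (illustrative, not from Agenta)."""
    return name.strip().lower().replace(" ", "-")


def test_slugify_normalizes_prompt_names() -> None:
    # ARRANGE: set up the input fixture
    raw_name = "  My Prompt Variant "

    # ACT: call the unit under test
    result = slugify(raw_name)

    # ASSERT: check the observable outcome
    assert result == "my-prompt-variant"
```

Keeping the three phases visually separated makes each test read as a small specification, which is what allows the clear categorization the suite is described as having.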
What Makes It Unique
- Innovative TaskIQ integration that manages webhook retry logic and delivery state entirely within the broker, eliminating database calls during retries
- Proprietary shared editor component that unifies text, JSON, YAML, HTML, and Markdown editing with format-aware controls in a single interface
- Deep Jotai state management enabling atomic, reactive updates to complex variant hierarchies without Redux-style boilerplate
- Dynamic variant deletion system that evaluates resource validity through entity-aware atoms to prevent accidental removal of active configurations
- Encrypted webhook secrets with context-aware retry logic that preserves payload integrity while minimizing I/O during failures
- Seamless real-time coupling between backend task workers and frontend playground state, keeping results in sync without ad-hoc polling loops
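The broker-held retry state described in the webhook bullets can be sketched as follows: retry metadata travels with the message itself, so a failed delivery is retried without touching the database. The retry loop below stands in for re-enqueueing on the broker, and all names are illustrative rather than Agenta's implementation:

```python
import time

MAX_ATTEMPTS = 3  # illustrative retry budget


def deliver(message: dict, send, backoff_s: float = 0.0) -> dict:
    """Attempt webhook delivery; on failure, bump the attempt counter
    carried on the message itself and retry until MAX_ATTEMPTS,
    with no database round trip in the retry path."""
    while message["attempt"] < MAX_ATTEMPTS:
        message["attempt"] += 1
        try:
            send(message["payload"])
            message["status"] = "delivered"
            return message
        except Exception:
            time.sleep(backoff_s)  # broker-side retry delay
    message["status"] = "dead-lettered"
    return message
```

For example, a receiver that fails twice and then succeeds ends up `delivered` on the third attempt, while a permanently failing endpoint is dead-lettered after exhausting the budget.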