Langfuse
Open source AI engineering platform for LLM observability, prompt management, evaluation, and debugging — self-host in minutes or use Langfuse Cloud.
Langfuse is an open source LLM engineering platform built for teams developing, monitoring, and debugging AI applications at scale. It provides end-to-end observability across complex, multi-step LLM workflows, capturing traces, spans, user sessions, and cost metrics. Ingestion is asynchronous and prompts are cached, so instrumentation is designed to add little or no latency to the application's request path.
The platform unifies several critical workflows that teams otherwise stitch together from separate tools: tracing LLM calls and agent actions, versioning and deploying prompts collaboratively, running evaluations using LLM-as-a-judge or human labeling, and managing datasets for continuous benchmarking. All of these are integrated into a single system with shared context, enabling faster iteration cycles.
Langfuse integrates natively with major frameworks and SDKs including OpenAI, LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, Mastra, Haystack, DSPy, and Instructor — with both automated and manual instrumentation options. It also supports OpenTelemetry as a standard ingestion path, making it compatible with any observability stack.
Deployment is flexible: teams can run Langfuse on a single machine with Docker Compose in under five minutes, scale to Kubernetes using the official Helm chart, or use managed Terraform templates for AWS, Azure, and GCP. The managed Langfuse Cloud option offers a generous free tier with no credit card required.
What You Get
- Full-stack LLM tracing - Captures every LLM call, retrieval step, embedding operation, and agent action as structured spans with parent-child relationships, latency, token usage, cost attribution, and user/session context — queryable via the UI or API.
- Collaborative prompt management - Centralized version control for prompts with environment-based labels (production, staging), server-side and client-side caching so that prompt fetches add negligible latency, and a diff view for collaborative iteration without any code deploys.
- Evaluation pipelines - Supports LLM-as-a-judge with configurable templates, code-based evaluators, manual human labeling via annotation queues, user feedback collection, and custom evaluation pipelines via the API — all tied to specific trace observations.
- Dataset and experiment management - Create labeled test sets from production traces, run structured experiments against datasets, compare evaluation scores across prompt versions and model configs, and track regression over time.
- Interactive LLM playground - Test and iterate on prompts and model configurations directly in the UI with real-time results, with a one-click path from a failing trace to the playground for rapid debugging.
- Multi-framework integrations - Native integrations with OpenAI SDK, LangChain, LlamaIndex, LiteLLM, Vercel AI SDK, Haystack, Mastra, and more — via callback handlers, drop-in SDK replacements, or OpenTelemetry for any custom stack.
- Comprehensive REST API and typed SDKs - Full OpenAPI spec, Postman collection, and typed Python and JS/TS SDKs for building custom LLMOps workflows and automated tooling on top of Langfuse data.
- Flexible deployment - Docker Compose for local or single-VM setup, Helm chart for Kubernetes production deployments, Terraform templates for AWS/Azure/GCP, or managed Langfuse Cloud with a free tier.
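The tracing bullet above describes a trace as a tree of spans that carry latency, token usage, and cost. A minimal Python sketch of that shape follows. The `Span` dataclass, its field names, and the sample values are hypothetical and chosen for illustration; this is not Langfuse's data model or SDK.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    name: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    children: List["Span"] = field(default_factory=list)


def total_cost(span: Span) -> float:
    """Sum cost over a span and all of its descendants."""
    return span.cost_usd + sum(total_cost(c) for c in span.children)


def total_tokens(span: Span) -> int:
    """Sum input and output tokens over the whole subtree."""
    own = span.input_tokens + span.output_tokens
    return own + sum(total_tokens(c) for c in span.children)


def flatten(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from flatten(child, depth + 1)


# A made-up trace: one root span with a retrieval step and two LLM calls.
trace = Span(
    name="answer-question",
    latency_ms=1200.0,
    children=[
        Span(name="retrieve-docs", latency_ms=150.0),
        Span(name="llm-call", latency_ms=900.0,
             input_tokens=800, output_tokens=200, cost_usd=0.004),
        Span(name="llm-summarize", latency_ms=100.0,
             input_tokens=300, output_tokens=50, cost_usd=0.001),
    ],
)

# Print the tree with indentation that reflects parent-child nesting.
for depth, span in flatten(trace):
    print("  " * depth + f"{span.name} ({span.latency_ms:.0f} ms)")
```

Rolling cost and token counts up the tree like this is what makes per-trace and per-user attribution possible once every span records its own usage.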
Common Use Cases
- Debugging a flaky AI agent - An ML engineer sees a production trace where an agent took a wrong decision branch; they inspect the full span tree including retrieval steps, intermediate LLM calls, and tool outputs, then jump directly to the playground to reproduce and fix the prompt.
- Running prompt A/B tests pre-deployment - A product team deploys two prompt versions with different system instructions to staging, runs both against a curated dataset, compares LLM-as-a-judge evaluation scores, and promotes the winning version to production without touching application code.
- Cost and latency monitoring at scale - A startup processing 50,000 LLM requests per day uses Langfuse dashboards to monitor per-user token spend, p95 latency by trace type, and identify which prompt templates are driving cost overruns.
- Building human-in-the-loop labeling workflows - A team uses annotation queues to route low-confidence LLM outputs to human reviewers, captures labels as scores attached to trace observations, and feeds the labeled data back into evaluation pipelines to retrain their LLM-as-a-judge evaluator.
- Continuous regression testing for LLM pipelines - A platform team maintains a golden dataset of 200 test cases, runs evaluation on every CI deployment, and fails the build when quality scores drop below a threshold — catching prompt regressions before they reach users.
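The regression-testing use case above boils down to a quality gate: score every item in a golden dataset and fail the build when the average drops below a threshold. A minimal sketch, assuming a hypothetical `run_gate` helper, a toy exact-match scorer, and a stubbed application (`fake_app`); none of these names come from Langfuse, and a real pipeline would fetch the dataset and record scores through its API.

```python
from statistics import mean
from typing import Callable, Dict, List


def exact_match(expected: str, actual: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_gate(dataset: List[Dict[str, str]],
             app: Callable[[str], str],
             scorer: Callable[[str, str], float],
             threshold: float) -> bool:
    """Score every dataset item and report whether the mean meets the threshold."""
    scores = [scorer(item["expected"], app(item["input"])) for item in dataset]
    average = mean(scores)
    print(f"mean score {average:.2f} (threshold {threshold:.2f})")
    return average >= threshold


# Hypothetical golden dataset and a stub standing in for the LLM application.
GOLDEN = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "opposite of hot?", "expected": "cold"},
    {"input": "largest planet?", "expected": "Jupiter"},
]

CANNED = {
    "capital of France?": "paris",
    "2 + 2?": "4",
    "opposite of hot?": "cold",
    "largest planet?": "Saturn",  # deliberate regression
}


def fake_app(question: str) -> str:
    return CANNED[question]


gate_passed = run_gate(GOLDEN, fake_app, exact_match, threshold=0.7)
```

In a CI job, the boolean would typically be turned into the process exit code (for example `sys.exit(0 if gate_passed else 1)`) so that a failed gate fails the build.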
Under The Hood
Architecture
Langfuse is structured as a TypeScript monorepo with three main parts: two deployable services (a Next.js web frontend and a Node.js background worker) and a shared package containing Prisma schemas, ClickHouse query builders, and queue definitions. The architecture is event-driven at its core: ingested telemetry flows through Redis-backed BullMQ queues into ClickHouse for analytics and PostgreSQL via Prisma for relational data, with S3-compatible object storage (MinIO in self-hosted setups) handling large event payloads. The worker manages a diverse set of named queues covering evaluation execution, batch exports, event propagation, data retention, and integrations, each with isolated processor logic and retry semantics. This separation of concerns allows the frontend and worker to scale independently, and keeps analytics workloads from blocking ingestion throughput. Dependency injection through TypeScript interfaces and environment-aware configuration allows the same codebase to run anywhere from a single Docker Compose instance to a multi-node Kubernetes cluster.
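The queue-based ingestion described above can be illustrated with a small stand-in. The sketch below uses Python's standard-library `queue` and `threading` modules rather than BullMQ, and the `Job` class, `worker` function, and retry limit are assumptions made for the example. It only shows the general pattern: the caller enqueues and returns immediately, while a background worker processes jobs, retries failures, and parks jobs that keep failing.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    payload: dict
    attempts: int = 0


def worker(jobs: queue.Queue, handle: Callable[[dict], None],
           dead_letter: List[Job], max_attempts: int = 3) -> None:
    """Process jobs until a None sentinel arrives, retrying failed jobs."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            jobs.task_done()
            return
        try:
            handle(job.payload)
        except Exception:
            job.attempts += 1
            if job.attempts < max_attempts:
                jobs.put(job)  # re-enqueue for another attempt
            else:
                dead_letter.append(job)  # give up after max_attempts
        finally:
            jobs.task_done()


stored: List[str] = []
failures_left = {"evt-2": 1}  # evt-2 fails once, then succeeds


def store(event: dict) -> None:
    """Stand-in for writing an event to the analytics store."""
    if event.get("poison"):
        raise ValueError("malformed event")
    if failures_left.get(event["id"], 0) > 0:
        failures_left[event["id"]] -= 1
        raise RuntimeError("transient failure")
    stored.append(event["id"])


jobs: queue.Queue = queue.Queue()
dead: List[Job] = []
thread = threading.Thread(target=worker, args=(jobs, store, dead), daemon=True)
thread.start()


def ingest(event: dict) -> None:
    """Request-path side: enqueue and return without waiting for processing."""
    jobs.put(Job(event))


for event in [{"id": "evt-1"}, {"id": "evt-2"}, {"id": "evt-3", "poison": True}]:
    ingest(event)

jobs.join()     # wait until every job has been processed or dead-lettered
jobs.put(None)  # stop the worker
thread.join()
print("stored:", stored, "dead-lettered:", [j.payload["id"] for j in dead])
```

Because `ingest` only appends to a queue, slow or failing downstream writes affect the worker and not the caller, which is the property the architecture relies on.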
Tech Stack
The web application is built on Next.js with the App Router, TypeScript throughout, NextAuth for authentication (with extensive patching for custom SSO and multi-tenant OIDC), and tRPC for type-safe internal APIs. PostgreSQL is accessed via Prisma ORM and serves as the transactional datastore for users, projects, prompts, and evaluation configurations. ClickHouse is the analytical backbone, handling high-cardinality trace data, observation metrics, and dashboard aggregations with bloom filter indexes and query caching for performance at scale. Redis powers the BullMQ job queues and application-level caching for prompt content. Object storage via S3-compatible APIs (AWS S3, MinIO, Azure Blob, OCI) handles event payloads and media. The monorepo is orchestrated with Turborepo, tested with Vitest for unit/integration tests and Playwright for end-to-end tests, and deployed as multi-stage Docker images.
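Prompt caching, mentioned above, can be sketched as a time-to-live cache in front of a slower store, keyed by prompt name and environment label. The `PromptCache` class, the `fetch_from_store` function, and the in-memory `STORE` dictionary below are hypothetical stand-ins for illustration; they are not Langfuse's implementation or its Redis schema.

```python
import time
from typing import Callable, Dict, List, Tuple


class PromptCache:
    """Hypothetical TTL cache keyed by (prompt name, label)."""

    def __init__(self, fetch: Callable[[str, str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def get(self, name: str, label: str = "production") -> str:
        key = (name, label)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: no round trip to the store
        text = self._fetch(name, label)  # miss or expired: refetch
        self._entries[key] = (now, text)
        return text


# Stand-in for the backing store, with a record of every fetch made.
STORE = {
    ("greeting", "production"): "Hello, {{name}}!",
    ("greeting", "staging"): "Hi {{name}}, welcome back!",
}
calls: List[Tuple[str, str]] = []


def fetch_from_store(name: str, label: str) -> str:
    calls.append((name, label))
    return STORE[(name, label)]


now = [0.0]  # fake clock so expiry can be shown without sleeping
cache = PromptCache(fetch_from_store, ttl_seconds=60.0, clock=lambda: now[0])

cache.get("greeting")  # miss: fetches from the store
cache.get("greeting")  # hit: served from the cache
now[0] = 61.0
cache.get("greeting")  # entry expired: fetches again
print("store fetches:", len(calls))
```

The TTL bounds how long a newly promoted prompt version can take to reach running applications, which is the trade-off any such cache makes against fetch latency.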
Code Quality
Langfuse maintains extensive test coverage across 142 test files spanning unit tests, integration tests that exercise real database connections, and Playwright end-to-end tests covering authentication flows and core UI interactions. Type safety is enforced pervasively through Zod schema validation at API boundaries, Prisma-generated types for database access, and strict TypeScript configuration. Error handling is explicit and domain-specific: ingestion errors, evaluation failures, and queue processing errors each have typed error classes with structured logging. The codebase has a custom ESLint plugin enforcing project-specific rules including Tailwind overflow conventions and prohibitions on in-source test code. CI runs lint, typecheck, and test suites on every pull request via GitHub Actions, with Snyk for security scanning.
What Makes It Unique
Langfuse’s most distinctive capability is its tight integration between observability and the evaluation-improvement loop. Rather than treating tracing and evaluation as separate systems, Langfuse allows engineers to click from any trace observation directly into an evaluation template or dataset item, creating a feedback loop that is genuinely embedded in the debugging workflow. The platform also ships with built-in statistical evaluation tools including LLM-as-a-judge with configurable scoring templates, code-based evaluators that can execute arbitrary Python, and inter-annotator agreement metrics for human labeling workflows. Its OpenTelemetry-native ingestion path means teams can instrument once and route to Langfuse without vendor lock-in. The dual storage architecture (PostgreSQL for relational entities, ClickHouse for analytics) is carefully designed to handle ingestion rates that would overwhelm a purely relational system while preserving the ability to write complex queries against historical data.
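The text above mentions inter-annotator agreement metrics without naming one. As an illustration only, the sketch below computes Cohen's kappa, a common agreement measure for two annotators; choosing this particular metric is an assumption, and the labels are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two hypothetical reviewers labeling the same eight LLM outputs.
reviewer_1 = ["good", "good", "bad", "bad", "good", "bad", "good", "good"]
reviewer_2 = ["good", "bad", "bad", "bad", "good", "bad", "good", "good"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Here the reviewers agree on 7 of 8 items (0.875) while chance agreement is 0.5, giving a kappa of 0.75. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.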
Self-Hosting
Langfuse uses a split-license model. The core platform — everything outside the ee/, web/src/ee/, and worker/src/ee/ directories — is licensed under the MIT Expat license, which means you can use it commercially, modify it, redistribute it, and self-host it without any royalty or license fee obligations. The enterprise edition features, which live under those ee/ paths, are licensed under a separate proprietary license that reserves those capabilities for paid Enterprise Edition customers and Langfuse Cloud subscribers. In practice, the MIT-licensed core is fully functional for most teams and includes tracing, prompt management, evaluations, datasets, playground, and the full API surface.
Running Langfuse yourself is non-trivial at production scale. The minimal setup requires PostgreSQL, ClickHouse, Redis, and an S3-compatible object store: four backing services, in addition to the Langfuse web and worker containers, that each require their own capacity planning, backup strategy, and upgrade management. Docker Compose handles all of this on a single machine for development and small deployments, but production-grade setups typically mean Kubernetes with the official Helm chart, which introduces its own operational complexity. Teams need to manage ClickHouse schema migrations alongside application upgrades, configure Redis clustering for high availability, and handle PostgreSQL backups independently. The Langfuse team releases new versions multiple times per week (v3.192 as of June 2026), so patch management is an active ongoing responsibility.
Compared to Langfuse Cloud, a self-hosted deployment gives up managed infrastructure, automated upgrades, SLA-backed uptime guarantees, and first-party support channels. The Cloud offering includes a generous free tier, and paid tiers add data retention controls, SSO configuration, and access to enterprise features gated behind the EE license. For teams with strict data residency requirements, self-hosting provides complete control over where telemetry data lands; for teams that want observability without the infrastructure burden, Langfuse Cloud is the lower-friction path.