Open-source agentic observability that ingests OpenTelemetry signals, groups them into incidents, and deploys AI agents to investigate and fix your production bugs automatically.
Superlog is an open-core observability workspace built for teams that want more than dashboards: it aims to fix your bugs while you sleep. It ingests traces, logs, and metrics over OpenTelemetry, fingerprints and groups noisy signals into coherent incidents, then dispatches AI agent runners that clone your GitHub repository, root-cause the failure, and open pull requests with proposed fixes.
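To make the fingerprint-and-group step concrete, here is a minimal sketch of one common way to do it: strip volatile tokens from the message, build a key from the stable parts, and bucket signals by that key. The Signal shape, the normalization rules, and the function names are all illustrative assumptions, not Superlog's actual grouping logic.

```typescript
// Hypothetical shape of an ingested error signal; field names are assumptions.
interface Signal {
  service: string;
  errorType: string;
  message: string;
}

// Replace volatile tokens (UUIDs, hex literals, numbers) with placeholders so
// that repeated occurrences of the same failure normalize to the same text.
function normalizeMessage(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\b0x[0-9a-f]+\b/gi, "<hex>")
    .replace(/\d+/g, "<n>");
}

// The fingerprint is a plain string key built from the stable fields.
function fingerprint(signal: Signal): string {
  return [signal.service, signal.errorType, normalizeMessage(signal.message)].join("|");
}

// Bucket signals by fingerprint; each bucket is a candidate incident.
function groupIntoIncidents(signals: Signal[]): Map<string, Signal[]> {
  const incidents = new Map<string, Signal[]>();
  for (const signal of signals) {
    const key = fingerprint(signal);
    const bucket = incidents.get(key) ?? [];
    bucket.push(signal);
    incidents.set(key, bucket);
  }
  return incidents;
}
```

A real grouper would likely also weigh stack frames, time windows, and span attributes; this sketch only shows why two noisy messages that differ in an order number can land in the same incident.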
The community edition ships a complete self-hosted stack: a Vite/React web application for incident investigation, an HTTP API with full multi-tenancy, an OTLP intake proxy that handles both standard OpenTelemetry protocol and AWS Kinesis Firehose delivery streams, and a background worker that orchestrates agent lifecycles, incident grouping decisions, and auto-recovery proposals. Every layer is backed by Postgres for relational state and ClickHouse for telemetry queries at scale.
Agent runs are first-class stateful objects with a defined lifecycle — queued, repo_discovery, running, awaiting_human, complete — and can be resumed interactively after they finish, letting engineers talk to an investigation directly in Slack or the web UI. The pluggable agent runner backend means the same incident can be routed to different investigation runtimes, and memory between runs accumulates project-level context over time.
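The lifecycle described above can be sketched as a small state machine. The five state names come from the text; the transition table, the AgentRun shape, and the transition function are assumptions made for illustration, in particular the edge from complete back to running that models an interactive resume.

```typescript
type AgentRunState =
  | "queued"
  | "repo_discovery"
  | "running"
  | "awaiting_human"
  | "complete";

// Assumed transition table; the real set of legal edges may differ.
const allowedTransitions: Record<AgentRunState, readonly AgentRunState[]> = {
  queued: ["repo_discovery"],
  repo_discovery: ["running"],
  running: ["awaiting_human", "complete"],
  awaiting_human: ["running"],
  // A finished run can be revived when a human sends a follow-up message.
  complete: ["running"],
};

// Hypothetical minimal run record.
interface AgentRun {
  id: string;
  state: AgentRunState;
}

// Pure transition: returns a new run object or throws on an illegal edge.
function transition(run: AgentRun, next: AgentRunState): AgentRun {
  if (!allowedTransitions[run.state].includes(next)) {
    throw new Error(`illegal transition ${run.state} -> ${next}`);
  }
  return { ...run, state: next };
}
```

Keeping the table as data means the set of legal edges can be reviewed and tested in one place, without reading through orchestration code.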
Superlog is a Y Combinator P26 company offering a free self-hosted community edition as well as Superlog Cloud with a free tier, pay-as-you-go metering, and bundled credit packs for investigation runs. The codebase is Apache 2.0 licensed, written entirely in TypeScript, and maintained with active daily commits.
Architecture
Superlog uses a layered monorepo structure where each application has a single responsibility and communicates through well-defined interfaces rather than shared mutable state. The OTLP proxy handles ingest authentication and tenant routing, forwarding stamped signals to the OpenTelemetry Collector; the API owns all relational reads and writes through a repository pattern behind a Drizzle ORM abstraction; and the worker exclusively drives background state machines (incident grouping, agent run lifecycle ticking, auto-recovery sweeps, and digest generation), polling the database rather than consuming events directly. The agent runner backend is defined as a pure interface type, allowing the community managed runner and any custom runtime to slot in without touching orchestration logic. Domain modules in the worker are intentionally pure: grouping decisions, autorecovery proposal evaluation, and agent run state assertions are all functions from plain objects to plain objects, making them independently testable without touching I/O.
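As an illustration of what a runner backend defined as a pure interface type can look like, the sketch below declares a hypothetical AgentRunnerBackend shape and a selection helper that depends only on that shape. None of these names or fields are taken from the Superlog codebase; they are assumptions chosen to show the pattern.

```typescript
// Hypothetical request/result shapes for an investigation.
interface InvestigationRequest {
  incidentId: string;
  repoUrl: string;
}

interface InvestigationResult {
  summary: string;
  pullRequestUrl?: string;
}

// The backend is only a type: orchestration depends on this shape,
// never on a concrete runtime.
interface AgentRunnerBackend {
  readonly name: string;
  investigate(request: InvestigationRequest): Promise<InvestigationResult>;
}

// A trivial stand-in runtime, useful in tests or local development.
const echoRunner: AgentRunnerBackend = {
  name: "echo",
  async investigate(request) {
    return { summary: `no-op investigation of ${request.incidentId}` };
  },
};

// Pick the preferred backend if registered, otherwise fall back.
function selectRunner(
  backends: readonly AgentRunnerBackend[],
  preferred: string,
  fallback: string,
): AgentRunnerBackend {
  const byName = (name: string) => backends.find((b) => b.name === name);
  const runner = byName(preferred) ?? byName(fallback);
  if (!runner) {
    throw new Error(`no agent runner named "${preferred}" or "${fallback}"`);
  }
  return runner;
}
```

Because selectRunner only sees the interface, adding a custom runtime in this sketch means registering one more object; the selection and orchestration code stays unchanged.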
Tech Stack
The entire codebase is TypeScript 5.7 running on Node.js 20+, managed as a pnpm workspace and built with Turborepo for incremental task caching. The web frontend is Vite with React, using a REST API client generated from the same TypeScript types. The API is a custom HTTP server backed by Drizzle ORM over Postgres (via the postgres driver), with ClickHouse handling high-cardinality telemetry queries for spans, logs, and metric points. Biome replaces ESLint and Prettier for linting and formatting. The OTLP proxy speaks both the OpenTelemetry protobuf protocol and the AWS Kinesis Firehose HTTP endpoint spec. AI investigation runs use the Anthropic SDK, with the MCP SDK providing the tool interface that agent runners expose. Billing metering is handled through Autumn (config-as-code Stripe integration).
Code Quality
Test coverage is present but uneven — the CONTRIBUTING.md honestly states approximately 44% overall coverage, with the worker’s core domain modules (agent run lifecycle, grouping domain, autorecovery policy, incident state machines) having dedicated test files and the API surface having more sparse coverage. Error handling is explicit and typed throughout: agent run failures have a typed union AgentRunFailureReason with a companion agentRunFailureCategory function that classifies failures as agent, deliverable, or infra problems. Domain files are pure functions with no I/O, and state transition functions use asserting preconditions (assertAgentRunSourceState) that throw on illegal transitions. The codebase is actively linted with Biome and type-checked with tsc --noEmit across all packages.
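The typed failure classification and asserting preconditions mentioned above might look roughly like the following. The names AgentRunFailureReason, agentRunFailureCategory, and assertAgentRunSourceState and the three categories come from the text; the union members, the signatures, and the error messages are assumptions made for this sketch.

```typescript
// Illustrative union members; the real reasons are not listed in the source.
type AgentRunFailureReason =
  | "agent_timeout"
  | "agent_gave_up"
  | "pr_creation_failed"
  | "repo_clone_failed"
  | "provider_unavailable";

type AgentRunFailureCategory = "agent" | "deliverable" | "infra";

// Exhaustive switch: adding a new reason without a case fails type-checking
// at the `never` assignment below.
function agentRunFailureCategory(reason: AgentRunFailureReason): AgentRunFailureCategory {
  switch (reason) {
    case "agent_timeout":
    case "agent_gave_up":
      return "agent";
    case "pr_creation_failed":
      return "deliverable";
    case "repo_clone_failed":
    case "provider_unavailable":
      return "infra";
    default: {
      const unreachable: never = reason;
      throw new Error(`unhandled failure reason: ${String(unreachable)}`);
    }
  }
}

// Assumed signature: throw unless the run is in one of the allowed source states.
function assertAgentRunSourceState(
  current: string,
  allowed: readonly string[],
  action: string,
): void {
  if (!allowed.includes(current)) {
    throw new Error(`cannot ${action} a run in state "${current}"`);
  }
}
```

Both functions take and return plain values, which is what lets domain code like this be unit-tested without a database or network.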
What Makes It Unique
The distinctive capability is the interactive agent run: once an AI investigation completes and produces a root-cause summary, the run can be revived through a human message (via Slack or the web UI) and resume its durable provider session in place, turning a finished investigation into a conversation. The auto-recovery worker adds a second autonomous loop that watches for incidents that appear resolved and submits confidence-scored proposals to close them, with a configurable minimum confidence gate that keeps false positives in check. The pluggable agent runner interface combined with the skills system means teams can extend investigation behaviour by composing typed tools rather than forking the core orchestration engine.
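The minimum confidence gate on auto-recovery proposals can be pictured as a small pure function. This is a sketch under assumptions: the RecoveryProposal and AutoRecoveryPolicy shapes, the 0 to 1 confidence scale, and the evaluateProposal name are hypothetical and not drawn from the Superlog source.

```typescript
// Hypothetical proposal emitted by the auto-recovery sweep.
interface RecoveryProposal {
  incidentId: string;
  confidence: number; // assumed to be on a 0..1 scale
  reason: string;
}

// Hypothetical policy holding the configurable gate.
interface AutoRecoveryPolicy {
  minConfidence: number;
}

interface ProposalDecision {
  incidentId: string;
  action: "close" | "keep_open";
}

// Pure gate: close only when the score is a valid number at or above the
// configured minimum; anything malformed keeps the incident open.
function evaluateProposal(
  proposal: RecoveryProposal,
  policy: AutoRecoveryPolicy,
): ProposalDecision {
  const valid =
    Number.isFinite(proposal.confidence) &&
    proposal.confidence >= 0 &&
    proposal.confidence <= 1;
  const action =
    valid && proposal.confidence >= policy.minConfidence ? "close" : "keep_open";
  return { incidentId: proposal.incidentId, action };
}
```

Treating out-of-range or non-numeric scores as keep_open is a deliberate choice in this sketch: when the gate cannot trust the input, leaving an incident open is the cheaper mistake.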
Superlog is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, and distribution. There are no copyleft conditions that would require you to open-source your own application code: you can run Superlog inside a private internal deployment, build proprietary integrations, and ship it inside a commercial product. If you redistribute it, the main requirements are to include a copy of the license, preserve the existing copyright and attribution notices (including any NOTICE file), and mark the files you have modified.
Running Superlog yourself requires Docker Compose (or equivalent container orchestration), a Postgres instance, and a ClickHouse cluster for telemetry data. The default development setup brings all of these up locally, but production deployments need careful thought about persistence, ClickHouse replication for query availability, and Postgres backup schedules. The worker process handles all AI agent invocations against the Anthropic API, so you will need API credentials and a plan for managing LLM spend — the community edition has no built-in cost cap on investigation runs. Security hardening, TLS termination, and multi-region failover are entirely the operator’s responsibility.
Superlog Cloud, the hosted tier, adds a free allowance of telemetry signals and investigation credits, pay-as-you-go metering beyond those limits, and bundled power packs ($150 and $300 per period) that taper the per-investigation credit rate down. Self-hosters forgo managed upgrades, the Superlog-operated ClickHouse infrastructure, automatic database migrations on new releases, and any SLA or priority support. The cloud tier also receives new investigation runtime features and skill updates before the community edition, since managed runtimes can be updated server-side. Teams that choose self-hosting gain full data sovereignty and the ability to run investigation agents against code repositories that cannot be accessed from a cloud-hosted service.