superlog

Rating: 5 (984 reviews)

Open-source agentic observability that ingests OpenTelemetry signals, groups them into incidents, and deploys AI agents to investigate and fix your production bugs automatically.

984 stars

73 forks

Apache License 2.0

TypeScript

Superlog is an open-core observability workspace built for teams that want more than dashboards: it aims to fix your bugs while you sleep. It ingests traces, logs, and metrics over OpenTelemetry, fingerprints and groups noisy signals into coherent incidents, then dispatches AI agent runners that clone your GitHub repository, root-cause the failure, and open pull requests with proposed fixes.

The community edition ships a complete self-hosted stack: a Vite/React web application for incident investigation, an HTTP API with full multi-tenancy, an OTLP intake proxy that handles both standard OpenTelemetry protocol and AWS Kinesis Firehose delivery streams, and a background worker that orchestrates agent lifecycles, incident grouping decisions, and auto-recovery proposals. Every layer is backed by Postgres for relational state and ClickHouse for telemetry queries at scale.

Agent runs are first-class stateful objects with a defined lifecycle — queued, repo_discovery, running, awaiting_human, complete — and can be resumed interactively after they finish, letting engineers talk to an investigation directly in Slack or the web UI. The pluggable agent runner backend means the same incident can be routed to different investigation runtimes, and memory between runs accumulates project-level context over time.
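The lifecycle described above can be pictured as a small state machine. The following is a minimal sketch, not Superlog's actual code: the five state names come from the text, while the transition table, `canTransition`, and `transition` are hypothetical names and assumptions made for illustration (in particular, the edge from complete back to running is a guess at how interactive resume might be modelled).

```typescript
// State names taken from the text; everything else in this block is assumed.
type AgentRunState =
  | "queued"
  | "repo_discovery"
  | "running"
  | "awaiting_human"
  | "complete";

// Hypothetical transition table: each state maps to the states it may move to.
const ALLOWED_TRANSITIONS: Record<AgentRunState, readonly AgentRunState[]> = {
  queued: ["repo_discovery"],
  repo_discovery: ["running"],
  running: ["awaiting_human", "complete"],
  awaiting_human: ["running", "complete"],
  // Assumption: a human message can resume a finished run.
  complete: ["running"],
};

// Pure check: no I/O, just a lookup in the table.
function canTransition(from: AgentRunState, to: AgentRunState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: AgentRunState, to: AgentRunState): AgentRunState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal agent run transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as plain data means the legal moves can be reviewed and tested without starting a worker or a database.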

Superlog is a Y Combinator P26 company offering a free self-hosted community edition as well as Superlog Cloud with a free tier, pay-as-you-go metering, and bundled credit packs for investigation runs. The codebase is Apache 2.0 licensed, written entirely in TypeScript, and maintained with active daily commits.

What You Get

OTLP intake proxy that accepts traces, logs, and metrics from any OpenTelemetry SDK and supports AWS Kinesis Firehose HTTP endpoint delivery for CloudWatch Metric Streams
Incident grouping agent that uses fingerprint matching and AI reasoning to cluster incoming error signals with existing open incidents rather than flooding you with duplicate alerts
Agent run lifecycle engine with states queued → repo_discovery → running → awaiting_human → complete, full event history, and interactive resume after terminal runs
Auto-recovery worker that periodically inspects resolved-looking incidents and submits confidence-scored resolution proposals to close them automatically
ClickHouse-backed telemetry queries for high-cardinality span and log data, with Postgres holding relational incident, project, and agent state
Pluggable agent runner backend interface so community, managed, and custom investigation runtimes can be swapped without changing the orchestration core
GitHub integration for repository discovery, branch management, PR creation, and source-map symbolication of minified JavaScript stack traces
Slack integration for incident notifications, thread pinning, and human-in-the-loop resume interactions directly from Slack messages
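
To make the fingerprint half of incident grouping concrete, here is a minimal sketch under stated assumptions. It is not Superlog's algorithm: the `ErrorSignal` and `OpenIncident` shapes, the normalization rules, and the use of a 32-bit FNV-1a hash over UTF-16 code units are all illustrative choices, and the AI reasoning step mentioned above is left out entirely.

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
interface ErrorSignal {
  service: string;
  exceptionType: string;
  message: string;
}

interface OpenIncident {
  id: string;
  fingerprint: string;
}

type GroupingDecision =
  | { kind: "attach"; incidentId: string }
  | { kind: "open_new"; fingerprint: string };

// Replace volatile fragments (UUIDs, hex addresses, numbers) with placeholders
// so repeated occurrences of the same bug normalize to the same text.
function normalizeMessage(message: string): string {
  return message
    .toLowerCase()
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/g, "<uuid>")
    .replace(/0x[0-9a-f]+/g, "<hex>")
    .replace(/\d+/g, "<n>")
    .trim();
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

function fingerprint(signal: ErrorSignal): string {
  return fnv1a([signal.service, signal.exceptionType, normalizeMessage(signal.message)].join("|"));
}

// Pure decision: attach to an open incident with the same fingerprint, else open a new one.
function groupSignal(signal: ErrorSignal, open: readonly OpenIncident[]): GroupingDecision {
  const fp = fingerprint(signal);
  for (const incident of open) {
    if (incident.fingerprint === fp) {
      return { kind: "attach", incidentId: incident.id };
    }
  }
  return { kind: "open_new", fingerprint: fp };
}
```

A 32-bit hash is enough for a sketch but can collide; a production system would likely use a wider digest and compare the normalized fields as well.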

Common Use Cases

Automatically investigate recurring backend exceptions by connecting your Node.js, Python, or Go service’s OpenTelemetry SDK to Superlog’s OTLP proxy and letting agent runs root-cause stack traces against your source code
Consolidate CloudWatch Metric Streams and log subscription filters from multiple AWS accounts into unified incidents using the Kinesis Firehose intake path and cloud connections
Set up a nightly auto-recovery pass that checks open incidents whose error rate has dropped and proposes resolution with an AI confidence score, reducing stale-incident noise
Give engineers an interactive investigation experience by threading a resolved agent run back into a Slack conversation and letting them ask follow-up questions about the diagnosis
Run Superlog locally or on a single VM with Docker Compose for small teams that need production observability without sending data to a third-party SaaS
Extend Superlog’s investigation capabilities by writing custom skills using the npx skills workflow, adding domain-specific tools your AI agents can invoke during root-cause analysis

Under The Hood

Architecture

Superlog uses a layered monorepo structure where each application has a single responsibility and communicates through well-defined interfaces rather than shared mutable state. The OTLP proxy handles ingest authentication and tenant routing, forwarding stamped signals to the OpenTelemetry Collector. The API owns all relational reads and writes through a repository pattern behind a Drizzle ORM abstraction. The worker exclusively drives background state machines (incident grouping, agent run lifecycle ticking, auto-recovery sweeps, and digest generation), polling the database rather than consuming events directly. The agent runner backend is defined as a pure interface type, allowing the community runner, the managed runner, and any custom runtime to slot in without touching orchestration logic. Domain modules in the worker are intentionally pure: grouping decisions, auto-recovery proposal evaluation, and agent run state assertions are all functions from plain objects to plain objects, which makes them independently testable without touching I/O.
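
A pluggable runner backend of the kind described here might look roughly like the sketch below. This is an illustration only: `AgentRunnerBackend`, `InvestigationRequest`, `InvestigationResult`, `selectRunner`, and `runInvestigation` are hypothetical names, and the real interface in the Superlog codebase may differ in shape and naming.

```typescript
// Hypothetical request/result shapes for an investigation.
interface InvestigationRequest {
  incidentId: string;
  repoUrl: string;
  summary: string;
}

interface InvestigationResult {
  rootCause: string;
  pullRequestUrl?: string;
}

// The orchestration core would depend only on this interface, never on a concrete runtime.
interface AgentRunnerBackend {
  readonly name: string;
  investigate(request: InvestigationRequest): Promise<InvestigationResult>;
}

// A stub backend standing in for a real investigation runtime.
const stubRunner: AgentRunnerBackend = {
  name: "community-stub",
  async investigate(request) {
    return { rootCause: `stub analysis for ${request.incidentId}` };
  },
};

// Look up a backend by name in a plain registry object; throw if it is not registered.
function selectRunner(
  registry: Record<string, AgentRunnerBackend>,
  name: string,
): AgentRunnerBackend {
  const backend: AgentRunnerBackend | undefined =
    Object.prototype.hasOwnProperty.call(registry, name) ? registry[name] : undefined;
  if (!backend) {
    throw new Error(`no agent runner registered under "${name}"`);
  }
  return backend;
}

// Orchestration code is written once against the interface.
async function runInvestigation(
  backend: AgentRunnerBackend,
  request: InvestigationRequest,
): Promise<InvestigationResult> {
  return backend.investigate(request);
}
```

Swapping runtimes then amounts to registering another object that satisfies the interface; `runInvestigation` does not change.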

Tech Stack

The entire codebase is TypeScript 5.7 running on Node.js 20+, managed as a pnpm workspace and built with Turborepo for incremental task caching. The web frontend is Vite with React, using a REST API client generated from the same TypeScript types. The API is a custom HTTP server backed by Drizzle ORM over Postgres (via the postgres driver), with ClickHouse handling high-cardinality telemetry queries for spans, logs, and metric points. Biome replaces ESLint and Prettier for linting and formatting. The OTLP proxy speaks both the OpenTelemetry protobuf protocol and the AWS Kinesis Firehose HTTP endpoint spec. AI investigation runs use the Anthropic SDK, with the MCP SDK providing the tool interface that agent runners expose. Billing metering is handled through Autumn (config-as-code Stripe integration).

Code Quality

Test coverage is present but uneven. CONTRIBUTING.md reports approximately 44% overall coverage: the worker's core domain modules (agent run lifecycle, grouping domain, autorecovery policy, incident state machines) have dedicated test files, while the API surface is covered more sparsely. Error handling is explicit and typed throughout: agent run failures have a typed union AgentRunFailureReason with a companion agentRunFailureCategory function that classifies failures as agent, deliverable, or infra problems. Domain files are pure functions with no I/O, and state transition functions use asserting preconditions (assertAgentRunSourceState) that throw on illegal transitions. The codebase is linted with Biome and type-checked with tsc --noEmit across all packages.
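
The typed failure classification can be sketched as follows. Only the names AgentRunFailureReason and agentRunFailureCategory and the three categories (agent, deliverable, infra) come from the text; the individual reason strings and their mapping are invented here for illustration and should not be read as the project's actual union.

```typescript
// Hypothetical members: the real union in the codebase is not shown in this document.
type AgentRunFailureReason =
  | "model_refused"
  | "max_turns_exceeded"
  | "pr_creation_failed"
  | "repo_clone_failed"
  | "provider_timeout";

// The three categories named in the text.
type AgentRunFailureCategory = "agent" | "deliverable" | "infra";

// Exhaustive switch: adding a new reason without a case becomes a compile error,
// because the `never` assignment in the default branch stops type-checking.
function agentRunFailureCategory(reason: AgentRunFailureReason): AgentRunFailureCategory {
  switch (reason) {
    case "model_refused":
    case "max_turns_exceeded":
      return "agent";
    case "pr_creation_failed":
      return "deliverable";
    case "repo_clone_failed":
    case "provider_timeout":
      return "infra";
    default: {
      const unreachable: never = reason;
      throw new Error(`unhandled failure reason: ${String(unreachable)}`);
    }
  }
}
```

The design benefit of a union plus an exhaustive classifier is that tsc --noEmit catches a missing mapping before any test runs.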

What Makes It Unique

The distinctive capability is the interactive agent run: once an AI investigation completes and produces a root-cause summary, the run can be revived through a human message (via Slack or the web UI) and resume its durable provider session in place, turning a finished investigation into a conversation. The auto-recovery worker adds a second autonomous loop that watches for incidents that appear resolved and submits confidence-scored proposals to close them, with a configurable minimum confidence gate that keeps false positives in check. The pluggable agent runner interface combined with the skills system means teams can extend investigation behaviour by composing typed tools rather than forking the core orchestration engine.
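
The minimum confidence gate on auto-recovery proposals could be expressed as a pure function along these lines. This is a sketch under assumptions: `ResolutionProposal`, `ProposalDecision`, and `evaluateProposal` are hypothetical names, and the 0 to 1 confidence scale is assumed rather than documented.

```typescript
// Hypothetical proposal shape; confidence is assumed to lie in [0, 1].
interface ResolutionProposal {
  incidentId: string;
  confidence: number;
  rationale: string;
}

type ProposalDecision =
  | { action: "auto_resolve"; incidentId: string }
  | { action: "keep_open"; incidentId: string; reason: string };

// Reject values outside the assumed scale (NaN fails the comparison too).
function assertUnitInterval(value: number, label: string): void {
  if (!(value >= 0 && value <= 1)) {
    throw new RangeError(`${label} must be between 0 and 1, got ${value}`);
  }
}

// Pure gate: resolve only when the proposal meets the configured minimum confidence.
function evaluateProposal(
  proposal: ResolutionProposal,
  minConfidence: number,
): ProposalDecision {
  assertUnitInterval(proposal.confidence, "confidence");
  assertUnitInterval(minConfidence, "minConfidence");
  if (proposal.confidence >= minConfidence) {
    return { action: "auto_resolve", incidentId: proposal.incidentId };
  }
  return {
    action: "keep_open",
    incidentId: proposal.incidentId,
    reason: `confidence ${proposal.confidence} is below the minimum ${minConfidence}`,
  };
}
```

Raising `minConfidence` trades fewer false closures for more incidents left open for a human to resolve.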

Self-Hosting

Superlog is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, and distribution. There are no copyleft conditions that would require you to open-source your own application code: you can run Superlog inside a private internal deployment, build proprietary integrations, and ship it inside a commercial product. The obligations apply when you redistribute it: include a copy of the Apache 2.0 license, retain the existing copyright and attribution notices (including any NOTICE file), and mark any files you modified as changed.

Running Superlog yourself requires Docker Compose (or equivalent container orchestration), a Postgres instance, and a ClickHouse cluster for telemetry data. The default development setup brings all of these up locally, but production deployments need careful thought about persistence, ClickHouse replication for query availability, and Postgres backup schedules. The worker process handles all AI agent invocations against the Anthropic API, so you will need API credentials and a plan for managing LLM spend — the community edition has no built-in cost cap on investigation runs. Security hardening, TLS termination, and multi-region failover are entirely the operator’s responsibility.

Superlog Cloud, the hosted tier, adds a free allowance of telemetry signals and investigation credits, pay-as-you-go metering beyond those limits, and bundled power packs ($150 and $300 per period) that lower the per-investigation credit rate. Self-hosters forgo managed upgrades, the Superlog-operated ClickHouse infrastructure, automatic database migrations on new releases, and any SLA or priority support. The cloud tier also receives new investigation runtime features and skill updates before the community edition, since managed runtimes can be updated server-side. Teams that choose self-hosting gain full data sovereignty and the ability to run investigation agents against code repositories that cannot be accessed from a cloud-hosted service.

Related Apps

Rust

95%

MIT

claw-code

AI Agents · AI Code Assistants

194,567

A Rust-built CLI agent harness for Claude AI with persistent sessions, MCP tool integration, plugin hooks, and multi-provider support — designed to run autonomous coding workflows without human babysitting.