OneUptime

Name: OneUptime
Rating: 5 (7252 reviews)

The complete open-source observability platform that replaces PagerDuty, Datadog, Sentry, and StatusPage with a single self-hostable system.

7.3K stars

410 forks

Apache License 2.0

TypeScript


OneUptime is a comprehensive open-source observability platform that unifies uptime monitoring, status pages, incident management, on-call scheduling, logs, metrics, traces, and AI-driven error remediation into a single self-hostable system. It is purpose-built for DevOps and SRE teams who want to eliminate tool sprawl and maintain full control over their observability data.

The platform ingests telemetry from OpenTelemetry-compatible agents, native collectors for Kubernetes, Docker, Proxmox, Ceph, and Podman, and custom probes deployed globally. It stores metrics and traces in ClickHouse for high-throughput querying, while PostgreSQL backs all operational data including incidents, alerts, and on-call schedules.

What makes OneUptime distinctive is its integrated AI Copilot: an autonomous agent that monitors production services, detects exceptions and anomalies, clones the relevant repository, generates code fixes, and opens pull requests automatically. It also ships a native Model Context Protocol (MCP) server, letting LLM agents interact with monitoring and incident operations through a structured tool interface.

Deployed via Helm on Kubernetes or Docker Compose for single-node setups, OneUptime supports multi-tenant projects with role-based permissions, custom branding on status pages, and integrations with Slack, Jira, GitHub, and over 5,000 additional tools via its workflow automation engine.

What You Get

Uptime Monitoring - Monitor websites, APIs, TCP ports, and custom scripts from globally distributed probe locations with configurable check intervals down to 30 seconds, SSL expiry alerts, and synthetic user flow simulations.
Status Pages - Publish custom-branded, public-facing status pages with real-time component status, 90-day uptime history graphs, subscriber notifications via email, SMS, or RSS, and maintenance window announcements.
Incident Management - Declare incidents from Slack or the dashboard, auto-assign responders by service ownership, track timeline updates collaboratively, and close incidents with built-in postmortem templates.
On-Call Scheduling - Define rotating on-call schedules with escalation policies, vacation overrides, and multi-channel alerting via phone call, SMS, email, Slack, or PagerDuty-compatible webhooks.
Logs Management - Ingest and search logs from Kubernetes, Docker, Podman, Linux hosts, and custom applications via OpenTelemetry or FluentBit, with real-time tailing and structured log filtering.
Application Performance Monitoring - Capture distributed traces with end-to-end latency breakdowns, service dependency maps generated by eBPF probes in Kubernetes, and auto-instrumentation via one-click OTel setup.
Error Tracking - Detect, group, and triage exceptions with stack traces, user context, and automatic cross-linking to related logs, traces, and open incidents.
AI Copilot - An autonomous AI agent that detects anomalies across observability signals, identifies root causes, clones your repository, generates code fixes, and opens pull requests without human intervention.
MCP Server - A first-class Model Context Protocol server that exposes incident management, monitoring, and status page operations as structured tools consumable by any LLM agent.
Infrastructure Monitoring - Deploy copy-paste OpenTelemetry agents for bare-metal servers, Kubernetes clusters, Docker hosts, Proxmox hypervisors, and Ceph storage with alert templates pre-configured.
Workflow Automation - Connect OneUptime to Slack, Jira, GitHub, and 5,000+ apps to automate alert routing, incident channel creation, ticket creation, and remediation runbooks.
Dashboards and Metrics - Build custom dashboards combining infrastructure metrics, business KPIs, and trace data with real-time refresh and shareable links for stakeholders.
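The 90-day uptime history mentioned for status pages comes down to simple arithmetic over check results. The following is a minimal TypeScript sketch, assuming each probe check is stored as a timestamped up/down result. The `CheckResult` type and the `uptimePercent` function are hypothetical names used for illustration and are not part of OneUptime's API.

```typescript
// Hypothetical shape of one probe check result (not OneUptime's data model).
interface CheckResult {
  timestamp: Date;
  isUp: boolean;
}

// Percentage of successful checks inside the trailing window, rounded to
// two decimals. Returns null when no checks fall inside the window.
function uptimePercent(
  checks: CheckResult[],
  now: Date,
  windowDays: number = 90,
): number | null {
  const windowEnd = now.getTime();
  const windowStart = windowEnd - windowDays * 24 * 60 * 60 * 1000;
  const inWindow = checks.filter((check) => {
    const time = check.timestamp.getTime();
    return time >= windowStart && time <= windowEnd;
  });
  if (inWindow.length === 0) {
    return null;
  }
  const upCount = inWindow.filter((check) => check.isUp).length;
  return Math.round((upCount / inWindow.length) * 10000) / 100;
}
```

At a 30-second check interval, a single probe location produces 259,200 checks over 90 days, so a real implementation would likely aggregate in the database rather than in application memory.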

Common Use Cases

SaaS product reliability - A startup deploys OneUptime probes globally to monitor their API, publishes a branded status page for customers, and uses on-call scheduling to ensure the right engineer is paged when error rates spike in production.
Replacing a monitoring SaaS stack - An engineering team consolidates PagerDuty, Datadog, Sentry, and StatusPage.io into OneUptime, cuts their tool budget significantly, and gains unified context across alerts, traces, and incident timelines.
Kubernetes observability - A platform team installs the OneUptime Helm chart alongside the OTel-based Kubernetes agent to get node metrics, pod logs, eBPF service maps, and cluster health alerts in a single dashboard.
AI-assisted incident remediation - An SRE team connects OneUptime AI Copilot to their GitHub repositories; when an exception is detected in production, the agent automatically opens a pull request with a code fix before the on-call engineer even acknowledges the alert.
Proxmox and Ceph infrastructure monitoring - A self-hosted cloud provider uses OneUptime’s native Proxmox and Ceph agents to monitor VM health, storage replication status, OSD health, and cluster capacity with built-in alert templates.
LLM agent operations tooling - A team building AI agents connects OneUptime’s MCP server to their LLM workflows, allowing the model to query incident history, check monitor status, and acknowledge alerts as part of autonomous operations pipelines.
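To make the MCP use case more concrete, here is a minimal sketch of how a server might describe and dispatch one tool. The `name`, `description`, and `inputSchema` fields and the `content`/`isError` result shape follow the Model Context Protocol tool format. Everything else is an assumption for illustration: the `get_monitor_status` tool, its argument, and the in-memory monitor data are hypothetical and do not come from OneUptime's MCP server.

```typescript
// Tool definition fields follow the MCP tool format.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

// Tool result shape: a list of content parts plus an optional error flag.
interface ToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

type ToolHandler = (args: Record<string, unknown>) => ToolResult;

// Hypothetical in-memory state standing in for the platform's monitor data.
const monitorStatus = new Map<string, string>([
  ["api-gateway", "operational"],
  ["checkout", "degraded"],
]);

// Hypothetical tool; the name and arguments are illustrative only.
const getMonitorStatus: ToolDefinition = {
  name: "get_monitor_status",
  description: "Return the current status of a monitor by name.",
  inputSchema: {
    type: "object",
    properties: {
      monitorName: { type: "string", description: "Monitor to look up" },
    },
    required: ["monitorName"],
  },
};

const handlers = new Map<string, ToolHandler>([
  [
    getMonitorStatus.name,
    (args) => {
      const name = args["monitorName"];
      const status = typeof name === "string" ? monitorStatus.get(name) : undefined;
      if (status === undefined) {
        return { content: [{ type: "text", text: "Unknown monitor" }], isError: true };
      }
      return { content: [{ type: "text", text: status }] };
    },
  ],
]);

// Dispatch a call the way a server would after receiving a "tools/call" request.
function callTool(name: string, args: Record<string, unknown>): ToolResult {
  const handler = handlers.get(name);
  if (handler === undefined) {
    return { content: [{ type: "text", text: `Unknown tool: ${name}` }], isError: true };
  }
  return handler(args);
}
```

An LLM agent would first list the available tool definitions, then issue calls such as `callTool("get_monitor_status", { monitorName: "checkout" })` and read the text content of the result.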

Under The Hood

Architecture

OneUptime is organized as a monorepo of feature-scoped microservices (Telemetry, Notification, Identity, Workflow, Runbook, AI Agent, MCP, StatusPage, and others), each exposing its own Express HTTP server and sharing infrastructure through a Common library that provides base classes, type definitions, and database connectivity. A FeatureSet.init() contract gives each subsystem a consistent initialization lifecycle, and a generic BaseAPI<TModel, TService> class establishes conventional CRUD REST patterns across all resources. Services share a single PostgreSQL schema for operational data and a ClickHouse cluster for time-series analytics, with Redis as the pub/sub backbone for real-time dashboard updates via Socket.io. The monorepo structure enables consistent patterns and shared utilities but creates tight deployment coupling: all services must be versioned and deployed together.
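The generic API pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not OneUptime's code: only the names `BaseAPI<TModel, TService>` and `FeatureSet.init()` come from the description, while the method signatures, the `CrudService` contract, the in-memory service, and the `Incident` model are hypothetical. Express is left out so the sketch stays self-contained.

```typescript
// All shapes below are assumptions; the real OneUptime classes may differ.
interface Model {
  id: string;
}

// Minimal service contract that the generic API layer depends on.
interface CrudService<TModel extends Model> {
  create(item: TModel): TModel;
  findById(id: string): TModel | undefined;
}

// In-memory service standing in for a database-backed one.
class InMemoryService<TModel extends Model> implements CrudService<TModel> {
  private readonly items = new Map<string, TModel>();

  create(item: TModel): TModel {
    this.items.set(item.id, item);
    return item;
  }

  findById(id: string): TModel | undefined {
    return this.items.get(id);
  }
}

interface ApiResponse {
  status: number;
  body: unknown;
}

// One generic class yields the same CRUD routes for every resource.
class BaseAPI<TModel extends Model, TService extends CrudService<TModel>> {
  private readonly resource: string;
  private readonly service: TService;

  constructor(resource: string, service: TService) {
    this.resource = resource;
    this.service = service;
  }

  routes(): string[] {
    return [`POST /${this.resource}`, `GET /${this.resource}/:id`];
  }

  post(item: TModel): ApiResponse {
    return { status: 200, body: this.service.create(item) };
  }

  get(id: string): ApiResponse {
    const item = this.service.findById(id);
    if (item === undefined) {
      return { status: 404, body: { error: "Not found" } };
    }
    return { status: 200, body: item };
  }
}

// A feature set bundles its APIs behind a single init() call.
interface FeatureSet {
  init(): string[];
}

interface Incident extends Model {
  title: string;
}

const incidentApi = new BaseAPI<Incident, InMemoryService<Incident>>(
  "incident",
  new InMemoryService<Incident>(),
);

const incidentFeatureSet: FeatureSet = {
  init: () => incidentApi.routes(),
};
```

Adding another resource in this sketch means declaring a model type and constructing another `BaseAPI`, which is the kind of consistency the shared base class is meant to provide.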

Tech Stack

Node.js with TypeScript 5.x powers all services, using Express for HTTP routing and EJS for server-side template rendering in dashboard views. TypeORM manages relational data in PostgreSQL, while ClickHouse serves as the analytics database for high-throughput telemetry ingestion (logs, metrics, traces). Redis handles caching, sessions, and real-time pub/sub. A gRPC server in the Telemetry feature set handles high-volume OTel data ingest alongside HTTP endpoints for Syslog, Fluent, Pyroscope, and proprietary probe protocols. The AI Agent integrates with OpenCode, an open-source LLM coding agent, supporting multiple LLM provider backends for automated pull request generation. The MCP server implements the Model Context Protocol specification to expose platform operations as structured tools. Helm charts and Docker Compose configurations handle Kubernetes and single-node deployments respectively.

Code Quality

The codebase maintains comprehensive test coverage using Playwright for E2E flows (sign-up, project creation, product-specific workflows for Proxmox and Ceph) and Jest for unit testing across packages. A @CaptureSpan() decorator applied throughout service code provides automatic OpenTelemetry tracing of individual methods without boilerplate. Typed exceptions (BadDataException, BadRequestException, NotFoundException, ServerException) enforce explicit error handling in service layers and API routes. ESLint with Prettier and TypeScript strict mode are enforced project-wide, and Husky pre-commit hooks prevent non-conforming commits. The combination of typed exceptions, decorator-based observability, and consistent base class patterns produces a codebase with strong internal consistency at scale.
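The two patterns named above, typed exceptions and automatic span capture, can be illustrated with a small sketch. Two assumptions are worth stating plainly. First, this uses a plain higher-order function named `captureSpan` instead of a real `@CaptureSpan()` decorator, so the example does not depend on decorator compiler settings. Second, the exception base class, the status codes, and the `recordedSpans` array are hypothetical stand-ins; only the exception class names are taken from the description.

```typescript
// Hypothetical base class; the status codes are assumptions.
class AppException extends Error {
  readonly statusCode: number;

  constructor(statusCode: number, message: string) {
    super(message);
    this.statusCode = statusCode;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class BadDataException extends AppException {
  constructor(message: string) {
    super(400, message);
  }
}

class NotFoundException extends AppException {
  constructor(message: string) {
    super(404, message);
  }
}

// Stand-in for what a tracing SDK would export.
interface SpanRecord {
  name: string;
  durationMs: number;
  ok: boolean;
}

const recordedSpans: SpanRecord[] = [];

// Wrap a function so every call records a span, including failed calls.
function captureSpan<TArgs extends unknown[], TResult>(
  name: string,
  fn: (...args: TArgs) => TResult,
): (...args: TArgs) => TResult {
  return (...args: TArgs): TResult => {
    const start = Date.now();
    let ok = false;
    try {
      const result = fn(...args);
      ok = true;
      return result;
    } finally {
      recordedSpans.push({ name, durationMs: Date.now() - start, ok });
    }
  };
}

// Translate any thrown value into an HTTP status code at the API boundary.
function toStatusCode(err: unknown): number {
  return err instanceof AppException ? err.statusCode : 500;
}

// Hypothetical service method wrapped for tracing.
const findMonitor = captureSpan(
  "MonitorService.findMonitor",
  (id: string): string => {
    if (id.trim() === "") {
      throw new BadDataException("Monitor id is required");
    }
    if (id !== "m1") {
      throw new NotFoundException(`Monitor ${id} not found`);
    }
    return "API health check";
  },
);
```

Because each exception carries its own status code, the route layer in this sketch needs only one `toStatusCode` call rather than a branch per error type, and untyped errors fall through to 500.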

What Makes It Unique

OneUptime’s most technically novel contribution is shipping an MCP (Model Context Protocol) server as a first-class feature of a monitoring platform, enabling LLM agents to query incident status, acknowledge alerts, and manage monitors through a structured tool protocol designed for autonomous operation. The AI Agent service implements a complete agentic remediation loop: it detects production exceptions from live telemetry, clones the customer’s repository using a managed workspace, runs OpenCode to generate a targeted code fix with full stack trace context, and opens a pull request, all without human intervention. The inclusion of eBPF-powered distributed tracing in Kubernetes alongside native agents for Proxmox, Ceph, and Podman represents an unusually broad infrastructure surface for an open-source observability tool, competing directly with commercial platforms on infrastructure coverage.
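The remediation loop described above (detect, clone, generate a fix, open a pull request) can be sketched as a small pipeline with injected steps. This is a hedged illustration only: every type and function name here is hypothetical, the steps are faked in memory, and nothing below reflects how the AI Agent service or OpenCode is actually invoked.

```typescript
// Every name in this sketch is hypothetical.
interface DetectedException {
  service: string;
  message: string;
  stackTrace: string;
}

// The three external steps, injected so they can be swapped or faked.
interface RemediationDeps {
  // Resolves to a workspace path containing the cloned repository.
  cloneRepository(service: string): Promise<string>;
  // Resolves to a patch, or null when no fix could be produced.
  generateFix(workspace: string, exception: DetectedException): Promise<string | null>;
  // Resolves to the URL of the opened pull request.
  openPullRequest(workspace: string, patch: string, title: string): Promise<string>;
}

type RemediationOutcome =
  | { status: "pr_opened"; pullRequestUrl: string }
  | { status: "no_fix" }
  | { status: "failed"; reason: string };

// Run the steps in order and stop early if no fix is produced.
async function remediate(
  exception: DetectedException,
  deps: RemediationDeps,
): Promise<RemediationOutcome> {
  try {
    const workspace = await deps.cloneRepository(exception.service);
    const patch = await deps.generateFix(workspace, exception);
    if (patch === null) {
      return { status: "no_fix" };
    }
    const pullRequestUrl = await deps.openPullRequest(
      workspace,
      patch,
      `Fix: ${exception.message}`,
    );
    return { status: "pr_opened", pullRequestUrl };
  } catch (err) {
    return { status: "failed", reason: err instanceof Error ? err.message : String(err) };
  }
}

// Fake steps that only record the order in which they ran.
const steps: string[] = [];
const fakeDeps: RemediationDeps = {
  cloneRepository: async (service) => {
    steps.push("clone");
    return `/tmp/workspace/${service}`;
  },
  generateFix: async (_workspace, exception) => {
    steps.push("fix");
    return `patch for: ${exception.message}`;
  },
  openPullRequest: async () => {
    steps.push("pr");
    return "https://example.com/pull/1";
  },
};

const sampleException: DetectedException = {
  service: "checkout",
  message: "TypeError: cart is undefined",
  stackTrace: "at CartService.total (cart.ts:42)",
};
```

Returning a tagged outcome rather than throwing keeps the "no fix found" case distinct from a failure, which matters when the loop runs unattended and a human reviews the results later.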

Self-Hosting

OneUptime is released under the Apache License 2.0, one of the most permissive open-source licenses available. You can use it commercially, modify the source, distribute it, and incorporate it into proprietary products without copyleft obligations, provided you retain the license text and attribution notices when you redistribute. There is no open-core license split or feature gating in the community edition: the full platform, including the AI Copilot and MCP server, is available under the same Apache 2.0 terms.

Self-hosting OneUptime is a meaningful operational commitment. The recommended production deployment uses Kubernetes with Helm and requires running PostgreSQL, ClickHouse, Redis, and Nginx alongside the application microservices — at least seven containers in a minimal configuration, more as you add monitoring probes and AI agents. You are responsible for database backups, certificate renewal, version upgrades (which can involve schema migrations), and high-availability configuration. The Docker Compose path is explicitly labeled not recommended for production. Operational complexity scales with the number of probes, monitored services, and telemetry volume ingested into ClickHouse.

The Enterprise edition adds hardened container images, priority support with SLA guarantees, custom feature development, and data residency options — addressing the gap for regulated industries that need assured support or regional data controls. The cloud-hosted version at oneuptime.com removes all infrastructure management burden, provides automatic upgrades, and supports the project’s continued development. Teams evaluating self-hosting should weigh the cost savings against the engineering time needed for initial deployment, ongoing maintenance, and incident response for the monitoring platform itself — a real risk when the tool you rely on to detect outages is itself self-managed.

Repository Health

Pre-computed score based on development activity, maintenance, community, maturity, and trend momentum.

90/100 (Excellent)

Development Activity: 100
Maintenance: 100
Community: 64
Maturity: 56
Momentum: 40

Growing community support
Very active development
Well-maintained with consistent updates
Rapidly growing project

Technical Analysis

81/100 (Excellent)

Architecture: 72
Code Quality: 78
Innovation: 82
Learning Curve: 90

Repository Stats

Contributors: 148

Total Commits: 38,851

Monthly Commits: 524

Watchers

Repo Age: 5 years

Last Commit: 1 day ago

Built With

TypeScript 81.0%

Recent Releases

100 total

~1.7 releases/month

Alternative To

Datadog Bugsnag Statuspage Rootly

Topics

devops monitoring incident-response incident-management status-page observability on-call

Related Apps

JavaScript

56%

MIT

Uptime Kuma

Monitoring

88,751

Self-hosted monitoring for every service you run — 23 monitor types, 95 notification channels, live dashboards, and public status pages with no vendor lock-in.