Netdata

Real-time per-second metrics, ML-powered anomaly detection, and zero-config observability for any infrastructure.

78K stars
6.4K forks
GNU General Public License v3.0
C

Netdata is an open-source, real-time infrastructure monitoring platform that collects, stores, and visualizes every metric at one-second resolution, with no configuration required. It auto-discovers services, containers, virtual machines, hardware sensors, and applications on over 100 platforms, including Linux, macOS, Windows, FreeBSD, Docker, and Kubernetes, and surfaces dashboards for them immediately.

The agent runs a full observability pipeline at the edge: collecting metrics through 800+ integrations, training per-metric machine learning models for anomaly detection, evaluating hundreds of pre-configured health alerts, streaming data to parent nodes for centralized retention, and exporting to Prometheus, InfluxDB, Graphite, and OpenTelemetry, all while consuming only about 5% CPU and 150 MB RAM per node.

Netdata’s three-part ecosystem consists of the GPL v3-licensed agent (the core engine), the Netdata Cloud service (optional, adds RBAC, centralized alerts, and multi-node dashboards without centralizing raw metrics), and the Netdata UI (interactive dashboards included in packages). The parent-child streaming architecture scales from a single server to millions of metrics per second across complex multi-cloud environments.

Netdata is a CNCF member project and one of the most starred observability tools in the CNCF landscape. It has been recognized by the University of Amsterdam as the most energy-efficient Docker monitoring tool, and its built-in MCP server enables AI agents to query infrastructure metrics directly.

What You Get

  • Per-Second Real-Time Dashboards - Automatically generated, interactive dashboards that visualize every collected metric at one-second resolution with zero query language required, available at http://localhost:19999 immediately after installation.
  • 800+ Auto-Discovered Integrations - Collectors for systems, containers, databases (PostgreSQL, MySQL, MongoDB, Redis, ClickHouse, and more), web servers, message queues, cloud providers, and hardware sensors that activate automatically when a service is detected.
  • Edge-Based ML Anomaly Detection - Unsupervised k-means models trained per metric on the agent itself, scoring every data point for anomalies and surfacing correlated anomalies across metrics when something goes wrong.
  • Tiered Long-Term Storage - A custom DBENGINE time-series database storing metrics at approximately 0.5 bytes per sample across three tiers (per-second, per-minute, per-hour), enabling over a year of retention on modest disk space.
  • Distributed Health Alerting - Hundreds of pre-configured alert templates that evaluate independently on each agent and parent node, with notifications to Slack, PagerDuty, Telegram, email, Discord, Microsoft Teams, and custom shell scripts.
  • MCP Server for AI Agents - A built-in Model Context Protocol server that exposes infrastructure metrics, alerts, anomaly scores, and correlated metric queries to AI coding assistants and automation agents.
  • Parent-Child Streaming - Native metric streaming between agents and parent nodes for centralized dashboards, longer retention, and fleet-wide alert management without centralizing raw data in an external service.
  • Multi-Platform Export - Configurable exporters that push metrics to Prometheus remote write, InfluxDB, Graphite, OpenTSDB, AWS Kinesis, MongoDB, and OpenTelemetry endpoints.
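
To make the tiered-storage figure concrete, here is a rough back-of-the-envelope sketch of the disk arithmetic. It assumes the quoted ~0.5 bytes per sample applies equally to every tier, which is a simplification (the real per-sample cost may differ by tier), and the metric count and retention periods are hypothetical values chosen only for illustration.

```python
BYTES_PER_SAMPLE = 0.5  # approximate figure quoted above; assumed equal for all tiers

# Seconds between stored samples in each tier.
TIER_INTERVALS = {"per-second": 1, "per-minute": 60, "per-hour": 3600}


def tier_disk_bytes(metrics: int, retention_days: float, interval_s: int,
                    bytes_per_sample: float = BYTES_PER_SAMPLE) -> float:
    """Estimate disk usage of one tier: stored samples times bytes per sample."""
    samples_per_metric = retention_days * 86_400 / interval_s
    return metrics * samples_per_metric * bytes_per_sample


if __name__ == "__main__":
    metrics = 2_000  # hypothetical number of metrics on one node
    # Hypothetical retention plan, in days, for each tier.
    plan = {"per-second": 14, "per-minute": 90, "per-hour": 365}
    for tier, days in plan.items():
        size = tier_disk_bytes(metrics, days, TIER_INTERVALS[tier])
        print(f"{tier}: {size / 1e6:.1f} MB")
```

Under these assumptions the per-second tier dominates disk usage, while a full year of per-hour data costs only a few megabytes, which is why long retention is feasible on modest disks.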

Common Use Cases

  • Live Infrastructure Troubleshooting - SREs use per-second CPU, memory, disk I/O, and network metrics to pinpoint performance regressions during incidents without waiting for metric scrape intervals.
  • Kubernetes Cluster Monitoring - Platform engineers deploy Netdata as a DaemonSet to gain per-pod, per-container, and per-namespace metrics alongside node-level hardware and kernel metrics with auto-discovery of new workloads.
  • Database Query Performance Observability - DBAs use Netdata’s interactive Top Queries function to identify slow queries, long-running operations, and deadlocks across PostgreSQL, MySQL, MongoDB, Redis, and 14 other databases directly from the dashboard.
  • Lean Team Observability - Small engineering teams replace Prometheus + Grafana stacks with Netdata to eliminate the overhead of writing PromQL queries, configuring scrape jobs, and building dashboards from scratch.
  • Fleet-Wide Anomaly Detection - Operations teams connect hundreds of servers to Netdata Parents and Cloud to surface correlated anomalies across the entire fleet, using the scoring engine to find patterns and root causes during degradations.
  • On-Premises Compliance Monitoring - Security-conscious organizations self-host Netdata without any metrics leaving their infrastructure, using local parent nodes for centralization and the optional cloud plane only for UI access and alert routing.

Under The Hood

Architecture

Netdata implements a distributed observability pipeline built around a modular plugin architecture. The core daemon orchestrates a collection of specialized plugins written in C, Go, Python, and Bash, each responsible for a subsystem or set of integrations. The plugin communication protocol uses Unix pipes with a defined text format, isolating collection failures to individual plugins without destabilizing the agent. A parent-child streaming layer implements replication of the full metric stream over TCP, enabling multi-level hierarchies where children offload storage and alert evaluation to parents. The ACLK (Agent-Cloud Link) subsystem handles the optional bidirectional connection to Netdata Cloud using MQTT over TLS without exposing metrics to the cloud; only metadata and alert transitions traverse the cloud plane. The health evaluation subsystem runs independently at every level, re-evaluating alert templates against local metric data using a custom expression language and triggering notification actions through shell scripts.
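
The pipe-based plugin protocol described above can be illustrated with a small sketch of what an external plugin might write to its standard output. The `CHART`, `DIMENSION`, `BEGIN`, `SET`, and `END` keywords follow Netdata's external-plugin text format as commonly documented; the helper functions and the `example.load` chart are hypothetical, and optional fields (family, context, chart type, and so on) are omitted, so treat this as an approximation rather than a complete plugin.

```python
import sys
from typing import Dict, Iterable, List, TextIO


def chart_definition(chart_id: str, title: str, units: str,
                     dimensions: Iterable[str]) -> List[str]:
    """Lines that declare a chart and its dimensions, sent once at startup."""
    # The second field (chart name) is left empty so the agent derives it from the id.
    lines = [f"CHART {chart_id} '' '{title}' '{units}'"]
    lines += [f"DIMENSION {dim}" for dim in dimensions]
    return lines


def chart_update(chart_id: str, values: Dict[str, int]) -> List[str]:
    """Lines that report one collection cycle for a previously declared chart."""
    lines = [f"BEGIN {chart_id}"]
    lines += [f"SET {dim} = {value}" for dim, value in values.items()]
    lines.append("END")
    return lines


def emit(lines: Iterable[str], out: TextIO = sys.stdout) -> None:
    """Write protocol lines to the pipe and flush so the agent sees them promptly."""
    out.write("\n".join(lines) + "\n")
    out.flush()


if __name__ == "__main__":
    # Hypothetical chart with two dimensions and a single update cycle.
    emit(chart_definition("example.load", "Example load", "units", ["load1", "load5"]))
    emit(chart_update("example.load", {"load1": 3, "load5": 2}))
```

Because each plugin is a separate process talking over its own pipe, a plugin that crashes or emits garbage only affects its own charts, which is the isolation property the paragraph describes.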

Tech Stack

The agent core is written in C for maximum efficiency, with the ML subsystem implemented in C++ using k-means clustering from dlib. The go.d plugin (Go 1.26) provides 137 modern collectors with testify-based unit testing and a shared collector framework handling configuration, service discovery, and function execution. The journal-viewer plugin is implemented in Rust using a custom systemd-compatible journal format for OpenTelemetry log ingestion. The custom DBENGINE is a tiered time-series database using memory-mapped journal files with ZSTD compression, achieving approximately 0.5 bytes per sample. The web server is built-in, serving a React/JavaScript dashboard via CDN. The build system uses CMake with extensive platform detection across Linux, macOS, FreeBSD, and Windows. CI runs on GitHub Actions with CodeQL analysis, Coverity scanning, ARM architecture builds, and a dedicated Go test workflow.

Code Quality

The Go plugin codebase has extensive unit test coverage with 704 test files following Go testing conventions, httptest-based mock servers for HTTP collectors, and shared testdata fixtures. The C codebase has 18 unit test files and a dedicated test runner script, though C coverage is less systematic than the Go side. Error handling in C uses explicit return codes and centralized logging; the Go collectors use structured error wrapping. The codebase enforces consistent formatting (gofmt for Go, clang-format references for C), uses SPDX license headers on every file, and maintains a CONTRIBUTING.md guide. GitHub Actions runs checks on PRs including markdown validation, integration metadata validation, and multi-platform package builds.

What Makes It Unique

Netdata’s most distinctive technical choice is training unsupervised machine learning models per metric at the edge, on each agent, using k-means clustering with two cluster centers that adapt to each metric’s individual behavior pattern. This contrasts with cloud-based or centralized ML systems: anomaly detection runs with zero data leaving the host and with no threshold configuration. The MCP server integration is unique among open-source monitoring tools, exposing infrastructure queries, anomaly scoring, correlated metric search, and alert management as MCP tools consumable directly by AI agents. The NIDL (Node, Instance, Dimension, Label) data model enables automatic dashboard generation without any manual query or visualization configuration, and the scoring engine can surface weighted anomaly correlations across thousands of metrics simultaneously to assist root cause analysis.
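
The per-metric, two-center k-means idea can be sketched in a few lines. This is a simplified illustration, not Netdata's implementation (which is in C++ on top of dlib): the windowing scheme, the `lag` parameter, and the rule of normalizing by the largest training distance are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def windows(series: Sequence[float], lag: int) -> List[Vector]:
    """Turn a metric's history into overlapping feature vectors of `lag` samples."""
    return [tuple(series[i:i + lag]) for i in range(len(series) - lag + 1)]


def nearest_distance(v: Vector, centers: Sequence[Vector]) -> float:
    """Euclidean distance from a vector to its closest cluster center."""
    return min(math.dist(v, c) for c in centers)


def kmeans(points: List[Vector], k: int = 2, iterations: int = 20,
           seed: int = 0) -> List[Vector]:
    """Plain Lloyd's algorithm; an empty cluster keeps its previous center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


def train(series: Sequence[float], lag: int = 4) -> Tuple[List[Vector], float]:
    """Fit two centers to one metric and record the largest training distance."""
    points = windows(series, lag)
    centers = kmeans(points, k=2)
    max_dist = max(nearest_distance(p, centers) for p in points)
    return centers, max_dist


def anomaly_score(recent: Sequence[float], centers: Sequence[Vector],
                  max_dist: float) -> float:
    """Distance to the nearest center, relative to what was seen in training.

    Scores above 1.0 mean the window is farther from both centers than any
    training window was (the assumed anomaly rule in this sketch).
    """
    d = nearest_distance(tuple(recent), centers)
    if max_dist > 0:
        return d / max_dist
    return 0.0 if d == 0 else math.inf


if __name__ == "__main__":
    history = [10, 12, 11, 13] * 50            # hypothetical well-behaved metric
    centers, max_dist = train(history)
    print(anomaly_score([10, 12, 11, 13], centers, max_dist))   # typical window
    print(anomaly_score([10, 12, 11, 100], centers, max_dist))  # window with a spike
```

Because the model is trained only on the metric's own history, no threshold has to be configured by hand: "normal" is whatever the two centers learned, and each metric gets its own model.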

Self-Hosting

Netdata’s core agent is released under the GNU General Public License version 3, a strong copyleft license. This means you can use it freely in commercial environments, modify the source, and redistribute it, but any modified versions you distribute must also be released under GPL v3. For most self-hosting scenarios — running Netdata on your own servers to monitor your own infrastructure — the GPL v3 presents no practical restrictions, since you are not distributing the software to others.

Running Netdata yourself is operationally straightforward for single-node or small deployments. The one-line installer handles dependencies and service configuration on all major Linux distributions, Docker, and Windows. For fleet deployments, you need to configure parent nodes manually (streaming configuration files or via the UI), manage the CMake-based source builds or package upgrades across your fleet, handle disk retention sizing for the DBENGINE, and own the uptime of any parent nodes that serve as centralized dashboards or alert evaluators. The agent process is lightweight (~5% CPU, ~150 MB RAM at full ML operation), but parent nodes aggregating thousands of child agents require proportionally more memory and CPU. Alert notification delivery relies on your configured notification methods (email MTAs, webhook destinations, etc.) and is your operational responsibility.

Netdata Cloud is the optional managed layer that adds role-based access control, SSO, centralized alert routing, multi-node composite dashboards, UI-driven configuration of collectors and alerts, and horizontal scalability beyond what local parent nodes provide. Critically, Netdata Cloud does not store your raw metrics — it only retains metadata and alert transitions, so your metric data never leaves your infrastructure. The free community tier covers most individual use cases. Paid Business and Enterprise plans add RBAC with multiple users, more Spaces, priority support with SLAs, and advanced security controls. On-premises versions of Netdata Cloud (Netdata On-Prem) are available for organizations that cannot connect to any external service at all.
