Keep is an open-source AIOps platform designed for SREs, DevOps engineers, and operations teams managing complex, multi-source alert environments. It reduces alert fatigue and noise by consolidating alerts from diverse monitoring systems into a unified interface with deduplication, enrichment, and correlation. Keep helps teams reduce MTTR by automating incident response and attaching context to alerts before a human has to step in.
Built in Python and designed for extensibility, Keep supports bidirectional integrations with 110+ tools including Datadog, Prometheus, Grafana, Jira, PagerDuty, and CloudWatch. It features a YAML-based workflow engine similar to GitHub Actions and AI backends (OpenAI, Anthropic, Ollama, Gemini), and it supports both self-hosted and cloud deployments via Docker or Kubernetes.
What You Get
- Single Pane of Glass - Unified dashboard to view, filter, and query alerts from 110+ monitoring tools including Datadog, Prometheus, Grafana, CloudWatch, and Jira in one interface.
- Alert Deduplication & Correlation - Automatically groups related alerts into high-fidelity incidents using rule-based and AI-powered correlation to reduce noise and alert fatigue.
- AI-Powered Enrichment & Summarization - Uses AI backends like OpenAI, Anthropic, Gemini, and Ollama to summarize alert context, suggest root causes, and generate incident reports.
- Bi-Directional Integrations - Syncs alerts and actions with monitoring tools (Datadog, Prometheus), ticketing (Jira), incident management (PagerDuty), and CMDBs (NetBox) in both directions.
- Workflow Automation Engine - YAML-based workflows (like GitHub Actions) to query MySQL, update Jira tickets, run Python scripts, and enrich alerts with external data automatically.
- Provider Health Checker - Real-time assessment of alert quality and integration health across Datadog, CloudWatch, Grafana, PagerDuty, and GCP Monitoring without requiring signup.
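The deduplication idea in the list above can be illustrated with a minimal sketch. It assumes alerts are plain dictionaries and that a fingerprint is a hash of a few identifying fields; the field names and the `fingerprint` and `deduplicate` helpers are hypothetical and are not Keep's actual API.

```python
import hashlib

# Hypothetical identifying fields; Keep's real alert schema and
# fingerprint rules may differ.
FINGERPRINT_FIELDS = ("source", "service", "name")


def fingerprint(alert: dict) -> str:
    """Hash the identifying fields so repeated firings map to one key."""
    key = "|".join(str(alert.get(field, "")) for field in FINGERPRINT_FIELDS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def deduplicate(alerts: list[dict]) -> list[dict]:
    """Keep the latest alert per fingerprint and count how often it fired."""
    merged: dict[str, dict] = {}
    for alert in alerts:
        fp = fingerprint(alert)
        count = merged[fp]["count"] + 1 if fp in merged else 1
        merged[fp] = {**alert, "fingerprint": fp, "count": count}
    return list(merged.values())


alerts = [
    {"source": "prometheus", "service": "api", "name": "HighLatency", "status": "firing"},
    {"source": "datadog", "service": "db", "name": "DiskFull", "status": "firing"},
    {"source": "prometheus", "service": "api", "name": "HighLatency", "status": "resolved"},
]
unique_alerts = deduplicate(alerts)
```

Here the two `HighLatency` events collapse into one entry that carries the most recent status and a firing count, which is the kind of noise reduction the feature list describes.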
Common Use Cases
- Managing alert noise in multi-cloud environments - An SRE uses Keep to consolidate alerts from Datadog, CloudWatch, and Prometheus, remove duplicates, and correlate related events into single incidents to reduce alert fatigue.
- Automating incident response workflows - A DevOps team uses Keep’s workflow engine to create a Jira ticket and send a Slack notification when a critical Prometheus alert fires, reducing manual triage time.
- Enriching alerts with contextual data - An operations engineer enriches a CloudWatch alert with data from NetBox (CMDB) and MySQL (service ownership) to automatically assign ownership and add context before escalation.
- Enterprise alert correlation at scale - A global enterprise uses Keep Enterprise to correlate thousands of daily alerts using AI models trained on past incidents, identifying recurring patterns and reducing false positives.
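The enrichment use case above can be sketched as follows. This is an illustration only: the two dictionaries are in-memory stand-ins for a NetBox lookup and a MySQL ownership query, and the `enrich` function, the keys, and the values are all made up for the example.

```python
# In-memory stand-ins for the external systems. A real setup would query
# NetBox and MySQL; every key and value here is hypothetical.
CMDB = {"i-0abc": {"site": "eu-west-1a", "role": "web"}}
OWNERS = {"checkout": {"team": "payments", "oncall": "payments-oncall"}}


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with CMDB and ownership context attached."""
    enriched = dict(alert)
    enriched["cmdb"] = CMDB.get(alert.get("host"), {})
    owner = OWNERS.get(alert.get("service"))
    enriched["owner"] = owner["team"] if owner else "unassigned"
    return enriched


known = enrich({"name": "HighCPU", "host": "i-0abc", "service": "checkout"})
unknown = enrich({"name": "HighCPU", "host": "i-0zzz", "service": "search"})
```

Falling back to an explicit `"unassigned"` owner, rather than failing, keeps an alert moving toward escalation even when the lookup has no match.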
Under The Hood
Architecture
- Clean separation of frontend and backend via environment-configured HTTP boundaries, enabling independent deployment and scaling
- Provider-based backend architecture with abstract interfaces and concrete implementations for authentication and monitoring, ensuring extensibility and dependency inversion
- WebSocket and background task processing isolated into dedicated services with Redis as the message broker, keeping real-time updates non-blocking and alert handling asynchronous
- Configuration and secrets managed through environment variables and file-based managers, with environment-specific Docker Compose setups for flexible deployment
- Monitoring stack externally integrated via configuration, preserving backend modularity while maintaining comprehensive observability
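The provider-based design described above, with abstract interfaces and concrete implementations, might look roughly like this. The class and method names (`BaseProvider`, `validate_config`, `get_alerts`, `StaticProvider`, `collect`) are assumptions chosen for illustration and may not match Keep's real base classes.

```python
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Hypothetical interface; Keep's actual base class may differ."""

    def __init__(self, config: dict):
        self.config = config
        self.validate_config()  # reject a bad config before any use

    @abstractmethod
    def validate_config(self) -> None:
        """Raise ValueError if the config is unusable."""

    @abstractmethod
    def get_alerts(self) -> list[dict]:
        """Return the provider's current alerts."""


class StaticProvider(BaseProvider):
    """Toy concrete provider that serves alerts from its own config."""

    def validate_config(self) -> None:
        if "alerts" not in self.config:
            raise ValueError("config must contain 'alerts'")

    def get_alerts(self) -> list[dict]:
        return list(self.config["alerts"])


def collect(providers: list[BaseProvider]) -> list[dict]:
    """Depends only on the abstract interface (dependency inversion)."""
    return [alert for provider in providers for alert in provider.get_alerts()]


all_alerts = collect([
    StaticProvider({"alerts": [{"name": "HighLatency"}]}),
    StaticProvider({"alerts": [{"name": "DiskFull"}]}),
])
```

Because `collect` only knows about `BaseProvider`, adding a new integration in this sketch means adding one subclass without touching the calling code.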
Tech Stack
- Python 3.11+ backend powered by FastAPI and SQLModel with SQLAlchemy 2.0, supporting multiple database backends through standardized connectors
- Next.js frontend with NextAuth, integrated with Sentry, PostHog, and Pusher for authentication, analytics, and real-time notifications
- Distributed task processing via ARQ with Redis, complemented by Soketi for WebSocket communication
- Full-stack observability using OpenTelemetry for traces and metrics, with Loki, Tempo, and Prometheus for centralized logging and monitoring
- Production-grade deployment via Docker Compose with CI/CD automation, semantic release management, and support for Kubernetes and major cloud platforms
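The queue-and-worker pattern behind the distributed task processing item can be sketched with the standard library alone. Note the substitution: this uses an in-process `asyncio.Queue` in place of ARQ and Redis, so it shows only the shape of the pattern, and the `worker` and `ingest` names are hypothetical.

```python
import asyncio


async def worker(queue: asyncio.Queue, processed: list) -> None:
    """Drain alerts from the queue until a None sentinel arrives."""
    while True:
        alert = await queue.get()
        if alert is None:
            break
        await asyncio.sleep(0)  # placeholder for real processing work
        processed.append({**alert, "processed": True})


async def ingest(alerts: list[dict]) -> list[dict]:
    """Enqueue alerts without blocking, then wait for the worker to finish."""
    queue: asyncio.Queue = asyncio.Queue()
    processed: list[dict] = []
    task = asyncio.create_task(worker(queue, processed))
    for alert in alerts:
        queue.put_nowait(alert)  # the producer never waits on processing
    queue.put_nowait(None)  # sentinel: no more work
    await task
    return processed


results = asyncio.run(ingest([{"name": "HighLatency"}, {"name": "DiskFull"}]))
```

In a deployment like the one described, the queue would live in Redis and the worker would run in a separate process, so the API can acknowledge an alert before it has been processed.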
Code Quality
- Extensive test coverage across unit, integration, and end-to-end scenarios with robust mocking of external dependencies
- Strong type safety enforced through Pydantic schemas and custom validators, preventing malformed configurations at runtime
- Structured error handling via validation exceptions rather than custom exception classes, giving predictable and consistent failure modes
- Modular test fixtures for databases and services enable isolated, repeatable test execution with clear separation of concerns
- Consistent naming conventions and declarative workflow definitions enhance readability and long-term maintainability
- Comprehensive linting and schema validation embedded in configuration systems to prevent invalid provider setups before execution
What Makes It Unique
- Native CEL expression engine that transforms string-based alert conditions into numeric comparisons with automatic severity normalization, enabling precise rule evaluation without external dependencies
- Enrichment event system with hierarchical data modeling and audit-trail-aware logging, deeply integrated into the alert pipeline
- Topology service supporting hybrid automation by allowing human-defined relationships to coexist with auto-discovered infrastructure maps
- Decoupled alert enrichment and rule execution layers that enable independent scaling of data transformation and routing logic
- Event-driven enrichment pipeline with timestamped partitioning optimized for high-performance time-series alert analytics
- Seamless integration of SQLModel with Pydantic to deliver type-safe APIs while retaining full ORM flexibility for database operations
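The severity normalization mentioned in the first item above can be shown in a few lines: a string severity is mapped to a numeric rank so that a condition such as "severity > warning" becomes an ordinary number comparison. This is not a CEL parser, and the rank table, the supported operators, and the `severity_matches` helper are assumptions made for the example.

```python
import operator

# Assumed ordering; the severity names and ranks Keep uses may differ.
SEVERITY_RANK = {"low": 1, "info": 2, "warning": 3, "high": 4, "critical": 5}

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def severity_matches(alert: dict, op: str, threshold: str) -> bool:
    """Compare severities by rank instead of by string value."""
    actual = SEVERITY_RANK[alert["severity"].lower()]
    expected = SEVERITY_RANK[threshold.lower()]
    return OPS[op](actual, expected)


is_urgent = severity_matches({"severity": "Critical"}, ">", "warning")
is_noise = severity_matches({"severity": "info"}, ">=", "warning")
```

Comparing the raw strings would give the wrong answer here, since `"critical" > "warning"` is false alphabetically; mapping to ranks first is what makes the rule evaluate as intended.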