GPU Hot is a real-time NVIDIA GPU monitoring tool designed for developers, DevOps engineers, and MLOps teams who need immediate visibility into GPU resource usage without relying on cloud services or complex infrastructure. It solves the problem of opaque GPU utilization in training, inference, and high-performance computing environments by providing a self-hosted, browser-accessible dashboard with sub-second updates.
Built with Python, FastAPI, and Chart.js, it runs in Docker using the NVIDIA Container Toolkit and supports both single-node and multi-node cluster monitoring via a hub-and-spoke architecture. The system collects metrics directly from NVML and falls back to nvidia-smi for older GPUs, with WebSocket streaming for live updates and a REST API for programmatic access.
What You Get
- Real-time GPU metrics - Sub-second updates for utilization, temperature, memory usage, power draw, fan speed, clock speeds, PCIe bandwidth, P-State, and throttle status using NVML and nvidia-smi.
- Process-level monitoring - Tracks active GPU processes including PID, memory consumption, and process names when run with --init --pid=host for detailed resource attribution.
- Multi-GPU and multi-node support - Automatically detects multiple GPUs on a single host and aggregates metrics from 1 to 100+ nodes in hub mode via the NODE_URLS environment variable.
- Historical charts - Interactive time-series visualizations for GPU utilization, temperature, power, and clocks powered by Chart.js with configurable update intervals.
- System-wide metrics - Monitors host CPU usage, RAM, swap, disk I/O, and network traffic alongside GPU data for holistic system health assessment.
- REST and WebSocket APIs - Exposes /api/gpu-data for JSON snapshots and /socket.io/ for real-time streaming of GPU, process, and system metrics to custom dashboards or automation tools.
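As a sketch of the programmatic access described above, a client can poll /api/gpu-data for a JSON snapshot. The base URL, port, and response field names (gpus, index, utilization, temperature) below are assumptions for illustration, not the documented schema:

```python
import json
from urllib.request import urlopen

def fetch_snapshot(base_url="http://localhost:8000"):
    """Fetch one JSON snapshot from /api/gpu-data (host and port are assumptions)."""
    with urlopen(f"{base_url}/api/gpu-data") as resp:
        return json.load(resp)

def summarize(snapshot):
    """Render one line per GPU; the field names here are assumed, not documented."""
    return [
        f"GPU {gpu.get('index')}: {gpu.get('utilization')}% util, {gpu.get('temperature')}C"
        for gpu in snapshot.get("gpus", [])
    ]
```

For continuous updates rather than polling, the /socket.io/ stream is the intended path; the snapshot endpoint suits scripts and one-shot checks.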
Common Use Cases
- Running LLM inference clusters - ML engineers use GPU Hot to monitor GPU memory leaks and thermal throttling during large language model inference workloads across multiple servers.
- Training job optimization - Data scientists track GPU utilization spikes and idle time to optimize batch sizes and training schedules in PyTorch/TensorFlow environments.
- Data center GPU resource sharing - DevOps teams deploy GPU Hot in hub mode to aggregate metrics from dozens of GPU nodes into a single dashboard for centralized monitoring and alerting.
- MLOps pipeline health checks - Teams integrate GPU Hot’s /api/gpu-data endpoint into CI/CD pipelines to fail builds if GPU memory exceeds thresholds or temperatures rise abnormally.
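The pipeline health-check use case can be sketched as a small gate that fails when any GPU exceeds memory or temperature thresholds. The snapshot field names and units below are assumptions about the /api/gpu-data response shape, not its documented format:

```python
def check_thresholds(snapshot, max_mem_frac=0.9, max_temp_c=85):
    """Return a list of violation messages; an empty list means the check passes."""
    violations = []
    for gpu in snapshot.get("gpus", []):
        used = gpu.get("memory_used", 0)       # assumed field, MiB
        total = gpu.get("memory_total", 1)     # assumed field, MiB
        if used / total > max_mem_frac:
            violations.append(
                f"GPU {gpu.get('index')}: memory {used}/{total} MiB exceeds {max_mem_frac:.0%}"
            )
        if gpu.get("temperature", 0) > max_temp_c:
            violations.append(
                f"GPU {gpu.get('index')}: {gpu.get('temperature')}C exceeds {max_temp_c}C"
            )
    return violations
```

A CI step would fetch the snapshot, call check_thresholds, and exit non-zero when the returned list is non-empty.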
Under The Hood
Architecture
- Monolithic FastAPI application with tightly coupled configuration, monitoring, and routing logic in a single entry point, violating separation of concerns
- Hub and monitor modes share nearly identical WebSocket handler structures with duplicated code and no abstract interface to unify behavior
- Metrics collection is hard-coded to NVIDIA-specific tools with no pluggable backend or dependency injection
- Configuration is mutated globally across modules via direct imports, creating implicit state that undermines modularity
- Shallow directory structure with no clear layering between data collection, business logic, and API routing
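One way the hard-coded collection path could be made pluggable is a minimal backend interface with constructor injection. This is an illustrative sketch of the suggested refactoring, not code from the project:

```python
from typing import Protocol

class MetricsBackend(Protocol):
    """Anything that can produce a metrics snapshot."""
    def collect(self) -> dict: ...

class StubBackend:
    """Stand-in backend, e.g. for tests or hosts without NVIDIA tooling."""
    def collect(self) -> dict:
        return {"gpus": []}

class Monitor:
    """Receives the backend as a dependency instead of importing NVML directly."""
    def __init__(self, backend: MetricsBackend) -> None:
        self._backend = backend

    def poll(self) -> dict:
        return self._backend.collect()
```

With this shape, an NVML backend, an nvidia-smi backend, or a fake for unit tests can be swapped in without touching the routing or WebSocket layers.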
Tech Stack
- Python 3 backend powered by FastAPI with REST and WebSocket endpoints for real-time GPU telemetry
- Dockerized deployment with native NVIDIA GPU access via the NVIDIA Container Toolkit and the host PID namespace for system-level monitoring
- Lightweight frontend built with HTML, CSS, and JavaScript for real-time visualization
- Minimalist dependency profile relying primarily on the standard library and FastAPI, with few third-party packages beyond it
- Consistent development environment enforced through .editorconfig for code style uniformity
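A single-node deployment along these lines might look as follows; the image name, tag, and published port are assumptions, and --gpus all requires the NVIDIA Container Toolkit on the host:

```shell
# Single-node monitor; the host PID namespace enables per-process attribution.
docker run --rm --gpus all --init --pid=host -p 8000:8000 gpu-hot:latest

# Hub mode: aggregate workers listed in NODE_URLS
# (a comma-separated URL list is an assumed format).
docker run --rm -p 8000:8000 \
  -e NODE_URLS="http://node1:8000,http://node2:8000" \
  gpu-hot:latest
```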
Code Quality
- Extensive test infrastructure simulates distributed GPU workloads but lacks unit tests for core logic
- Error handling is pervasive yet superficial, relying on bare except blocks and silent failures without structured logging
- Type annotations are inconsistently applied, with critical paths like WebSocket handlers lacking type safety
- Naming conventions are mostly descriptive, though some variables lack contextual clarity
- Absence of static analysis or linting tools increases risk of runtime errors and uncaught edge cases
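The bare-except pattern flagged above is typically replaced by narrow exception handling plus structured logging. A minimal illustrative sketch of that pattern, not the project's code:

```python
import logging

logger = logging.getLogger("gpu_hot.collector")

def safe_collect(collect_fn, default=None):
    """Call a collector, logging failures with context instead of swallowing them."""
    try:
        return collect_fn()
    except (OSError, RuntimeError, ValueError) as exc:  # narrow, not a bare except
        logger.warning("metric collection failed: %s", exc, exc_info=True)
        return default
```

The explicit exception tuple keeps genuinely unexpected errors visible, and the log record preserves the traceback that a silent `except:` would discard.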
What Makes It Unique
- Native dual-mode architecture scales from single-GPU workstations to multi-node clusters within one codebase
- Intelligent fallback from NVML to nvidia-smi with deep hardware telemetry including PCIe link width and encoder/decoder metrics
- Real-time cluster-wide aggregation via WebSocket hub that preserves per-device granularity across heterogeneous nodes
- Dynamic polling intervals adapt based on underlying API to optimize resource usage while maintaining data fidelity
- First-class automatic update detection, with semantic-version comparison and GitHub release parsing integrated directly into the API
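The NVML-first, nvidia-smi-fallback strategy can be sketched roughly as below. The exact queries the project issues are assumptions, though the pynvml calls and nvidia-smi flags shown are standard:

```python
import subprocess

def parse_smi_line(line):
    """Parse one CSV row produced by nvidia-smi --format=csv,noheader,nounits."""
    util, temp, mem_used, mem_total = (float(v) for v in line.split(","))
    return {"util": util, "temp": temp, "mem_used": mem_used, "mem_total": mem_total}

def query_gpu(index=0):
    """Prefer NVML; fall back to shelling out to nvidia-smi (e.g. on older GPUs)."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {"util": util, "temp": temp,
                "mem_used": mem.used / 2**20, "mem_total": mem.total / 2**20}
    except Exception:
        out = subprocess.check_output(
            ["nvidia-smi", f"--id={index}",
             "--query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True)
        return parse_smi_line(out.strip())
```

NVML avoids a subprocess spawn per poll, which is what makes sub-second update intervals cheap; the subprocess path trades that overhead for compatibility with GPUs NVML cannot query.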