Beta9

Serverless GPU inference and AI workloads with zero infrastructure overhead

142 forks

Beta9 is an open-source serverless platform for developers running AI workloads such as LLM inference, background processing, and interactive sandboxes. It removes infrastructure management by providing a Pythonic API for deploying GPU-powered applications with automatic scaling, containerized execution, and built-in storage. Built in Go and released under the AGPL license, it can be self-hosted or used with the managed Beam cloud platform.

The platform uses a custom container runtime to launch containers in under a second, runs on CUDA-enabled GPUs such as the H100, A10G, and 4090, and supports distributed volumes, queue-based autoscaling, and webhooks. It is aimed at developers who need to run fine-tuned models, process large datasets, or deploy inference endpoints without managing Kubernetes or VMs.

What You Get

  • Fast Image Builds - Launch GPU containers in under a second using a custom container runtime optimized for AI workloads
  • Parallelization and Concurrency - Fan out AI tasks across hundreds of containers simultaneously for high-throughput inference or batch processing
  • Serverless Scale-to-Zero - Workloads automatically scale to zero when idle, eliminating costs during inactivity
  • GPU Support - Run on cloud GPUs (H100, A10G, 4090) or bring your own GPU hardware for self-hosted deployments
  • Sandbox Environments - Spin up isolated, ephemeral containers to safely run LLM-generated code or experimental scripts
  • Background Task Queues - Replace Celery with built-in, retryable task queues that auto-scale based on job volume
  • Volume Storage - Mount distributed, persistent storage volumes to share data across containers or retain model outputs
  • Hot-Reloading and Webhooks - Develop locally with live code reloads and trigger workloads via HTTP webhooks or scheduled events
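
The fan-out pattern from the list above can be illustrated locally with Python's standard library. This is only a minimal sketch: `run_inference` and `fan_out` are hypothetical names, and the thread pool is a stand-in for the platform's containers, not Beta9's API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_inference(prompt: str) -> str:
    """Hypothetical stand-in for a per-container inference call."""
    return prompt.upper()


def fan_out(prompts: list[str], max_workers: int = 8) -> list[str]:
    """Run one task per input concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order regardless of completion order.
        return list(pool.map(run_inference, prompts))


results = fan_out(["hello", "world"])
```

On a serverless platform each call would presumably run in its own container instead of a thread, but the calling code keeps the same shape: map a function over inputs and collect the ordered results.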

Common Use Cases

  • Running LLM inference at scale - A startup deploys a fine-tuned Llama 3 model as a serverless endpoint that auto-scales from 0 to 50 containers during peak API usage
  • Processing user-uploaded media at scale - A photo app uses Beta9 to run background image processing tasks on H100 GPUs without managing a dedicated cluster
  • Running experimental AI code safely - A researcher spins up a sandbox container to test untrusted LLM-generated Python code without risking their local machine
  • Replacing Celery with serverless background jobs - A SaaS company migrates its task queue to Beta9 to reduce infrastructure costs and eliminate Redis/worker node management
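
Queue-based autoscaling with scale-to-zero, as in the use cases above, usually comes down to a small sizing rule. The sketch below shows one common form of that rule; the function name, parameters, and the exact formula are assumptions for illustration and are not taken from Beta9's scheduler.

```python
import math


def desired_containers(queue_depth: int, tasks_per_container: int, max_containers: int) -> int:
    """Return how many containers to run for the current backlog.

    Hypothetical sizing rule: one container per `tasks_per_container`
    pending tasks, capped at `max_containers`, and zero when idle.
    """
    if tasks_per_container <= 0:
        raise ValueError("tasks_per_container must be positive")
    if queue_depth <= 0:
        return 0  # scale to zero when there is no pending work
    return min(max_containers, math.ceil(queue_depth / tasks_per_container))
```

With a cap of 50 containers, an empty queue yields 0, a backlog of 25 tasks at 10 tasks per container yields 3, and a very large backlog saturates at 50.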

Under The Hood

Architecture

  • The repository exhibits a microservices-based architecture with clear separation between a Next.js frontend and backend components (gateway, worker, runner).
  • Containerization is central to the design, with each service packaged as a Docker image.
  • A Makefile streamlines build, test, and deployment processes, leveraging various tools for automation.
  • Protocol Buffers are used for API definition and inter-service communication.

Tech Stack

  • The backend is primarily built with Go, while the frontend utilizes Next.js.
  • Docker is heavily used for containerization, with multi-stage builds and specific Python versions managed via Micromamba.
  • Kubernetes orchestrates deployments, with k3d for local development and kustomize for configuration.
  • gRPC facilitates communication between backend services.

Code Quality

  • A comprehensive suite of unit and integration tests demonstrates a strong commitment to quality.
  • Error handling is present, though the use of custom error classes is limited.
  • Naming conventions are generally consistent, with some stylistic variations.
  • Type hints are used to enhance code readability and maintainability, but aren’t universally applied.
  • Serializing function results with cloudpickle is a notable pattern for remote execution.
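
The result-serialization pattern mentioned above can be sketched as follows. This is an illustrative round trip, not the project's actual code: `run_remote` and `add` are hypothetical names, and the example falls back to the standard-library `pickle` when `cloudpickle` is not installed.

```python
import pickle

try:
    import cloudpickle as serializer  # third-party; also handles lambdas and closures
except ImportError:
    serializer = pickle  # stdlib fallback, sufficient for plain data


def run_remote(func, *args):
    """Execute func and return its result as bytes, as a worker might."""
    return serializer.dumps(func(*args))


def add(a, b):
    return {"sum": a + b}


payload = run_remote(add, 2, 3)
# Bytes produced by cloudpickle.dumps can be read back with pickle.loads.
result = pickle.loads(payload)
```

The caller only ever sees bytes on the wire, which is what lets the function run in a different process or container from the code that requested it.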

What Makes It Unique

  • The project features a custom RunnerAbstraction class for parsing and validating resource requests.
  • Dynamic module loading via load_module_spec enables a flexible runtime environment.
  • Integration with LocalStack, including automated S3 bucket creation, provides a self-contained development and testing environment.
  • Support for GPU-accelerated workloads is indicated by the inclusion of an NVIDIA device plugin component.
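
Dynamic module loading of the kind described above can be done with Python's `importlib`. The sketch below is a generic illustration under that assumption; `load_handler` is a hypothetical helper and does not reproduce the project's `load_module_spec`.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_handler(path: str, func_name: str):
    """Import the file at `path` as a module and return the named attribute."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, func_name)


# Write a throwaway user module, then load and call its handler.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "user_code.py"
    source.write_text("def handler(x):\n    return x * 2\n")
    handler = load_handler(str(source), "handler")
    doubled = handler(21)
```

Loading by file path rather than by package name lets a runner execute user code that was never installed into its own environment.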
