Cog
An open-source CLI that packages machine learning models into standard, production-ready Docker containers — no Dockerfile wrangling, no CUDA version hell.
Cog is an open-source command-line tool, built and maintained by Replicate, that packages machine learning models into standard, production-ready Docker containers. Instead of hand-writing a Dockerfile and fighting CUDA, cuDNN, PyTorch, and TensorFlow version mismatches, you describe your model’s environment in a short cog.yaml file and Cog generates an optimized, multi-stage Dockerfile for you — picking the right Nvidia base image, pinning the right Python version, and layering dependencies efficiently for fast rebuilds.
On the model side, you define a predictor in predict.py (or a runner built on the newer BaseRunner class) using plain, typed Python: a setup() method that loads weights once, and a predict()/run() method that takes typed Input arguments and returns typed outputs such as Path or str. From those type annotations, Cog automatically derives an OpenAPI schema and validates every request and response against it, so there’s no API to hand-write and no schema to keep in sync.
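The idea of deriving a schema from type annotations can be sketched with the Python standard library alone. This is an illustration of the general technique, not Cog's implementation: the `predict` function, its parameters, and the helpers `derive_input_schema` and `validate` are hypothetical names, and the output is a small JSON-Schema-like dictionary rather than a full OpenAPI document.

```python
import inspect
from typing import get_type_hints

# Illustrative subset of Python annotation -> JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}
PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def predict(prompt: str, steps: int = 20, guidance: float = 7.5) -> str:
    """Hypothetical model function; a real predictor would run inference here."""
    return f"{prompt} ({steps} steps, guidance {guidance})"


def derive_input_schema(func) -> dict:
    """Build a JSON-Schema-like description of func's inputs from its annotations."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        prop = {"type": JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the caller must supply it
        else:
            prop["default"] = param.default
        properties[name] = prop
    return {"type": "object", "properties": properties, "required": required}


def validate(schema: dict, payload: dict) -> list:
    """Return a list of problems with payload; an empty list means it is valid."""
    errors = [f"missing required input: {name}"
              for name in schema["required"] if name not in payload]
    for name, value in payload.items():
        prop = schema["properties"].get(name)
        if prop is None:
            errors.append(f"unexpected input: {name}")
            continue
        expected = prop["type"]
        # bool is a subclass of int in Python, so treat it as its own case.
        if isinstance(value, bool):
            ok = expected == "boolean"
        else:
            ok = isinstance(value, PYTHON_TYPES[expected])
        if not ok:
            errors.append(f"input {name} should be of type {expected}")
    return errors
```

Calling `derive_input_schema(predict)` marks `prompt` as required and records the defaults for `steps` and `guidance`, so the function signature stays the single source of truth for both documentation and request validation.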
Every image Cog builds embeds a self-contained HTTP inference server, so running the container immediately exposes a RESTful /predictions endpoint — docker run is enough to get a working model API on any machine with Docker (and a GPU, if the model needs one). Under the hood this server is written in Rust (the coglet crate, bridged into the Python process via coglet-python), giving the request-handling path native performance while keeping the model-authoring experience in familiar Python.
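A client for the /predictions endpoint can be sketched with the Python standard library. The sketch assumes the container's server is reachable at a base URL such as http://localhost:5000 and that the request body wraps the model inputs in an "input" object; the helper names and the `prompt` input are hypothetical, and the exact request and response fields should be checked against the schema the running container publishes.

```python
import json
import urllib.request


def build_prediction_request(base_url: str, inputs: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predictions endpoint."""
    # Assumption: the server expects the inputs nested under an "input" key.
    body = json.dumps({"input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_prediction(base_url: str, inputs: dict, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_prediction_request(base_url, inputs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a container running and its port published, `run_prediction("http://localhost:5000", {"prompt": "a cat"})` would return the server's JSON response as a dictionary. Keeping request construction separate from sending makes the payload easy to inspect without a live server.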
Cog is fully standalone software: the resulting Docker image is vanilla Docker with no proprietary runtime, so it can be deployed to your own infrastructure, any cloud, or a Kubernetes cluster — pushing to Replicate’s hosted platform is entirely optional. It’s Apache 2.0 licensed, has no paid tier or feature gate, and is developed in the open with an active release cadence and a large community of contributors.
What You Get
- A Go-based CLI (cog build, cog run, cog serve, cog push, cog train, cog predict) covering the full local-iteration-to-deployment workflow
- Automatic Dockerfile generation from a simple cog.yaml, including Nvidia base image selection and dependency layer caching
- A maintained CUDA/cuDNN/PyTorch/TensorFlow compatibility matrix so you don’t have to guess which versions work together
- A typed Python prediction interface (Input, Path, BaseRunner) that Cog uses to auto-generate an OpenAPI schema and validate requests
- A built-in, Rust-powered HTTP inference server embedded in every image, so containers serve a /predictions endpoint out of the box
- An optional training interface (train()) for exposing a fine-tuning API alongside inference
- Cross-platform installers (Homebrew tap, install script, manual binaries) for macOS, Linux, and Windows via WSL2
Common Use Cases
- Packaging a research model (PyTorch, TensorFlow, or otherwise) into a reproducible, deployable Docker image without needing an infrastructure engineer
- Standardizing how an ML platform team packages and deploys every model, so each one exposes the same HTTP prediction API
- Self-hosting an open-weight model’s inference API on your own GPU infrastructure with no dependency on a hosted inference provider
- Building a reproducible ML artifact whose behavior is pinned to a content-addressable Docker image, including the model weights themselves
- Optionally publishing a Cog-packaged model to Replicate’s hosted platform while retaining a portable, vendor-neutral underlying artifact
Under The Hood
Architecture
Execution starts at the CLI entry point, which wires up a command-based Go CLI exposing subcommands like build, run, serve, push, train, predict, exec, and doctor. When a user builds a model, the CLI reads and validates the project’s cog.yaml configuration file, layering in build options, environment variables, and a maintained CUDA/cuDNN/PyTorch/TensorFlow compatibility matrix, then hands off to a dedicated Dockerfile generator that synthesizes a multi-stage Dockerfile with the correct GPU base image, dependency layer caching, and Python version pinning. The generated image embeds the Python prediction interface alongside the Rust inference server, which bridges the typed Python predictor to a high-performance HTTP server, so at runtime the container exposes a self-describing prediction endpoint whose schema is derived directly from the predictor’s own type annotations. The design is layered by language: Go orchestrates build-time concerns, Rust owns the runtime request path for performance, and Python remains the surface model authors actually write against, which keeps each concern separate and the authoring experience approachable.
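The compatibility-matrix step can be pictured as a lookup that either picks a CUDA version for a requested framework version or rejects an unsupported pairing. The sketch below is a toy model of that idea, not Cog's code: the table holds only two illustrative rows, and `COMPATIBILITY`, `resolve_cuda`, and the error messages are hypothetical.

```python
from typing import Optional

# Illustrative rows only; this is NOT Cog's actual compatibility data.
# Each PyTorch version maps to the CUDA versions assumed to work with it.
COMPATIBILITY = {
    "2.0.1": ["11.7", "11.8"],
    "2.1.0": ["11.8", "12.1"],
}


def _version_key(version: str) -> tuple:
    """Turn "11.8" into (11, 8) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def resolve_cuda(torch_version: str, requested_cuda: Optional[str] = None) -> str:
    """Pick a CUDA version for torch_version, or fail on an unknown pairing."""
    supported = COMPATIBILITY.get(torch_version)
    if supported is None:
        raise ValueError(f"no compatibility data for torch {torch_version}")
    if requested_cuda is None:
        # No explicit request: default to the newest supported CUDA version.
        return max(supported, key=_version_key)
    if requested_cuda not in supported:
        raise ValueError(
            f"torch {torch_version} is not listed as compatible with "
            f"CUDA {requested_cuda}; supported: {', '.join(supported)}"
        )
    return requested_cuda
```

Failing at configuration time, before any image is built, is the point of the design: an invalid pairing surfaces as a validation error rather than as a runtime crash inside the container.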
Tech Stack
The project is deliberately polyglot. The CLI and build tooling are written in Go, using a mature command-parsing library alongside the official Docker CLI and daemon SDKs and a leading buildkit library for image construction, plus container-based integration testing and registry-interaction libraries. The runtime inference server is written in Rust, compiled as a native extension and exposed back to Python as an installable package with a pinned version range. The Python-facing SDK targets a wide range of modern Python versions and depends on a small, focused set of libraries for YAML parsing, structured logging, and HTTP requests, and is packaged using a version-stamping build backend. Tooling is unusually modern for infrastructure software: a single version-pinning manager covers every toolchain, a fast Python resolver manages dependencies, and strict linting/type-checking tools are enforced across both the Python and Go portions of the codebase.
Code Quality
Testing is extensive and layered: well over a hundred Go test files exist across the core packages, with some individual test files running past a thousand lines; a substantial Python test suite covers the predictor, model, input, and type-validation logic; a dedicated integration-test suite exercises real build and run flows end-to-end; and the Rust runtime carries its own unit tests. Continuous integration is rigorous: it runs a test matrix across multiple supported Python versions, shards integration tests across several parallel runners, generates interface mocks automatically, enforces strict linting and type-checking across every language in the repo, and runs dependency-vulnerability and static-analysis security scanning. Error handling favors explicit, typed errors and dedicated configuration-validation errors rather than generic failures, and naming throughout is consistent and domain-driven. This reflects production-grade engineering discipline rather than a side project.
What Makes It Unique
What distinguishes this project from generic “just wrap it in Docker” advice or comparable model-serving frameworks is that it doesn’t merely template a Dockerfile — it maintains and continuously updates real, tested compatibility data encoding which GPU driver, CUDA, and deep-learning-framework combinations actually work together, something most teams otherwise learn through extensive trial and error. It also treats a model as strictly “just a function,” deriving its entire API contract from ordinary, typed Python code rather than requiring a hand-written specification. The hybrid multi-language implementation is unusual for this category: a native runtime component was introduced specifically to make in-container request serving fast without sacrificing the ergonomics Python offers model authors. Because the output is a vanilla, portable container image with no proprietary runtime baked in, packaged models run identically on a laptop, on self-managed GPU infrastructure, or on a commercial hosting platform, avoiding vendor lock-in despite being built by a company that also sells hosting.
Self-Hosting
Cog is fully open-source software licensed under Apache 2.0, with no separate paid tier, license key, or feature gate anywhere in the codebase — a review of the repository found no ee, enterprise, or pro directories, no license-check or feature-flag code (isPro/isEnterprise/requiresLicense), and no pricing or enterprise section in the README. Every capability described in this listing, including GPU compatibility resolution, the built-in HTTP inference server, and the training interface, is available to any user who installs the CLI. Cog is designed to be run entirely on your own infrastructure: the Docker images it produces are vanilla Docker with no dependency on any hosted service, so teams can build, run, and serve models fully self-hosted, on their own cloud, or on-prem GPU hardware. Pushing a packaged model to Replicate’s commercial hosting platform is presented in the documentation as one optional deployment target among several, not a requirement — self-hosting is a first-class, fully supported use case with no functional restrictions compared to using Replicate.
Related Apps
Ollama
AI Development · Developer Tools
Run Llama, Gemma, DeepSeek, and other open LLMs on your own machine with one command and an OpenAI-compatible API.
License: MIT
Langflow
AI Agents · AI Development
Build, test, and deploy AI agents and RAG workflows visually with native API and MCP server export.
License: MIT
Dify
No Code Platforms · AI Development · Developer Tools
Visual LLM workflow platform with RAG pipelines, agent capabilities, and model management for building production AI applications.