Ollama

Run Llama, Gemma, DeepSeek, and other open LLMs on your own machine with one command and an OpenAI-compatible API.

175.5Kstars
16.8Kforks
MIT License
Go

Ollama is a Go-based runtime for downloading, running, and serving large language models locally, backed by Y Combinator (W23). It wraps llama.cpp and a growing set of GPU backends (CUDA, ROCm, Vulkan, Metal, and Apple’s MLX) behind a single CLI and REST API, so ollama run gemma4 is enough to pull a quantized model and start chatting without touching Python, CUDA toolkits, or model-format conversion scripts.

Under the surface, Ollama is a client-server system: a lightweight HTTP server (built on gin) exposes /api/generate, /api/chat, /api/embed, and model-management endpoints, plus an OpenAI-compatible surface so existing OpenAI SDK code can point at localhost:11434 with minimal changes. A scheduler in the server package tracks loaded models, VRAM budgets, and concurrent requests, deciding when to load, share, evict, or retry a model across available GPUs or fall back to CPU.

Models are distributed as Modelfiles — declarative configs that pin a base GGUF model, system prompt, template, and runtime parameters — and stored as content-addressable blobs pulled from Ollama’s own model registry, similar in spirit to how Docker images work. This makes versioning, sharing, and swapping models between machines straightforward without re-downloading multi-gigabyte weights every time.

Beyond local inference, Ollama has grown into an integration hub: it ships CLI launchers for coding agents like Claude Code, Codex, and Copilot CLI, an experimental cloud-offload mode for models too large for consumer hardware, and official client libraries in Python and JavaScript, making it a common default backend for self-hosted chat UIs, RAG pipelines, and agent frameworks.

What You Get

  • A CLI (ollama run, pull, create, list, ps) for pulling, running, and managing local models with no separate Python environment required
  • A local REST API on port 11434 with /api/chat, /api/generate, /api/embed, and streaming responses, plus an OpenAI-compatible endpoint set
  • Automatic hardware detection and backend selection across NVIDIA CUDA, AMD ROCm, Vulkan, and Apple Metal/MLX, with CPU fallback
  • A scheduler that manages multiple loaded models, VRAM allocation, and automatic eviction/retry when a model load fails or runs out of memory
  • Modelfiles for declaring custom system prompts, parameters, and templates on top of a base model, versioned like container images
  • Official Python and JavaScript client libraries, plus one-command launchers for coding agents such as Claude Code, Codex, and Copilot CLI

Common Use Cases

  • Running open-weight chat and coding models entirely offline for privacy-sensitive or air-gapped environments
  • Serving as the local inference backend for self-hosted chat UIs like Open WebUI, LibreChat, and similar community front ends
  • Powering local RAG pipelines and agent frameworks that need an OpenAI-compatible endpoint without sending data to a third-party API
  • Prototyping and evaluating multiple open LLMs (Llama, Gemma, DeepSeek, Qwen, Mistral) quickly by swapping model names in one CLI
  • Giving coding agents and CLI copilots a local, cost-free model backend during development

Under The Hood

Architecture Ollama is a layered Go application: a cobra-based CLI in cmd/ talks over HTTP to a gin-based server in server/, which exposes REST endpoints for chat, generate, embeddings, and model management. A scheduler component tracks which models are loaded, on which GPU, and how much memory each consumes, deciding whether to reuse a running model, load a new one, or evict an idle one under memory pressure, with retry logic for out-of-memory load failures. Below the server sits a hardware discovery layer that probes for CUDA, ROCm, Vulkan, and Apple Metal/MLX devices at startup, and an ml/llm layer that shells out to compiled inference runners (built on llama.cpp and related C++ engines) as subprocesses. Models themselves are stored as content-addressable blobs referenced by Modelfiles, giving the whole system a container-registry-like feel for distributing and versioning weights.

Tech Stack The core is written in Go against a comprehensive module list: gin-gonic for HTTP routing, cobra for the CLI, bubbletea and lipgloss for terminal UI elements, and mattn/go-sqlite3 for local state. The inference engine layer is C/C++ (llama.cpp and GGML) built via CMake alongside the Go toolchain, with platform-specific build paths for CUDA, ROCm, Vulkan, and Apple’s Metal/MLX stacks. Distribution covers native installers for macOS, Windows, and Linux plus an official Docker image, and the project ships client libraries in Python and JavaScript for application integration.

Code Quality The repository carries an extensive test suite spread across the codebase, alongside multiple GitHub Actions workflows covering unit tests, installer verification, and upstream llama.cpp update checks. Linting is enforced through a comprehensive golangci-lint configuration enabling checks like bodyclose, nilerr, and wastedassign, and HTTP handlers consistently use typed error checks with explicit status codes rather than silent failures. Naming and package boundaries are clear (server, discover, llm, ml, api are each single-purpose), which keeps a large codebase navigable despite its size.

What Makes It Unique Ollama’s differentiation isn’t the inference engine itself — that’s largely delegated to llama.cpp and similar backends — but the surrounding developer experience: automatic cross-vendor GPU detection with graceful fallback, a memory-aware scheduler that juggles multiple models without manual VRAM tuning, Docker-like Modelfiles for reproducible model configs, an OpenAI-compatible API for drop-in compatibility with existing tooling, and direct one-command integration with popular coding agents. The result is less a novel inference algorithm and more a comprehensive packaging layer that made local LLMs practical for a broad, non-specialist developer audience.

Self-Hosting

Licensing Model Ollama is released under the MIT license, one of the most permissive open source licenses available. The full source in this repository — CLI, server, scheduler, and GPU discovery code — is open and unrestricted.

Self-Hosting Restrictions There are no license gates, feature flags, or paywalls in the self-hosted application. Anyone can build from source or use the official binaries/Docker image to run the full feature set locally, including multi-GPU scheduling and Modelfile-based customization.

Cloud vs Self-Hosted Ollama also operates an optional hosted service (ollama.com) that offers cloud-offloaded inference for very large models that don’t fit on local hardware, accessed via ollama signin and the same CLI/API. This cloud tier is an added convenience layered on top of the open core, not a requirement — the self-hosted binary is fully functional without ever creating an account.

License Key Required No license key is required for any local functionality. An ollama.com account is only needed to opt into the cloud-model proxy feature.

Join founders buildingwith open source

Opinionated takes, migration guides, cost-saving tips, and insights from the open source ecosystem.

Subscribe on Substack

No spam. Unsubscribe anytime.

Join 750+ subscribers
No spam. Unsubscribe anytime.

Search