MLflow
The open source AI engineering platform for debugging, evaluating, monitoring, and optimizing production LLMs and agents at scale.
MLflow is the largest open source AI engineering platform for agents, LLMs, and machine learning models. It enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data. With over 60 million monthly downloads and support for 60+ frameworks, MLflow is trusted by organizations worldwide to ship AI to production with confidence.
Built on OpenTelemetry and designed for extensibility, MLflow supports Python, TypeScript, Java, and R. It provides a unified server, REST API, and web UI that integrates natively with LangChain, LangGraph, OpenAI Agents, DSPy, PydanticAI, CrewAI, LlamaIndex, AutoGen, Google ADK, Strands, and Apache Spark.
MLflow 3.x introduced a major pivot toward GenAI workflows: an integrated AI Gateway for multi-provider LLM routing, a prompt registry with versioning and Jinja2 templates, automated prompt optimization, LLM judge evaluation with the MemAlign optimizer, multi-turn conversation simulation for agent testing, distributed tracing across services, and built-in cost tracking across providers. Classical ML workflows — experiment tracking, model registry, deployment — remain fully supported alongside the new GenAI capabilities.
Deployment options include local servers, Docker, Kubernetes, and managed cloud platforms like AWS SageMaker and Azure ML, with a Databricks-managed option for enterprise teams who want fully managed infrastructure.
What You Get
- OpenTelemetry-native LLM tracing - Captures end-to-end distributed traces of LLM calls and agent workflows across any framework, recording latency, token usage, cost, and custom attributes with automatic instrumentation for 60+ libraries.
- LLM evaluation framework with 50+ judges - Run systematic evaluations using built-in scorers (groundedness, relevance, safety, tool-call correctness) or define custom LLM judges with the Judge Builder UI, then track quality metrics over time to catch regressions before production.
- Prompt registry with versioning and optimization - Version and deploy prompt templates with full lineage tracking and Jinja2 template support; automatically optimize prompts using the MemAlign or GEPA algorithms to improve model performance without manual iteration.
- AI Gateway with multi-provider routing - Unified OpenAI-compatible API gateway embedded in the tracking server that routes requests across providers, manages credentials, enforces rate limits, implements traffic splitting for A/B testing, and tracks usage and costs per endpoint.
- Model Registry with lifecycle management - Collaboratively manage the full lifecycle of ML models from registration through staging to production, with model versioning, metadata, lineage tracking, and deployment workflows.
- Multi-turn conversation simulation - Systematically test agent behavior by simulating conversations against goal/persona pairs, evaluating session-level quality with scorers, and distilling conversations into reusable test cases.
- Agent Server and deployment integrations - Deploy AI agents and ML models to batch and real-time scoring on Docker, Kubernetes, Azure ML, AWS SageMaker, and more with a FastAPI-based serving layer that includes automatic request validation and streaming support.
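The traffic splitting mentioned for the AI Gateway can be pictured as weighted random selection between upstream targets. The sketch below is a minimal illustration of that idea only; `Route`, `pick_route`, `split_counts`, and the provider and model names are hypothetical and are not MLflow's gateway API or configuration format.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One upstream target behind an endpoint (hypothetical structure)."""
    provider: str
    model: str
    weight: float  # relative share of traffic, not a percentage


def pick_route(routes: list[Route], rng: random.Random) -> Route:
    """Choose a route with probability proportional to its weight."""
    weights = [route.weight for route in routes]
    return rng.choices(routes, weights=weights, k=1)[0]


def split_counts(routes: list[Route], n_requests: int, seed: int = 0) -> dict[str, int]:
    """Simulate n_requests and count how many land on each model."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    counts = {route.model: 0 for route in routes}
    for _ in range(n_requests):
        counts[pick_route(routes, rng).model] += 1
    return counts


if __name__ == "__main__":
    # Placeholder names: send roughly 90% of traffic to A and 10% to B.
    routes = [
        Route("provider-a", "model-a", weight=90),
        Route("provider-b", "model-b", weight=10),
    ]
    print(split_counts(routes, 10_000))
```

Weights are relative, so 90 and 10 behave the same as 9 and 1; a real gateway would also need sticky assignment if the same user should always see the same variant.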
Common Use Cases
- Instrumenting a LangChain agent for production - A developer adds mlflow.langchain.autolog() to capture full execution traces, including sub-chain calls and tool invocations, then uses the MLflow UI to debug failures and configure automatic LLM judge evaluation on incoming traces without writing evaluation pipeline code.
- Standardizing LLM access across teams - A platform team deploys the MLflow AI Gateway to provide all application teams with a single OpenAI-compatible endpoint that abstracts provider differences, enforces per-team rate limits, rotates credentials centrally, and logs every request for cost attribution.
- Iterating on prompt quality systematically - A product team uses the prompt registry to version 20+ variants across experiments, runs automatic evaluation with groundedness and relevance judges on a golden dataset, and uses the MemAlign optimizer to align custom judges with human feedback from annotators.
- Monitoring model quality in production - An ML team configures online scorers to automatically run safety and quality checks on every incoming trace, sets up the Overview dashboard to track latency and quality trends, and uses the conversation simulator to regression-test agent behavior before each release.
- Managing the classical ML lifecycle - A data science team tracks hyperparameters, metrics, and model artifacts across hundreds of training runs using experiment tracking, promotes the best model through staging to production in the Model Registry, and deploys it as a REST endpoint on Kubernetes.
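The cost attribution described in the gateway use case comes down to multiplying logged token counts by a per-model price and grouping by team. A minimal sketch of that arithmetic follows; the `RequestLog` record, the team and model names, and the prices are all invented for illustration and do not reflect MLflow's log schema or any provider's actual pricing.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real provider pricing differs.
PRICE_PER_1K = {
    "model-a": {"input": 0.50, "output": 1.50},
    "model-b": {"input": 0.10, "output": 0.40},
}


@dataclass(frozen=True)
class RequestLog:
    """One logged request (hypothetical record shape)."""
    team: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(log: RequestLog) -> float:
    """Cost of a single request from its token counts."""
    price = PRICE_PER_1K[log.model]
    return (
        log.input_tokens * price["input"] + log.output_tokens * price["output"]
    ) / 1000


def cost_by_team(logs: list[RequestLog]) -> dict[str, float]:
    """Sum request costs per team."""
    totals: dict[str, float] = defaultdict(float)
    for log in logs:
        totals[log.team] += request_cost(log)
    return dict(totals)


if __name__ == "__main__":
    logs = [
        RequestLog("search", "model-a", input_tokens=1000, output_tokens=1000),
        RequestLog("search", "model-b", input_tokens=2000, output_tokens=0),
        RequestLog("support", "model-b", input_tokens=0, output_tokens=1000),
    ]
    for team, cost in cost_by_team(logs).items():
        print(f"{team}: ${cost:.2f}")
```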
Under The Hood
Architecture
MLflow employs a deeply modular, layered architecture where the tracking server, model registry, artifact storage, and deployment targets are each encapsulated behind abstract interfaces (AbstractStore, AbstractArtifactRepository, and deployment plugin protocols). The gateway, tracing, and evaluation subsystems are self-contained modules connected through well-defined public APIs rather than shared state. Dependency injection is realized through factory functions and entry-point-based plugin resolution, allowing storage backends and deployment targets to be swapped without touching core logic. OpenTelemetry is woven through as the telemetry backbone, and the fluent API layer provides a clean facade over the lower-level client API. Multi-workspace support and the AI Gateway were added in the 3.x series without architectural breakage, demonstrating the design’s resilience to extension.
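The combination of an abstract store interface and factory-based resolution can be sketched in a few lines. The example below assumes a much smaller interface than MLflow's real AbstractStore and resolves backends by URI scheme; `RunStore`, `InMemoryStore`, `StoreRegistry`, and the `memory://` scheme are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Callable
from urllib.parse import urlparse


class RunStore(ABC):
    """Hypothetical storage interface; MLflow's real AbstractStore is far larger."""

    @abstractmethod
    def log_metric(self, run_id: str, key: str, value: float) -> None: ...

    @abstractmethod
    def get_metrics(self, run_id: str) -> dict[str, float]: ...


class InMemoryStore(RunStore):
    """Toy backend that keeps metrics in a dictionary."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._metrics: dict[str, dict[str, float]] = {}

    def log_metric(self, run_id: str, key: str, value: float) -> None:
        self._metrics.setdefault(run_id, {})[key] = value

    def get_metrics(self, run_id: str) -> dict[str, float]:
        return dict(self._metrics.get(run_id, {}))


class StoreRegistry:
    """Maps a URI scheme to a factory so callers never name a concrete class."""

    def __init__(self) -> None:
        self._builders: dict[str, Callable[[str], RunStore]] = {}

    def register(self, scheme: str, builder: Callable[[str], RunStore]) -> None:
        self._builders[scheme] = builder

    def get_store(self, uri: str) -> RunStore:
        scheme = urlparse(uri).scheme
        if scheme not in self._builders:
            raise ValueError(f"no store registered for scheme {scheme!r}")
        return self._builders[scheme](uri)


if __name__ == "__main__":
    registry = StoreRegistry()
    registry.register("memory", InMemoryStore)
    store = registry.get_store("memory://local")
    store.log_metric("run-1", "rmse", 0.25)
    print(store.get_metrics("run-1"))
```

Because callers depend only on `RunStore`, swapping the backend means registering a different builder rather than editing the calling code, which is the property the paragraph above describes.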
Tech Stack
MLflow is a polyglot platform with a Python 3.10+ core. The server layer runs on FastAPI with Uvicorn for async handling, alongside Flask for legacy endpoints, and uses SQLAlchemy and Alembic for persistence across SQLite, PostgreSQL, and MySQL. The OpenTelemetry SDK handles distributed tracing with OTLP export. The web UI is built in React and TypeScript, accounts for over a quarter of the codebase, and uses a design system shared with Databricks. Protobuf defines internal API contracts with REST serialization. Code quality is enforced by Ruff, Mypy, and Prettier via pre-commit hooks. The integration surface spans 60+ frameworks through autologging decorators and OpenTelemetry instrumentation, with support for LangChain, OpenAI, Anthropic, DSPy, CrewAI, LlamaIndex, AutoGen, Strands, and more.
Code Quality
MLflow has a comprehensive test suite mirroring the production module structure, with dedicated test directories for every major subsystem including tracing, evaluation, gateway, store backends, and all framework integrations. Tests use pytest with asyncio support, mocking, benchmark fixtures, and timeout enforcement. Type annotations are pervasive throughout the codebase, with Pydantic models for API contracts and dataclasses for internal entities. Error handling uses structured MlflowException hierarchies aligned with HTTP status codes, avoiding silent failures. CI enforces Ruff, Mypy, and Prettier on every pull request. The abstract store pattern enables backend-agnostic testing and consistent behavior validation across storage implementations.
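An exception hierarchy aligned with HTTP status codes generally means each error class carries its own status, so a single handler can turn any failure into a structured response. The sketch below illustrates the pattern with invented classes; `AppError`, `ResourceNotFound`, `InvalidParameter`, and `handle` are hypothetical and do not reproduce MlflowException or its error codes.

```python
from http import HTTPStatus


class AppError(Exception):
    """Base error (hypothetical); subclasses override http_status."""

    http_status = HTTPStatus.INTERNAL_SERVER_ERROR

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message

    def to_response(self) -> tuple[int, dict[str, str]]:
        """Render the error as a (status code, JSON-like body) pair."""
        body = {"error_code": type(self).__name__, "message": self.message}
        return int(self.http_status), body


class ResourceNotFound(AppError):
    http_status = HTTPStatus.NOT_FOUND


class InvalidParameter(AppError):
    http_status = HTTPStatus.BAD_REQUEST


def get_run(runs: dict[str, dict], run_id: str) -> dict:
    """Look up a run, raising a typed error instead of returning None."""
    if not run_id:
        raise InvalidParameter("run_id must be a non-empty string")
    if run_id not in runs:
        raise ResourceNotFound(f"run {run_id!r} does not exist")
    return runs[run_id]


def handle(runs: dict[str, dict], run_id: str) -> tuple[int, dict]:
    """Single place where typed errors become HTTP-style responses."""
    try:
        return int(HTTPStatus.OK), get_run(runs, run_id)
    except AppError as err:
        return err.to_response()


if __name__ == "__main__":
    runs = {"r1": {"status": "FINISHED"}}
    for run_id in ("r1", "missing", ""):
        print(handle(runs, run_id))
```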
What Makes It Unique
MLflow’s most distinctive capability is its unified observation-evaluation feedback loop: traces captured via OpenTelemetry can be evaluated directly using LLM judges without a separate data pipeline, and judge alignment can be improved over time using the MemAlign optimizer, which learns generalized guidelines from human feedback stored as trace assessments. The AI Gateway is embedded in the tracking server rather than running as a separate process, enabling automatic trace capture of every gateway request linked to experiments. The multi-turn conversation simulator enables systematic agent testing against goal/persona pairs before deployment. Together these features create a closed-loop system for developing, testing, monitoring, and iterating on AI applications that no other open source tool covers end-to-end.
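The idea of scoring traces in place, with results stored back on the trace as assessments, can be shown with a small sketch. It uses deterministic rule-based scorers as stand-ins for LLM judges, and the `Trace` shape, scorer names, and `evaluate` function are hypothetical rather than MLflow's tracing or evaluation API.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class Trace:
    """Simplified trace record (hypothetical shape)."""
    request: str
    response: str
    latency_ms: float
    assessments: dict[str, float] = field(default_factory=dict)


Scorer = Callable[[Trace], float]


def non_empty_response(trace: Trace) -> float:
    """Rule-based stand-in for a judge: 1.0 if the model answered at all."""
    return 1.0 if trace.response.strip() else 0.0


def within_latency_budget(trace: Trace, budget_ms: float = 2000.0) -> float:
    """1.0 if the call finished within the latency budget."""
    return 1.0 if trace.latency_ms <= budget_ms else 0.0


def evaluate(traces: list[Trace], scorers: dict[str, Scorer]) -> dict[str, float]:
    """Attach each score to its trace and return the mean score per scorer."""
    if not traces:
        raise ValueError("need at least one trace to evaluate")
    for trace in traces:
        for name, scorer in scorers.items():
            trace.assessments[name] = scorer(trace)
    return {name: mean(t.assessments[name] for t in traces) for name in scorers}


if __name__ == "__main__":
    traces = [
        Trace("What is MLflow?", "An open source platform.", latency_ms=500),
        Trace("Summarize this.", "", latency_ms=3000),
    ]
    scorers = {"non_empty": non_empty_response, "latency": within_latency_budget}
    print(evaluate(traces, scorers))
```

Keeping the scores on the trace itself is what removes the separate data pipeline: the same record holds both what happened and how it was judged, so later human feedback can be compared against it.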
Self-Hosting
MLflow is released under the Apache License 2.0, a permissive open source license that imposes no restrictions on commercial use, modification, or distribution. You can use MLflow in proprietary software, sell services built on it, and modify the source code without any obligation to release your changes. The copyright holder is Databricks, Inc., but the Apache 2.0 license grants all users the same rights regardless of company size or intended use.
Running MLflow yourself requires meaningful operational investment. A production deployment needs a persistent database (PostgreSQL or MySQL recommended over SQLite for concurrency), an artifact storage backend such as S3, GCS, or Azure Blob Storage, and a process manager for the tracking server. The team is responsible for database migrations when upgrading MLflow versions, backup and disaster recovery, horizontal scaling behind a load balancer for high availability, and securing the server with authentication. The basic auth plugin is included but SSO with OIDC or SAML requires configuration and maintenance. Larger deployments with many concurrent agents or high trace volume will require database tuning and storage lifecycle management.
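As a rough picture of how the database and artifact store come together, the sketch below assembles an `mlflow server` command line from its parts. The flags shown (`--backend-store-uri`, `--default-artifact-root`, `--host`, `--port`) are standard options of the MLflow CLI; the helper function, hostname, credentials, and bucket name are placeholders, and the script only prints the command rather than starting a server.

```python
import shlex


def build_server_command(
    backend_store_uri: str,
    artifact_root: str,
    host: str = "0.0.0.0",
    port: int = 5000,
) -> list[str]:
    """Assemble the argument list for launching a tracking server."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
        "--host", host,
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Placeholder values: substitute your own database URI and bucket.
    command = build_server_command(
        backend_store_uri="postgresql://mlflow:CHANGE_ME@db.internal.example:5432/mlflow",
        artifact_root="s3://example-mlflow-artifacts",
    )
    # Print a shell-safe string for a process manager or container entrypoint.
    print(shlex.join(command))
```

In practice the credentials would come from a secret store or environment variables rather than being embedded in the URI, and the process would sit behind a load balancer and an authentication layer as described above.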
Databricks offers MLflow as a fully managed service within the Databricks Lakehouse Platform, which adds enterprise features not available in the open source version: automatic high availability and scaling, Unity Catalog integration for fine-grained access control, managed vector stores, enterprise SSO, SLAs, and dedicated support. Organizations evaluating self-hosting should weigh the engineering overhead of operating the database, artifact store, and server process against the cost of Databricks. Teams with existing cloud infrastructure and DevOps capacity typically find self-hosting viable; teams without dedicated ML platform engineers often find the managed offering more cost-effective at scale.