Promptfoo
An open-source CLI and library for evaluating and red-teaming LLM applications — replace trial-and-error prompt engineering with systematic evals, vulnerability scanning, and CI/CD integration.
Promptfoo treats LLM application quality as a testing problem rather than a guessing game: instead of manually eyeballing model outputs, you define test cases and assertions, run them across prompts/models/configurations, and get structured pass/fail results comparable in a CI pipeline the way unit tests would be.
Beyond correctness evals, Promptfoo has a dedicated red-teaming mode for probing LLM applications for vulnerabilities — prompt injection, jailbreaks, data leakage, and other adversarial failure modes — treating security testing as part of the same evaluation workflow rather than a separate concern.
The project was acquired by OpenAI but remains MIT-licensed and fully open source per the project’s own announcement, distributed as an npm package (CLI and library) with active development and a large community.
What You Get
- A CLI and library for defining test cases and assertions against LLM outputs, run consistently across prompts and models
- Dedicated red-teaming and vulnerability scanning for prompt injection, jailbreaks, and other adversarial LLM risks
- CI/CD integration so prompt and model regressions are caught automatically like any other test suite
- Side-by-side comparison of outputs across different prompts, models, or configurations
Common Use Cases
- Regression-testing prompts and model changes in CI before deploying an LLM-powered feature
- Red-teaming an LLM application for prompt injection and jailbreak vulnerabilities before launch
- Comparing output quality across different models or prompt variations systematically instead of manual spot-checking
- Building a documented, repeatable eval suite for an LLM application instead of relying on ad hoc testing
Under The Hood
Architecture
Promptfoo’s src/ separates assertions (the pluggable pass/fail check logic), evaluate.ts (the core eval-running engine), codeScan (likely static analysis for vulnerability detection), commands (CLI entry points), and a database layer for storing eval results, with a separate app/ directory for a web UI on top of the CLI/library core. This split lets the same evaluation engine serve CLI users, library consumers, and a browsable results UI without duplicating the core logic.
Tech Stack TypeScript throughout, distributed as an npm package usable both as a CLI tool and as a library, with a web app component for browsing eval results. It integrates with CI/CD systems as a test-runner-style tool rather than requiring a hosted service.
Code Quality Very active, consistently maintained commit history and a large contributor/community base (per GitHub activity and Discord presence) reflect a mature, production-used tool — reinforced by its acquisition into OpenAI while remaining open source, which typically implies continued investment rather than a stalled side project.
What Makes It Unique Promptfoo treats LLM correctness evals and adversarial red-teaming as the same underlying workflow (define cases, run them, get structured results) rather than requiring separate tools for quality testing versus security testing — letting teams catch both a broken prompt and an exploitable jailbreak in the same CI step.
Self-Hosting
Licensing Model MIT licensed — the project explicitly states it “remains open source and MIT licensed” after being acquired by OpenAI.
Self-Hosting Restrictions None found for the open-source CLI/library and its eval-running functionality.
License Key Required No.
Related Apps
Ollama
AI Development · Developer Tools
Run Llama, Gemma, DeepSeek, and other open LLMs on your own machine with one command and an OpenAI-compatible API.
Ollama
MITLangflow
AI Agents · AI Development
Build, test, and deploy AI agents and RAG workflows visually with native API and MCP server export.
Langflow
MITDify
No Code Platforms · AI Development · Developer Tools
Visual LLM workflow platform with RAG pipelines, agent capabilities, and model management for building production AI applications.