Python eval suites that version every prompt, track the cost of every call, and catch drift with statistical tests. Edit prompts live, capture traces, set budgets. One SQLite file on your laptop, no account required.
From the first commit to the cost of every call in production.
Every track() call is content-hashed. Diff any two versions.
@suite.render_prompt() serves dashboard-edited templates live, with no redeploy.
No server, no account, no telemetry. Just a CLI and a SQLite file.
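To make the content-hashing idea concrete, here is a minimal standard-library sketch. It is not promptry's implementation: `content_hash` and `diff_versions` are hypothetical helper names, and the choice of SHA-256 truncated to 12 characters is an assumption made for illustration.

```python
import difflib
import hashlib


def content_hash(prompt: str) -> str:
    """Return a short, stable identifier derived only from the prompt text.

    Hypothetical helper: SHA-256 and the 12-character prefix are assumptions.
    """
    # Normalize line endings and outer whitespace so cosmetic differences
    # do not produce a new version.
    normalized = prompt.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def diff_versions(old: str, new: str) -> str:
    """Return a unified diff between two prompt versions, labeled by hash."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=content_hash(old),
        tofile=content_hash(new),
        lineterm="",
    )
    return "\n".join(lines)


v1 = "Answer the question using only the context.\nBe concise."
v2 = "Answer the question using only the context.\nCite your sources."

print(content_hash(v1) == content_hash(v1 + "\n"))  # True: same content, same hash
print(content_hash(v1) == content_hash(v2))         # False: an edit yields a new version
print(diff_versions(v1, v2))
```

Because the identifier depends only on the text, an unchanged prompt never creates a new version, and any edit does, which is what makes versioning automatic rather than manual.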
Code instead of YAML. Local instead of cloud. Deterministic assertions instead of a metered LLM-judge tax. Together, in one tool.
| Dimension | promptry | Promptfoo | RAGAS | LangSmith | DeepEval |
|---|---|---|---|---|---|
| Authoring surface | Python decorators | YAML | Python metrics | SaaS UI | Python |
| Data location | Local SQLite | Local | Local | Their cloud | Local + push |
| Account required | No, never | No | No | Yes | No (OSS) |
| CI cost per run | $0 deterministic | Mixed | Per-judge-call | Trace volume | Per-judge-call |
| Prompt versioning | Automatic content-hash | Manual + git | · | Prompt Hub | · |
| Live prompt editing | Dashboard, no redeploy | · | · | Prompt Hub (cloud) | · |
| Drift detection | Mann-Whitney U + p-values | · | · | Dashboards only | · |
| Cost budgets + alerts | Daily/monthly caps | · | · | Usage charts only | · |
| MCP server | Native | Plugin | · | · | Partial |
| Commercial tier | None planned | Enterprise | None | SaaS | Confident AI |
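The "Drift detection" row refers to the Mann-Whitney U test, which asks whether two samples of scores plausibly come from the same distribution without assuming they are normally distributed. The sketch below is a standard-library illustration of that test, not promptry's code: the function name, the sample scores, and the 0.05 threshold are all assumptions. In practice `scipy.stats.mannwhitneyu` is the usual choice.

```python
import math


def mann_whitney_u(baseline, current):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p_value), where U is the statistic for `baseline`.
    Ties receive average ranks and the variance is tie-corrected.
    The approximation is reasonable for roughly 8 or more values per sample.
    """
    n1, n2 = len(baseline), len(current)
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(baseline) + list(current)))

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of equal values starting at position i.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u1, 1.0
    # Continuity correction, clamped so z is never negative.
    z = max((abs(u1 - mean) - 0.5) / math.sqrt(var), 0.0)
    p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u1, p_value


# Illustrative eval scores (made up): last week's runs vs. this week's runs.
baseline = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.94, 0.87]
current = [0.78, 0.81, 0.75, 0.80, 0.79, 0.83, 0.77, 0.82]

u, p = mann_whitney_u(baseline, current)
print(f"U={u}, p={p:.4f}")
if p < 0.05:  # assumed significance threshold
    print("Scores have shifted: possible drift")
```

A rank-based test suits eval scores because they are often bounded and skewed, and a p-value gives a pass/fail signal that a dashboard chart alone does not.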
Install, wrap one prompt, write one assertion. Promptry takes it from there. Deterministic assertions are the default; LLM-as-judge is opt-in, for when you actually want to spend the tokens.
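As a rough picture of what "deterministic assertion" means here, the following plain-Python sketch checks a model response without calling another model. The helper names and the sample response are hypothetical and are not promptry's API.

```python
import json


def assert_valid_json_with_keys(output: str, required_keys: set[str]) -> dict:
    """Fail unless the output parses as a JSON object with every required key."""
    data = json.loads(output)  # raises ValueError on malformed JSON
    assert isinstance(data, dict), "expected a JSON object"
    missing = required_keys - data.keys()
    assert not missing, f"missing keys: {sorted(missing)}"
    return data


def assert_mentions(output: str, phrases: list[str]) -> None:
    """Fail unless every phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    missing = [p for p in phrases if p.lower() not in lowered]
    assert not missing, f"missing phrases: {missing}"


# Hypothetical model response, hard-coded so the example makes no API call.
response = '{"answer": "Paris", "sources": ["doc-3"]}'

parsed = assert_valid_json_with_keys(response, {"answer", "sources"})
assert_mentions(parsed["answer"], ["paris"])
print("all deterministic checks passed")
```

Checks like these give the same verdict on every run and cost nothing in CI, which is the trade-off against a judge model that can grade open-ended quality but bills per call.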