Catch prompt regressions before your users do.

Python eval suites that version every prompt, track the cost of every call, and catch drift with statistical tests. Edit prompts live, capture traces, set budgets. One SQLite file on your laptop, no account required.

py $ pip install promptry
js $ npm install promptry-js
MIT license / SQLite storage / Zero telemetry
~/rag-service · promptry
from promptry import track, suite, assert_semantic
 
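# track() content-hashes and versions this prompt under the name "rag-qa"
prompt = track("You are a helpful assistant...", "rag-qa")
# llm is your own client; its remaining arguments are elided here
response = llm.chat(system=prompt, ...)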
 
# regression suite
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")
$ promptry run rag-regression --compare prod
 
PASS test_quality (142ms)
   semantic(0.891) ok
PASS test_grounding (98ms)
   grounding(0.942) ok
WARN test_safety (218ms)
   refusal drifted -0.04 vs baseline
 
────────────────────────────────
Overall: PASS  score: 0.913  prod: 0.921 (-0.008)
 
→ open dashboard http://localhost:8420
$ promptry prompt diff rag-qa 12 13
 
rag-qa v12 → v13
- You are a helpful assistant. Answer concisely.
+ You are a helpful assistant. Answer in 2-3 sentences,
+ citing sources with [n] markers where relevant.
 
impact across last 500 runs:
  groundedness ↑ 0.847 → 0.921
  avg tokens   ↑ 142 → 188
  cost / 1k    ↑ $0.21 → $0.28
  refusal rate ↓ 2.1% → 1.4%
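A rough sketch of the same kind of diff using Python's standard difflib. It is not promptry's implementation. The two strings are copied from the v12 → v13 output above, and prompt_diff is a hypothetical helper name.

```python
import difflib

v12 = "You are a helpful assistant. Answer concisely."
v13 = (
    "You are a helpful assistant. Answer in 2-3 sentences,\n"
    "citing sources with [n] markers where relevant."
)

def prompt_diff(old: str, new: str) -> list[str]:
    """Return only the removed (-) and added (+) lines between two prompts."""
    lines = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits "? " hint lines and "  " unchanged lines; drop both.
    return [line for line in lines if line.startswith(("- ", "+ "))]

changes = prompt_diff(v12, v13)
print("\n".join(changes))
```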
~/capabilities

Keep a prompt honest.

From the first commit to the cost of every call in production.

#Prompt versioning

-Tweak the prompt, ship it, forget which version is live.
+Every track() call is content-hashed. Diff any two versions.
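Content-hashed versioning can be pictured roughly like this. The sketch assumes SHA-256 truncated to 7 hex characters and an in-memory registry; PromptRegistry and its track method are hypothetical stand-ins, not the library's actual internals.

```python
import hashlib

class PromptRegistry:
    """Toy in-memory registry: a new version appears only when the text changes."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}  # name -> content hashes, oldest first

    def track(self, text: str, name: str) -> tuple[int, str]:
        # Assumption: SHA-256, shortened to 7 hex characters for display.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:7]
        hashes = self._versions.setdefault(name, [])
        if digest not in hashes:
            hashes.append(digest)
        return hashes.index(digest) + 1, digest

registry = PromptRegistry()
v1 = registry.track("You are a helpful assistant.", "rag-qa")
same = registry.track("You are a helpful assistant.", "rag-qa")
v2 = registry.track("You are a helpful assistant. Answer concisely.", "rag-qa")
print(v1, same, v2)
```

Tracking identical text twice returns the same version number, so re-running unchanged code never creates a new version.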

#Eval suites

-Ad-hoc asserts scattered across notebooks and PR comments.
+Eight assertion types, including semantic, schema, judge, and tool-use, all in one @suite.
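To make "deterministic assertion" concrete, here is a minimal schema check in plain Python. It only illustrates the idea; check_schema is a hypothetical helper, not promptry's schema assertion, and the sample response is made up.

```python
import json

def check_schema(raw: str, required: dict[str, type]) -> dict:
    """Parse a JSON response and verify that required keys have the expected types."""
    data = json.loads(raw)
    for key, expected in required.items():
        if key not in data:
            raise AssertionError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise AssertionError(f"{key} should be {expected.__name__}")
    return data

# Hypothetical model output for illustration.
response = '{"answer": "Converts light into chemical energy", "sources": [1, 2]}'
parsed = check_schema(response, {"answer": str, "sources": list})
print(parsed["answer"])
```

No model call is involved, so a check like this costs nothing per CI run and gives the same verdict every time.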

#Regression detection

-Prompt works. Six weeks later, nobody knows what broke it.
+Compare any run to a baseline, then bisect the history to the first run that broke it.
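Bisecting a run history is a binary search for the first failing entry. The sketch below assumes the oldest run passes, the newest fails, and a break persists once introduced. The run IDs, scores, and the 0.85 threshold are hypothetical.

```python
from typing import Callable, Sequence

def first_bad(history: Sequence[str], is_bad: Callable[[str], bool]) -> int:
    """Return the index of the first failing entry.

    Assumes history[0] is good, history[-1] is bad, and that every
    run after the first failure also fails.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[lo] good, history[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical scores per run; anything under 0.85 counts as broken.
scores = {"run-1": 0.91, "run-2": 0.90, "run-3": 0.92, "run-4": 0.80, "run-5": 0.79}
runs = list(scores)
culprit = runs[first_bad(runs, lambda r: scores[r] < 0.85)]
print(culprit)  # run-4
```

With N runs in the history this needs about log2(N) evaluations instead of N.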

#Live prompt CMS

-Fixing a typo in a prompt means a code change and a redeploy.
+render_prompt() serves dashboard-edited templates live — no redeploy.
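One way to picture "no redeploy": the template is read from storage on every call, so an edit takes effect on the next request. This sketch uses an in-memory SQLite table and string.Template. The table layout and the render function are assumptions for illustration, not render_prompt()'s real behavior.

```python
import sqlite3
from string import Template

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE templates (name TEXT PRIMARY KEY, body TEXT)")
db.execute(
    "INSERT INTO templates VALUES (?, ?)",
    ("rag-qa", "Answer in $style. Question: $question"),
)

def render(name: str, **variables: str) -> str:
    """Look the template up at call time, so stored edits apply immediately."""
    row = db.execute("SELECT body FROM templates WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    return Template(row[0]).substitute(**variables)

before = render("rag-qa", style="one sentence", question="What is photosynthesis?")

# Simulate a dashboard edit: only the stored row changes, not the code.
db.execute(
    "UPDATE templates SET body = ? WHERE name = ?",
    ("Answer in $style, citing sources. Question: $question", "rag-qa"),
)
after = render("rag-qa", style="one sentence", question="What is photosynthesis?")
print(before)
print(after)
```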

#Drift detection

-Scores slowly crawl down. You notice in Q3.
+Rolling-window regression + Mann-Whitney U catches slow degradation.
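For intuition, here is a standard-library sketch of the Mann-Whitney U test comparing a baseline window of scores with a recent one. It uses the normal approximation without tie or continuity corrections, so it may differ slightly from whatever promptry computes. The rolling-window regression is not covered, and both score lists are hypothetical.

```python
import math

def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U statistic for x, two-sided p-value via the normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j for a tie group
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p

# Hypothetical eval scores: an older baseline window and the most recent window.
baseline = [0.92, 0.91, 0.93, 0.92, 0.90, 0.94, 0.91, 0.93]
recent = [0.88, 0.89, 0.87, 0.90, 0.86, 0.88, 0.89, 0.87]
u, p = mann_whitney_u(baseline, recent)
print(f"U={u}, p={p:.4f}")
```

A rank-based test like this makes no assumption that scores are normally distributed, which suits bounded metrics that cluster near 1.0.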

#Cost tracking

-Bill spikes. Which prompt? Which model? Nobody knows.
+Module → prompt → call drill-down, plus daily/monthly budget caps with breach alerts.
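The drill-down and budget cap amount to a grouped sum and a threshold check. A minimal sketch with hypothetical call records follows. The Call fields, rollup, and over_budget are illustrative names, not promptry's API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Call:
    module: str
    prompt: str
    cost_usd: float

def rollup(calls: list[Call]) -> dict[str, dict[str, float]]:
    """Aggregate spend as module -> prompt -> total USD."""
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for call in calls:
        totals[call.module][call.prompt] += call.cost_usd
    return {module: dict(prompts) for module, prompts in totals.items()}

def over_budget(calls: list[Call], daily_cap_usd: float) -> bool:
    """True when the day's total spend exceeds the cap."""
    return sum(c.cost_usd for c in calls) > daily_cap_usd

# Hypothetical calls for one day.
calls = [
    Call("rag", "rag-qa", 0.25),
    Call("rag", "rag-qa", 0.25),
    Call("summarize", "summarizer", 0.5),
]
print(rollup(calls))
print(over_budget(calls, daily_cap_usd=0.75))
```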

#Trace capture + feedback

-A user gives a bad answer a thumbs-down, and the feedback vanishes into a log file.
+Opt-in, sampled request/response capture. Ratings ingest from your app via API.
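"Opt-in, sampled" can be read as: nothing is stored unless capture is enabled, and then only a fraction of requests is kept. A sketch under that reading follows. TraceSampler and its fields are hypothetical, not the library's interface.

```python
import random
from typing import Optional

class TraceSampler:
    """Keep roughly `rate` of request/response pairs; disabled by default."""

    def __init__(self, rate: float, enabled: bool = False, seed: Optional[int] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.enabled = enabled
        self._rng = random.Random(seed)
        self.traces: list[dict] = []

    def capture(self, request: str, response: str) -> bool:
        """Store the pair if sampling selects it; return whether it was stored."""
        if not self.enabled or self._rng.random() >= self.rate:
            return False
        # The rating stays empty until the app reports user feedback.
        self.traces.append({"request": request, "response": response, "rating": None})
        return True

sampler = TraceSampler(rate=0.1, enabled=True, seed=0)
kept = sum(sampler.capture(f"q{i}", f"a{i}") for i in range(1000))
print(f"kept {kept} of 1000 traces")
```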

#CI + GitHub Action

-Reviewer eyeballs a prompt diff and hopes for the best.
+One-line GitHub Action fails the build and posts the score delta.
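The core of a CI gate is a comparison against a baseline plus a non-zero exit code. The sketch below uses the 0.913 and 0.921 scores from the example run earlier on this page. The gate function and the 0.02 tolerance are assumptions, not the Action's actual logic.

```python
def gate(score: float, baseline: float, max_drop: float = 0.02) -> tuple[int, str]:
    """Return (exit_code, summary); the code is non-zero when the drop exceeds max_drop."""
    delta = score - baseline
    code = 1 if delta < -max_drop else 0
    verdict = "FAIL" if code else "PASS"
    return code, f"{verdict}  score: {score:.3f}  baseline: {baseline:.3f} ({delta:+.3f})"

code, summary = gate(0.913, 0.921)
print(summary)
# In a CI script, finish with: raise SystemExit(code)
```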
~/flow

Four steps. Plain-text prompt in, regression tests out.

No server, no account, no telemetry. Just a CLI and a SQLite file.

localhost:8420 · promptry dashboard
local
Workspace: Overview · Suites · Runs · Prompts · Cost · Models · Traces · Playground
Recent: rag-regression · summarizer-v2 · safety-suite
prompts · 47 versions tracked
+ 3 this week
name | ver | hash | tag | age
rag-qa | v13 | a3f9c2e | prod | 2h
rag-qa | v12 | 8b2f14a | · | 3d
summarizer | v5 | d91a0b7 | prod | 5d
safety-refusal | v8 | 4c1e52a | · | 1w
classify-intent | v2 | 7890fed | staging | 1w
code-review | v3 | 2345abc | · | 2w
grounding-check | v7 | 6f1a09e | prod | 2w
suites · 5 registered
+ run all
rag-regression ● passing
12 tests · semantic + grounding + schema · score 0.913
summarizer-quality ● passing
8 tests · semantic + judge · score 0.872
safety-refusal ▲ drift
6 tests · refusal + not-contains · score 0.844 −0.04σ
classify-intent ● passing
4 tests · matches + schema · score 0.955
grounding-check ● passing
9 tests · grounded + contains · score 0.891
runs · last 20 · main branch
→ GitHub Action
time | suite | score | verdict
10:42 | rag-regression | 0.913 | ✓ PASS
09:15 | summarizer-quality | 0.872 | ✓ PASS
08:03 | safety-refusal | 0.844 | ▲ WARN
07:14 | classify-intent | 0.955 | ✓ PASS
06:30 | rag-regression | 0.905 | ✓ PASS
05:45 | summarizer-quality | 0.870 | ✓ PASS
04:02 | grounding-check | 0.891 | ✓ PASS
yesterday | safety-refusal | 0.762 | ✗ FAIL
rag-regression · last 14 days
● baseline: prod-2026-04-10
Score: 0.913 (+0.02)
Runs: 1,284
Pass rate: 96.4% (-0.8)
Cost / 1k: $0.21
Root cause: rag-qa v12 → v13 · refusal template shortened
Drift window: 7 day · safety-refusal -0.04σ
~/landscape

Three things you won't get anywhere else.

Code instead of YAML. Local instead of cloud. Deterministic assertions instead of a metered LLM-judge tax. Together, in one tool.

Dimension | promptry | Promptfoo | RAGAS | LangSmith | DeepEval
Authoring surface | Python decorators | YAML | Python metrics | SaaS UI | Python
Data location | Local SQLite | Local | Local | Their cloud | Local + push
Account required | No, ever | No | No | Yes | No (OSS)
CI cost per run | $0 deterministic | Mixed | Per-judge-call | Trace volume | Per-judge-call
Prompt versioning | Automatic content-hash | Manual + git | · | Prompt Hub | ·
Live prompt editing | Dashboard, no redeploy | · | · | Prompt Hub (cloud) | ·
Drift detection | Mann-Whitney U + p-values | · | · | Dashboards only | ·
Cost budgets + alerts | Daily/monthly caps | · | · | Usage charts only | ·
MCP server | Native | Plugin | · | · | Partial
Commercial tier | None planned | Enterprise | None | SaaS | Confident AI
~/start

Local SQLite. No Docker. No SaaS. No account.

Install, wrap one prompt, write one assertion. Promptry takes it from there. Deterministic assertions by default — LLM-as-judge is opt-in, for when you actually want to spend the tokens.

60-second quickstart
bash
$ pip install promptry
Successfully installed promptry-1.0.0
 
$ promptry init
created promptry.toml · evals.py · SQLite ready
 
$ promptry run smoke-test --module evals
PASS test_quality (142ms)
PASS test_safety  (218ms)
 
$ promptry dashboard
open http://localhost:8420