Catch prompt regressions before your users do.

Python-native eval suites, automatic prompt versioning, statistical drift detection. One SQLite file on your machine — no YAML, no account, no cloud, no per-run judge tax.

py $ pip install promptry
js $ npm install promptry-js
MIT license / SQLite storage / Zero telemetry
~/rag-service · promptry
from promptry import track, suite, assert_semantic
 
prompt = track("You are a helpful assistant...", "rag-qa")
response = llm.chat(system=prompt, ...)
 
# regression suite
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")
$ promptry run rag-regression --compare prod
 
PASS test_quality (142ms)
   semantic(0.891) ok
PASS test_grounding (98ms)
   grounding(0.942) ok
WARN test_safety (218ms)
   refusal drifted -0.04 vs baseline
 
────────────────────────────────
Overall: PASS  score: 0.913  prod: 0.921 (-0.008)
 
→ open dashboard http://localhost:8420
$ promptry prompt diff rag-qa 12 13
 
rag-qa v12 → v13
- You are a helpful assistant. Answer concisely.
+ You are a helpful assistant. Answer in 2-3 sentences,
+ citing sources with [n] markers where relevant.
 
impact across last 500 runs:
  groundedness ↑ 0.847 → 0.921
  avg tokens   ↑ 142 → 188
  cost / 1k    ↑ $0.21 → $0.28
  refusal rate ↓ 2.1% → 1.4%
~/capabilities

Everything you need to keep a prompt honest.

Eight primitives that cover the lifecycle, from the moment a prompt lands in source control to the moment it misbehaves in production.

#Prompt versioning

-Tweak the prompt, ship it, forget which version is live.
+Every track() call is content-hashed. Diff any two versions.

#Eval suites

-Ad-hoc asserts scattered across notebooks and PR comments.
+Eight assertion types (semantic, schema, judge, tool-use, and more), all in one @suite.

#Regression detection

-Prompt works. Six weeks later, nobody knows what broke it.
+Compare any run to a baseline. Root-cause hints flag the exact delta.

#Drift detection

-Scores slowly crawl down. You notice in Q3.
+Rolling-window regression + Mann-Whitney U catches slow degradation (see the sketch below).

#Safety templates

-"We should probably red-team this." Said in Slack, last quarter.
+25 curated attacks across 6 categories. Extensible via templates.toml.

#Cost tracking

-Bill spikes. Which prompt? Which model? Nobody knows.
+Per-prompt, per-model, per-suite accounting. Cache-aware everywhere.

#CI + GitHub Action

-Reviewer eyeballs a prompt diff and hopes for the best.
+One-line GitHub Action fails the build and posts the score delta.

#MCP server

-Agent has no idea if its own prompt is any good.
+MCP endpoint exposes suites, diffs, and scores. Agent debugs itself.
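
The drift-detection card above names the underlying statistic. Here is a minimal sketch of a rolling-window drift check built on scipy's Mann-Whitney U test; it illustrates the technique only, not promptry's internal implementation, and the window sizes and significance threshold are illustrative assumptions rather than promptry defaults.

python
# Illustrative sketch: rolling-window drift detection with a Mann-Whitney U test.
# Not promptry's internals; baseline_n, window_n, and alpha are assumed values.
from scipy.stats import mannwhitneyu

def scores_have_drifted(scores, baseline_n=200, window_n=50, alpha=0.05):
    """Compare the latest window of eval scores against an earlier baseline
    window; return True if recent scores are stochastically lower."""
    if len(scores) < baseline_n + window_n:
        return False  # not enough history to compare yet
    baseline = scores[:baseline_n]
    recent = scores[-window_n:]
    # One-sided test: has the recent distribution shifted downward?
    _, p_value = mannwhitneyu(recent, baseline, alternative="less")
    return p_value < alpha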
~/flow

Four steps from plain-text prompt to a regression-proof pipeline.

No server to run. No account to create. No vendor telemetry to audit.

localhost:8420 · promptry dashboard
Workspace (local): Overview · Suites · Runs · Prompts · Models · Cost
Recent: rag-regression · summarizer-v2 · safety-suite
rag-regression · last 14 days
● baseline: prod-2026-04-10
Score: 0.913 (+0.02)
Runs: 1,284
Pass rate: 96.4% (-0.8)
Cost / 1k: $0.21
Root cause: rag-qa v12 → v13 · refusal template shortened
Drift window: 7-day · safety-refusal -0.04σ
~/landscape

Three things you won't get anywhere else.

Code instead of YAML. Local instead of cloud. Deterministic assertions instead of a metered LLM-judge tax. Together, in one tool.

| Dimension | Promptry | Promptfoo | RAGAS | LangSmith | DeepEval |
| --- | --- | --- | --- | --- | --- |
| Authoring surface | Python decorators | YAML | Python metrics | SaaS UI | Python |
| Data location | Local SQLite | Local | Local | Their cloud | Local + push |
| Account required | No, ever | No | No | Yes | No (OSS) |
| CI cost per run | $0 (deterministic) | Mixed | Per-judge-call | Trace volume | Per-judge-call |
| Prompt versioning | Automatic content-hash | Manual + git | · | Prompt Hub | · |
| Drift detection | Mann-Whitney U + p-values | · | · | Dashboards only | · |
| MCP server | Native | Plugin | · | · | Partial |
| Commercial tier | None planned | Enterprise | None | SaaS | Confident AI |
~/start

Local SQLite. No Docker. No SaaS. No account.

Install, wrap one prompt, write one assertion. Promptry takes it from there. Deterministic assertions by default — LLM-as-judge is opt-in, for when you actually want to spend the tokens.

60-second quickstart
bash
$ pip install promptry
Successfully installed promptry-0.7.0
 
$ promptry init
created promptry.toml · evals.py · SQLite ready
 
$ promptry run smoke-test --module evals
PASS test_quality (142ms)
PASS test_safety  (218ms)
 
$ promptry dashboard
open http://localhost:8420
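 
For orientation, a minimal sketch of what the scaffolded evals.py could contain, using only the API shown earlier on this page (track, @suite, assert_semantic). The prompt name, suite name, and canned pipeline are illustrative stand-ins, not what promptry init actually generates.

python
# Hypothetical evals.py sketch; swap the canned pipeline for your own model call.
from promptry import track, suite, assert_semantic

# track() content-hashes and versions the prompt on every call
SYSTEM = track("You are a helpful assistant. Answer concisely.", "smoke-qa")

def my_pipeline(question: str) -> str:
    # Stand-in for your LLM call, e.g. llm.chat(system=SYSTEM, user=question)
    return "Photosynthesis converts light into chemical energy in plants."

@suite("smoke-test")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")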