~/promptry · released April 2026 · v0.7.0

Catch prompt regressions before your users do.

Python-native eval suites, automatic prompt versioning, statistical drift detection. One SQLite file on your machine — no YAML, no account, no cloud, no per-run judge tax.

py $ pip install promptry

js $ npm install promptry-js

MIT license / SQLite storage / Zero telemetry

~/rag-service · promptry

from promptry import track, suite, assert_semantic

# llm and my_pipeline are placeholders for your own client and pipeline
prompt = track("You are a helpful assistant...", "rag-qa")
response = llm.chat(system=prompt, ...)

# regression suite
@suite("rag-regression")
def test_quality():
    response = my_pipeline("What is photosynthesis?")
    assert_semantic(response, "Converts light into chemical energy")

$ promptry run rag-regression --compare prod

PASS test_quality (142ms)

└ semantic(0.891) ok

PASS test_grounding (98ms)

└ grounding(0.942) ok

WARN test_safety (218ms)

└ refusal drifted -0.04 vs baseline

────────────────────────────────

Overall: PASS score: 0.913 prod: 0.921 (-0.008)

→ open dashboard http://localhost:8420

$ promptry prompt diff rag-qa 12 13

rag-qa v12 → v13

- You are a helpful assistant. Answer concisely.

+ You are a helpful assistant. Answer in 2-3 sentences,

+ citing sources with [n] markers where relevant.

impact across last 500 runs:

groundedness ↑ 0.847 → 0.921

avg tokens ↑ 142 → 188

cost / 1k ↑ $0.21 → $0.28

refusal rate ↓ 2.1% → 1.4%

~/capabilities

Everything you need to keep a prompt honest.

Eight primitives that cover the lifecycle, from the moment a prompt lands in source control to the moment it misbehaves in production.

#Prompt versioning

-Tweak the prompt, ship it, forget which version is live.

+Every track() call is content-hashed. Diff any two versions.
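
To make the idea of content-hashed versions concrete, here is a minimal sketch using only the Python standard library. It is not promptry's implementation: the `PromptRegistry` class, its in-memory storage, and the choice of SHA-256 are assumptions made for illustration. It records a new version only when the prompt text differs from the latest one, and diffs two versions by number.

```python
import difflib
import hashlib


class PromptRegistry:
    """Toy in-memory registry (hypothetical, for illustration only)."""

    def __init__(self):
        # name -> list of (digest, text), oldest first
        self._versions = {}

    def track(self, text: str, name: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        history = self._versions.setdefault(name, [])
        # Record a new version only when the text differs from the latest one.
        if not history or history[-1][0] != digest:
            history.append((digest, text))
        return text

    def version_count(self, name: str) -> int:
        return len(self._versions.get(name, []))

    def diff(self, name: str, old: int, new: int) -> list[str]:
        """Unified diff between two 1-based version numbers."""
        history = self._versions[name]
        a = history[old - 1][1].splitlines()
        b = history[new - 1][1].splitlines()
        return list(
            difflib.unified_diff(a, b, f"{name} v{old}", f"{name} v{new}", lineterm="")
        )


registry = PromptRegistry()
registry.track("You are a helpful assistant. Answer concisely.", "rag-qa")
registry.track("You are a helpful assistant. Answer concisely.", "rag-qa")  # unchanged
registry.track("You are a helpful assistant. Answer in 2-3 sentences.", "rag-qa")

print(registry.version_count("rag-qa"))
print("\n".join(registry.diff("rag-qa", 1, 2)))
```

Hashing the content rather than bumping a counter by hand means an unchanged prompt never creates a spurious version, which is what makes "which version is live" answerable after the fact.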

#Eval suites

-Ad-hoc asserts scattered across notebooks and PR comments.

+Eight assertion types, including semantic, schema, judge, and tool-use, all in one @suite.

#Regression detection

-Prompt works. Six weeks later, nobody knows what broke it.

+Compare any run to a baseline. Root-cause hints flag the exact delta.
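
A baseline comparison can be as simple as a per-metric delta with a tolerance. The sketch below is an assumption about the general shape of such a check, not promptry's logic: `compare_to_baseline`, the `tolerance` parameter, and the baseline numbers are all hypothetical.

```python
def compare_to_baseline(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,
) -> list[tuple[str, float]]:
    """Return (metric, delta) pairs whose score dropped by more than `tolerance`."""
    regressions = []
    for metric, base_score in baseline.items():
        if metric not in current:
            continue  # metric was not measured in this run
        delta = round(current[metric] - base_score, 6)
        if delta < -tolerance:
            regressions.append((metric, delta))
    # Most negative delta (largest drop) first
    return sorted(regressions, key=lambda pair: pair[1])


# Illustrative, made-up scores
baseline = {"semantic": 0.90, "grounding": 0.94, "refusal": 0.95}
current = {"semantic": 0.891, "grounding": 0.942, "refusal": 0.91}

print(compare_to_baseline(current, baseline))
```

Here only `refusal` is flagged: its drop exceeds the tolerance, while the small move in `semantic` stays inside it.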

#Drift detection

-Scores slowly crawl down. You notice in Q3.

+Rolling-window regression + Mann-Whitney U catches slow degradation.
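
For readers unfamiliar with the Mann-Whitney U test, the following is a minimal standard-library sketch of how two score windows might be compared. It is an illustration under stated assumptions, not promptry's code: the function name, the window contents, and the use of a normal approximation without tie or continuity correction are all choices made here for brevity.

```python
import math


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, p-value). Ties get average ranks; the
    variance is not tie-corrected and no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    combined = sorted((value, idx) for idx, value in enumerate(x + y))
    ranks = [0.0] * (n1 + n2)
    pos = 0
    while pos < len(combined):
        end = pos
        # Extend over a run of tied values
        while end + 1 < len(combined) and combined[end + 1][0] == combined[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of the 1-based ranks in the run
        for k in range(pos, end + 1):
            ranks[combined[k][1]] = avg_rank
        pos = end + 1
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value


# Illustrative, made-up windows: older scores vs. the most recent ones
baseline_window = [0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.95, 0.92]
recent_window = [0.88, 0.89, 0.90, 0.87, 0.89, 0.88, 0.90, 0.86]

u, p = mann_whitney_u(recent_window, baseline_window)
print(f"U={u}, p={p:.4f}")
```

A rank-based test suits eval scores because it makes no normality assumption, so a slow downward shift can be flagged even when the scores are bounded and skewed.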

#Safety templates

-"We should probably red-team this." Said in Slack, last quarter.

+25 curated attacks across 6 categories. Extensible via templates.toml.

#Cost tracking

-Bill spikes. Which prompt? Which model? Nobody knows.

+Per-prompt, per-model, per-suite accounting. Cache-aware everywhere.

#CI + GitHub Action

-Reviewer eyeballs a prompt diff and hopes for the best.

+One-line GitHub Action fails the build and posts the score delta.

#MCP server

-Agent has no idea if its own prompt is any good.

+MCP endpoint exposes suites, diffs, and scores. Agent debugs itself.

~/flow

Four steps from a plain-text prompt to a regression-proof pipeline.

No server to run. No account to create. No vendor telemetry to audit.

localhost:8420 · promptry dashboard (local)

Workspace: Overview, Suites, Runs, Prompts, Models, Cost

Recent: rag-regression, summarizer-v2, safety-suite

Score: 0.913 (+0.02)

Runs: 1,284

Pass rate: 96.4% (-0.8)

Cost / 1k: $0.21

Root cause: rag-qa v12 → v13 · refusal template shortened

Drift window: 7 day · safety-refusal -0.04σ

~/landscape

Three things you won't get anywhere else.

Code instead of YAML. Local instead of cloud. Deterministic assertions instead of a metered LLM-judge tax. Together, in one tool.

Dimension	promptry	Promptfoo	RAGAS	LangSmith	DeepEval
Authoring surface	Python decorators	YAML	Python metrics	SaaS UI	Python
Data location	Local SQLite	Local	Local	Their cloud	Local + push
Account required	No, never	No	No	Yes	No (OSS)
CI cost per run	$0 deterministic	Mixed	Per-judge-call	Trace volume	Per-judge-call
Prompt versioning	Automatic content-hash	Manual + git	·	Prompt Hub	·
Drift detection	Mann-Whitney U + p-values	·	·	Dashboards only	·
MCP server	Native	Plugin	·	·	Partial
Commercial tier	None planned	Enterprise	None	SaaS	Confident AI

~/start

Local SQLite. No Docker. No SaaS. No account.

Install, wrap one prompt, write one assertion. Promptry takes it from there. Deterministic assertions by default — LLM-as-judge is opt-in, for when you actually want to spend the tokens.

60-second quickstart

bash

$ pip install promptry

Successfully installed promptry-0.7.0

$ promptry init

created promptry.toml · evals.py · SQLite ready

$ promptry run smoke-test --module evals

PASS test_quality (142ms)

PASS test_safety (218ms)

$ promptry dashboard

→ open http://localhost:8420