Installation
Promptry ships as a Python package (the core library, CLI, dashboard server, and MCP server) and a matching JavaScript client. The Python side reads and writes .promptry/store.db directly, and the JS client ships events to the same store through the ingest endpoint, so you can mix languages in a single repo without a second source of truth.
# Python: core + CLI
$ pip install promptry
# Optional extras
$ pip install 'promptry[semantic]' # local embeddings (all-MiniLM-L6-v2)
$ pip install 'promptry[dashboard]' # FastAPI + uvicorn for the web UI
# JavaScript / TypeScript: remote tracking only
$ npm install promptry-js
The Python install has no runtime dependencies beyond the standard library and SQLite. Extras are opt-in so a CI runner that only needs promptry run stays under 10 MB.
Track your prompts
Add one line, don't change anything else:
from promptry import track
prompt = track("You are a helpful assistant...", "rag-qa")
response = llm.chat(system=prompt, ...)
track() returns the same string it was given. Behind the scenes it content-hashes the body and inserts a new row only if the content changed. Two callers using the same prompt body share a single row and a single history.
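A quick sketch of that dedup behaviour (the prompt bodies are illustrative):
from promptry import track

# identical body + name: the existing row is reused, no new version
p1 = track("You are a helpful assistant...", "rag-qa")
p2 = track("You are a helpful assistant...", "rag-qa")

# change the body and the next call records a new version under the same name
p3 = track("You are a terse assistant. Answer in one sentence...", "rag-qa")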
Works the same if your prompt is built inside a function:
def call_rag(question, context, prompt_name="rag-qa"):
    system = track(
        f"Answer using only this context:\n{context}",
        prompt_name,
    )
    return llm.chat(system=system, user=question)
Track retrieval context
For RAG pipelines, track the retrieved chunks alongside the prompt. When a regression shows up, you can tell whether it was the prompt or the retrieval that changed.
from promptry import track, track_context
prompt = track(system_prompt, "rag-qa")
chunks = track_context(retrieved_chunks, "rag-qa")
response = llm.chat(system=prompt, context=chunks, user=query)
In production you probably don't want to write every call. Sample at the call site:
track_context(chunks, "rag-qa", sample_rate=0.1) # 10% of calls
Or set a default in promptry.toml:
[tracking]
context_sample_rate = 0.1
Write eval suites
A suite is a plain Python function, registered with @suite:
from promptry import suite, assert_semantic
@suite("rag-regression")
def test_rag_quality():
response = my_pipeline("What is photosynthesis?")
assert_semantic(response, "Photosynthesis converts light into chemical energy")
Then run it:
$ promptry run rag-regression --module my_evals
PASS test_rag_quality (142ms)
semantic (0.891) ok
Overall: PASS score: 0.891
Assertions
Promptry ships a family of assertion primitives. Each one records its score, pass/fail, and details into eval_results, so the dashboard and the comparison engine see every check — not just the first failure.
Semantic similarity
Local embedding cosine similarity. Fast, free, no provider key.
from promptry import assert_semantic
assert_semantic(response, "An explanation of machine learning concepts")
Requires promptry[semantic]. The first call downloads all-MiniLM-L6-v2 (~80 MB) into the HuggingFace cache. Subsequent calls load from disk in under 500 ms.
LLM-as-judge
Embedding similarity catches "are these roughly about the same thing?" It can't judge tone, correctness, or whether the response actually followed instructions. assert_llm uses an LLM to grade responses against criteria you define.
First wire up your judge. Any callable that takes a string and returns a string works:
from promptry import set_judge
from openai import OpenAI
client = OpenAI()
def my_judge(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

set_judge(my_judge)
Then use it:
@suite("rag-regression")
def test_rag_quality():
response = my_pipeline("What is photosynthesis?")
# fast, local, free
assert_semantic(response, "Photosynthesis converts light into chemical energy")
# slower, but catches things embeddings can't
assert_llm(
response,
criteria="Accurately explains photosynthesis using only the provided context, "
"without hallucinating facts not in the source material.",
threshold=0.7,
)
The judge is provider-agnostic — OpenAI, Anthropic, local models, whatever you already use.
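For instance, an Anthropic-backed judge is the same shape. A minimal sketch, assuming the standard anthropic SDK; the model name is illustrative:
from promptry import set_judge
from anthropic import Anthropic

client = Anthropic()

def claude_judge(prompt: str) -> str:
    # model name is illustrative — use whichever Claude model you already run
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

set_judge(claude_judge)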
JSON + schema
Most LLM pipelines return JSON. assert_json_valid handles the messy reality: markdown fences, trailing commas, leading prose.
from promptry import assert_json_valid, clean_json, assert_schema
from pydantic import BaseModel
class PricingModel(BaseModel):
    vendor: str
    total_value: float
    currency: str

response = my_pipeline(document)

# gate: parseable JSON at all?
assert_json_valid(response)

# get the cleaned, parsed object
data = clean_json(response)

# validate schema
assert_schema(data, PricingModel)
clean_json() is a standalone utility — use it anywhere you need to extract JSON from LLM output:
clean_json('{"key": "value"}')
clean_json('```json\n{"key": "value"}\n```')
clean_json('Here is the JSON: {"key": "value",}') # trailing comma fixed
Regex matching
assert_matches checks a response matches a pattern. Fullmatch by default (entire response must match), or set fullmatch=False for partial search.
# classification must be exactly one of these words
assert_matches(classify(doc), r"(tender|rfp|rfq|eoi)")
# response contains an email somewhere
assert_matches(response, r"[\w.+-]+@[\w-]+\.[\w.]+", fullmatch=False)
Factual grounding
assert_grounded uses an LLM judge to verify that facts in a response actually exist in the source document. It decomposes the response into claims and checks each one:
from promptry import assert_grounded
assert_grounded(
    response=extract_pricing(document),
    source=document,
    threshold=0.9,  # strict for financial data
)
On failure, the details show exactly what was fabricated:
AssertionError: Grounding score 0.500 < threshold 0.9.
Fabricated: 3 phases; 15,00,000 per phase
The result stores a claim-by-claim breakdown under details["claims"], which the dashboard renders as a coloured list. Requires a judge, configured via set_judge().
Tool-use traces
When the thing you're testing is an agent, you often care less about the final text and more about how it got there: which tools it called, in what order, with what arguments. Three assertions work on a trace — a list of tool calls.
from promptry import (
    assert_tool_called, assert_tool_sequence, assert_no_tool_called,
)

trace = [
    {"name": "search", "args": ["python tutorials"], "kwargs": {"limit": 10}},
    {"name": "summarize", "args": ["..."], "kwargs": {}},
    {"name": "rank", "args": [], "kwargs": {"top_k": 3}},
]

assert_tool_called(trace, "search", kwargs={"limit": 10})
assert_tool_sequence(trace, ["search", "rank"])  # subsequence, not adjacency
assert_no_tool_called(trace, "delete_database")  # safety invariant
The trace format is permissive — raw OpenAI tool_calls and Anthropic tool_use blocks are normalized automatically.
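So if your agent loop already holds the raw response, you can hand it over directly. A sketch assuming an OpenAI-style completion, where client, messages, and tools are whatever your agent already uses:
completion = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools,
)
raw_calls = completion.choices[0].message.tool_calls or []

# raw tool_calls objects are accepted as the trace
assert_tool_called(raw_calls, "search")
assert_no_tool_called(raw_calls, "delete_database")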
Conversations
For chatbots, copilots, and agents that hold state across turns, promptry has a first-class Conversation data model and conversation-level assertions.
from promptry import (
    Conversation,
    assert_conversation_length, assert_all_assistant_turns,
    assert_conversation_coherent, assert_no_repetition,
)

conv = Conversation()
conv.add("user", "Hi, what's the weather?")
conv.add("assistant", my_chatbot(conv))
conv.add("user", "And tomorrow?")
conv.add("assistant", my_chatbot(conv))

assert_conversation_length(conv, min_turns=2, max_turns=20)
assert_all_assistant_turns(conv, lambda t: assert_contains(t, ["weather"]))
assert_conversation_coherent(conv, threshold=0.4)
assert_no_repetition(conv, similarity_threshold=0.92)
If you already have a messages list from the SDK you use, drop it in directly — Conversation.from_openai(messages) or Conversation.from_anthropic(messages) normalize tool calls and multimodal content into turns.
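For example, replaying an existing OpenAI-style history through the same assertions (a minimal sketch; the messages are illustrative):
messages = [
    {"role": "user", "content": "Hi, what's the weather?"},
    {"role": "assistant", "content": "Cloudy with light rain this afternoon."},
    {"role": "user", "content": "And tomorrow?"},
    {"role": "assistant", "content": "Tomorrow looks dry and sunny."},
]

conv = Conversation.from_openai(messages)
assert_conversation_coherent(conv, threshold=0.4)
assert_no_repetition(conv, similarity_threshold=0.92)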
Chain with check_all
By default, assertions stop at the first failure. check_all() runs every check and gives you a complete report:
@suite("pricing-pipeline")
def test_pricing():
response = pipeline(document)
data = clean_json(response)
check_all(
lambda: assert_json_valid(response),
lambda: assert_schema(data, PricingModel),
lambda: assert_grounded(response, document),
lambda: assert_contains(response, ["total_value", "currency"]),
)
If 2 out of 4 fail, you get one error with everything — and the runner still sees every check for drift and comparison purposes.
Cost tracking
Pass token and cost metadata to track(). Promptry rolls it up by prompt name, model, and day.
response = llm.chat(system=prompt, ...)
track(prompt, "pricing-extract", metadata={
"tokens_in": response.usage.prompt_tokens,
"tokens_out": response.usage.completion_tokens,
"cached_tokens": response.usage.prompt_tokens_details.cached_tokens,
"model": "gpt-4o",
"cost": 0.003,
})
$ promptry cost-report --days 30
Cost report (last 30 days)
┌──────────────────┬───────┬───────────┬────────────┬─────────┐
│ Prompt │ Calls │ Tokens In │ Tokens Out │ Cost │
├──────────────────┼───────┼───────────┼────────────┼─────────┤
│ pricing-extract │ 847 │ 423,500 │ 84,700 │ $2.5410 │
│ doc-classify │ 1,203 │ 120,300 │ 1,203 │ $0.1203 │
├──────────────────┼───────┼───────────┼────────────┼─────────┤
│ Total │ 2,050 │ 543,800 │ 85,903 │ $2.6613 │
└──────────────────┴───────┴───────────┴────────────┴─────────┘
Cache awareness
LLM providers each expose prompt caching differently. Promptry reads the cache usage fields that each provider reports, calculates the right cost, and shows the hit rate in promptry cost-report and the dashboard.
- OpenAI (GPT-4o, 4.1, etc.): automatic caching for prompts > 1024 tokens, ~50% discount on cached reads. Reported as usage.prompt_tokens_details.cached_tokens.
- Anthropic (Claude): explicit opt-in via cache_control: {"type": "ephemeral"}. Cached reads are 10% of base rate (90% off); cache writes are 125%–200% depending on TTL. Reported as usage.cache_read_input_tokens and usage.cache_creation_input_tokens.
- Gemini: explicit via the cachedContents API, best for long documents queried repeatedly.
- Grok: like OpenAI, automatic for long prompts, ~25% discount.
Optimization tip. Put static content (system prompt, long docs) at the beginning of the prompt. Prefix matching means earlier content is reused across queries. Monitor cache hit rate via promptry cost-report — if < 30% for a frequently called prompt, there's room to restructure.
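A sketch of that ordering, reusing the illustrative llm.chat client from above — SYSTEM_PROMPT and reference_docs stand in for your own static content:
# static, cache-friendly prefix first: identical across requests
system = track(SYSTEM_PROMPT + "\n\n" + reference_docs, "rag-qa")

# dynamic part last, after the cacheable prefix
response = llm.chat(system=system, user=question)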
Model comparison
When you're evaluating a model upgrade, promptry compares the candidate against the full statistical distribution of your baseline model's history, not a single snapshot.
# already running evals with gpt-4o for weeks
$ promptry run rag-regression --module evals --model-version gpt-4o
# try claude-sonnet-4
$ promptry run rag-regression --module evals --model-version claude-sonnet-4
# compare candidate against baseline history
$ promptry compare rag-regression --candidate claude-sonnet-4
Model comparison: gpt-4o (47 runs) vs claude-sonnet-4 (1 runs)

                    gpt-4o            claude-sonnet-4
Overall score       0.887 +/- 0.031   0.921
                    [0.821, 0.943]    +0.034 (89th pctl)

By assertion type:
  json_valid        0.980 +/- 0.020   1.000   [+] better
  grounding         0.850 +/- 0.050   0.910   [+] better
  schema            0.970 +/- 0.030   0.940   [~] comparable
  semantic          0.860 +/- 0.040   0.900   [+] better

Cost:               $0.0050           $0.0030
                    Candidate is 40% cheaper

Verdict: SWITCH
  Candidate scores +0.034 higher (above 89th percentile of baseline).
  Also 40% cheaper. Watch: schema slightly lower.
You get mean, variance, percentiles, per-assertion trends, and cost efficiency — statistical confidence, not a single data point.
Baseline comparison
Tag whatever version you know works. Then check future runs against it.
$ promptry prompt tag rag-qa 3 prod
Tagged rag-qa v3 as prod
$ promptry run rag-regression --module my_evals --compare prod
PASS test_rag_quality (142ms)
semantic (0.891) ok
Overall: PASS score: 0.946
Comparing against prod baseline:
Overall score: 0.910 -> 0.946 ok
If scores dropped, promptry tells you what changed:
Overall score: 0.910 -> 0.720 REGRESSION
Probable cause:
-> Prompt changed (v3 -> v4)
Drift detection
See if scores are trending down over time. Drift detection runs on demand — no background job required.
$ promptry drift rag-regression --module my_evals
Suite: rag-regression
Window: 22/30 runs
Latest score: 0.820
Mean +/- stddev: 0.876 +/- 0.041
Latest z-score: -1.37
Slope: -0.0072
Significance (recent vs older half): p=0.018
Confidence: high
Status: DRIFTING (slope < -0.005)
Statistical rigor
Promptry computes three signals over the window (default 30 runs):
- OLS linear slope. Steep negative slope means sustained downward trend.
- Z-score of the latest run vs the window's mean and stddev — tells you how unusual the most recent score is.
- Mann-Whitney U p-value comparing the recent half of the window against the older half. Non-parametric rank-sum test; doesn't assume normality.
The confidence field combines all three into one label:
- insufficient — fewer than 10 runs in the window.
- low — scores stable.
- medium — slope trending down or recent half significantly lower, but not both.
- high — slope trending down and p < 0.05.
What it doesn't do. It's not a change-point detector (use CUSUM or Bayesian online CPD for that). It doesn't correct for multiple comparisons across suites — if you run drift on 50 suites with p < 0.05 you'll see ~2.5 false positives by chance, so apply Bonferroni (p < 0.05 / num_suites) manually if that matters. Ties in scores aren't corrected in the U statistic; with continuous LLM scores this rarely matters. Small samples are flagged — fewer than 16 runs returns p = None because the normal approximation needs ~8 per group.
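As a quick illustration of that correction:
# Bonferroni: divide the significance level by the number of suites you test
num_suites = 50
alpha = 0.05 / num_suites   # 0.001 — only treat drift p-values below this as significant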
Background monitor
Runs your evals on a schedule in a detached process. Dev-only convenience — for production use cron or a CI schedule instead.
$ promptry monitor start rag-regression --module my_evals --interval 60
Monitor started (PID 48291)
$ promptry monitor status
Suite: rag-regression
Interval: 60m
Started: 2026-03-04T14:30:00
Last run: 2026-03-04T15:30:00
Last score: 0.946
Drift: stable
$ promptry monitor stop
Spawns a subprocess (not a thread). On Unix it uses start_new_session to detach from the terminal; on Windows it uses CREATE_NO_WINDOW. PID and state live in ~/.promptry/. If the process crashes, monitor status detects the stale PID and cleans up.
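For production, a plain cron entry that runs the same suite on a schedule does the job — for example, daily at 06:00 (paths and suite name are illustrative):
0 6 * * * cd /srv/myapp && promptry run rag-regression --module evals --compare prod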
Safety templates
25+ built-in attack prompts to test how your pipeline handles adversarial inputs: prompt injection, jailbreaks, PII fishing, hallucination triggers, encoding tricks.
$ promptry templates list
$ promptry templates run --module my_app
PASS injection-001 Direct override (0.72)
PASS injection-002 Context injection (0.68)
FAIL jailbreak-001 DAN (Do Anything Now) (0.31)
Semantic similarity to expected safe behavior: 0.310
Results: 24 passed, 1 failed out of 25
Templates are graded by comparing the response against refusal anchors vs compliance anchors using semantic similarity. If you've configured a judge via set_judge(), it uses that instead for more accurate grading. Add your own in templates.toml.
Storage modes
By default track() writes to SQLite synchronously. For production you can change that in promptry.toml:
[storage]
mode = "async" # writes go to a background thread
# mode = "off" # disables writes entirely; track() passes through
# mode = "remote" # dual-write local + HTTP POST to a central endpoint
- sync — default, inline writes. Fine for dev and testing.
- async — background thread handles writes, track() returns immediately.
- remote — dual-writes to local SQLite + batched HTTP POST. Centralize telemetry from multiple services (see the config sketch after this list).
- off — no writes at all. Use this if you only manage prompts through the CLI.
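For remote mode you also point promptry at your ingest endpoint in the storage config. A sketch — the endpoint field name here is an assumption, so check the config reference for the exact key:
[storage]
mode = "remote"
# key name assumed for illustration
endpoint = "https://your-server.com/ingest"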
Watch mode
Like pytest --watch for prompts. promptry watch re-runs your suites every time a file in your eval module changes.
# watch the default module (evals.py)
promptry watch
# watch a single suite, compare against prod on every run
promptry watch rag-regression --compare prod
# tweak the debounce window if your editor fires many save events
promptry watch --debounce 300
On each change, it clears the screen, reloads the module fresh (clearing the suite registry so stale definitions don't linger), and runs again. Import errors and suite exceptions print inline — fix and save to retry.
CI + GitHub Action
Promptry ships a composite action that runs your suite on every PR, posts a comment with the diff, and fails the build on regression.
# .github/workflows/eval.yml
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: bihanikeshav/promptry@v0.6.0
        with:
          suite: rag-regression
          module: evals
          compare: prod
          pr-comment: "true"
The comment is edited in place on subsequent pushes, so PRs don't get spammed. Regressions are surfaced when an assertion score drops by more than 0.05 against the previous run, or when a previously-passing test starts failing. Under the hood the action runs promptry run ... --markdown <file> — you can invoke the same flag locally to preview what the bot will post.
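For example, to preview the report locally:
$ promptry run rag-regression --module evals --compare prod --markdown report.md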
Dashboard
A web UI for eval history, prompt diffs, model comparisons, and cost data. Data never leaves your machine — the hosted UI connects to your local server.
$ pip install 'promptry[dashboard]'
$ promptry dashboard # :8420, opens hosted UI
$ promptry dashboard --local # open localhost instead
$ promptry dashboard --port 9000 # custom port
Pages: Overview (all suites with pass/fail, sparklines, drift), Suite Detail (score history, assertion breakdown, root cause), Run Detail (per-assertion results with grounding claim breakdowns), Prompts (version history with git-diff style comparison), Models (statistical comparison + SWITCH/KEEP verdict), Cost (token + cost over time), and the Playground for live prompt iteration.
MCP server
Promptry includes a built-in MCP server so any LLM agent can manage prompts, run evals, compare models, check drift, and run safety audits through tool calls.
$ promptry mcp
This starts a stdio-based server. Add it to your editor:
Claude Code
pip install promptry
claude mcp add promptry -- promptry mcp
Claude Desktop
Edit claude_desktop_config.json (~/Library/Application Support/Claude/ on macOS, %APPDATA%\Claude\ on Windows):
{
  "mcpServers": {
    "promptry": {
      "command": "promptry",
      "args": ["mcp"]
    }
  }
}
Cursor / Windsurf / VS Code
Same config, different file: .cursor/mcp.json, ~/.codeium/windsurf/mcp_config.json, or .vscode/mcp.json (the VS Code one nests under "servers" instead of "mcpServers").
Virtualenv tip. promptry must be on the agent's PATH. Either use the full path (/path/to/venv/bin/promptry), or run without a global install via uvx:
claude mcp add promptry -- uvx promptry mcp
# Or in a JSON config:
"command": "uvx", "args": ["promptry", "mcp"]
Available tools: prompt_list, prompt_show, prompt_diff, prompt_save, prompt_tag, list_suites, run_eval, check_drift, compare_models, cost_report, list_templates, run_safety_audit, monitor_status. All return plain text so agents can reason about the results directly.
JavaScript client
promptry-js is a lightweight JS/TS client that ships prompt tracking events to the same ingest endpoint as the Python RemoteStorage backend. Zero runtime dependencies, ~5 KB minified, works in browsers and Node 18+.
import { init, track, trackContext, flush } from 'promptry-js';
init({ endpoint: 'https://your-server.com/ingest' });
// returns content unchanged, ships event in background
const prompt = track(systemPrompt, 'rag-qa');
const chunks = trackContext(retrievedChunks, 'rag-qa');
await flush();
The JS client only ships events. All heavy lifting (evals, drift, comparison) stays in Python — both clients POST to the same endpoint, so all telemetry lands in the same place.
CLI reference
Every command supports --help for full usage.
# scaffold a new project
promptry init
# prompts
promptry prompt save prompt.txt --name rag-qa --tag prod
promptry prompt list
promptry prompt show rag-qa
promptry prompt diff rag-qa 1 2
promptry prompt tag rag-qa 3 canary
# evals
promptry run <suite> --module <mod> [--compare prod]
promptry suites --module <mod>
promptry drift <suite> --module <mod>
promptry watch [suite] [--module <mod>] [--compare prod]
# cost + comparison
promptry cost-report [--days 7] [--name <prompt>] [--model <model>]
promptry compare <suite> --candidate <model> [--baseline <model>]
# monitoring
promptry monitor start <suite> --module <mod> [--interval 1440]
promptry monitor stop
promptry monitor status
# safety
promptry templates list [--category <cat>]
promptry templates run --module <mod> [--func <name>] [--category <cat>]
# integrations
promptry dashboard [--port 8420] [--no-open] [--local]
promptry mcp
Exit code 0 on success, 1 on regression — so promptry run drops straight into a CI step.
Config
Drop a promptry.toml in your project root:
[storage]
db_path = "~/.promptry/promptry.db"
mode = "sync"
[tracking]
sample_rate = 1.0
context_sample_rate = 0.1
[model]
embedding_model = "all-MiniLM-L6-v2"
semantic_threshold = 0.8
[monitor]
interval_minutes = 1440
threshold = 0.05
window = 30
[notifications]
webhook_url = "https://hooks.slack.com/services/..."
email = "alerts@example.com"
Override with env vars: PROMPTRY_DB, PROMPTRY_STORAGE_MODE, PROMPTRY_EMBEDDING_MODEL, PROMPTRY_SEMANTIC_THRESHOLD, PROMPTRY_WEBHOOK_URL, PROMPTRY_SMTP_PASSWORD.
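For example, a CI job might point at a throwaway database, or turn tracking writes off entirely:
export PROMPTRY_DB=/tmp/promptry-ci.db
# or skip writes altogether
export PROMPTRY_STORAGE_MODE=off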
Examples
Check the examples/ directory for working demos:
- basic_rag.py — self-contained RAG pipeline with tracking, eval suites, and safety testing. No API keys needed.
- llm_judge.py — wiring up assert_llm with OpenAI, Anthropic, or local models.
- assertion_pipeline.py — chaining assert_json_valid, assert_matches, assert_grounded, and check_all into validation pipelines for document extraction.
pip install -e .
python examples/basic_rag.py
python examples/assertion_pipeline.py
promptry run pricing-failfast --module examples.assertion_pipeline
promptry run doc-classify --module examples.assertion_pipeline