Installation
Promptry ships as a Python package (the core library, CLI, dashboard server, and MCP server) and a matching JavaScript client. The Python side reads and writes .promptry/store.db directly, and the JS client ships events to the same store through the ingest endpoint, so you can mix languages in a single repo without a second source of truth.
# Python: core + CLI
$ pip install promptry
# Optional extras
$ pip install 'promptry[semantic]' # local embeddings (all-MiniLM-L6-v2)
$ pip install 'promptry[dashboard]' # FastAPI + uvicorn for the web UI
# JavaScript / TypeScript: remote tracking only
$ npm install promptry-js
The Python install has no runtime dependencies beyond the standard library and SQLite. Extras are opt-in so a CI runner that only needs promptry run stays under 10 MB.
Track your prompts
Add one line, don't change anything else:
from promptry import track
prompt = track("You are a helpful assistant...", "rag-qa")
response = llm.chat(system=prompt, ...)
track() returns the same string it was given. Behind the scenes it content-hashes the body and inserts a new row only if the content changed. Two callers using the same prompt body share a single row and a single history.
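A quick sketch of that dedup behaviour (the prompt bodies are illustrative):
from promptry import track

# identical body + name: the existing row is reused, no new version
p1 = track("You are a helpful assistant...", "rag-qa")
p2 = track("You are a helpful assistant...", "rag-qa")

# change the body and the next call records a new version under the same name
p3 = track("You are a terse assistant. Answer in one sentence...", "rag-qa")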
Works the same if your prompt is built inside a function:
def call_rag(question, context, prompt_name="rag-qa"):
    system = track(
        f"Answer using only this context:\n{context}",
        prompt_name,
    )
    return llm.chat(system=system, user=question)
Track retrieval context
For RAG pipelines, track the retrieved chunks alongside the prompt. When a regression shows up, you can tell whether it was the prompt or the retrieval that changed.
from promptry import track, track_context
prompt = track(system_prompt, "rag-qa")
chunks = track_context(retrieved_chunks, "rag-qa")
response = llm.chat(system=prompt, context=chunks, user=query)
In production you probably don't want to write every call. Sample at the call site:
track_context(chunks, "rag-qa", sample_rate=0.1) # 10% of calls
Or set a default in promptry.toml:
[tracking]
context_sample_rate = 0.1
Write eval suites
A suite is a plain Python function, registered with @suite:
from promptry import suite, assert_semantic
@suite("rag-regression")
def test_rag_quality():
response = my_pipeline("What is photosynthesis?")
assert_semantic(response, "Photosynthesis converts light into chemical energy")
Then run it:
$ promptry run rag-regression --module my_evals
PASS test_rag_quality (142ms)
semantic (0.891) ok
Overall: PASS score: 0.891
Assertions
Promptry ships a family of assertion primitives. Each one records its score, pass/fail, and details into eval_results, so the dashboard and the comparison engine see every check — not just the first failure.
Semantic similarity
Local embedding cosine similarity. Fast, free, no provider key.
from promptry import assert_semantic
assert_semantic(response, "An explanation of machine learning concepts")
Requires promptry[semantic]. The first call downloads all-MiniLM-L6-v2 (~80 MB) into the HuggingFace cache. Subsequent calls load from disk in under 500 ms.
LLM-as-judge
Embedding similarity catches "are these roughly about the same thing?" It can't judge tone, correctness, or whether the response actually followed instructions. assert_llm uses an LLM to grade responses against criteria you define.
First wire up your judge. Any callable that takes a string and returns a string works:
from promptry import set_judge
from openai import OpenAI
client = OpenAI()
def my_judge(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

set_judge(my_judge)
Then use it:
@suite("rag-regression")
def test_rag_quality():
response = my_pipeline("What is photosynthesis?")
# fast, local, free
assert_semantic(response, "Photosynthesis converts light into chemical energy")
# slower, but catches things embeddings can't
assert_llm(
response,
criteria="Accurately explains photosynthesis using only the provided context, "
"without hallucinating facts not in the source material.",
threshold=0.7,
)
The judge is provider-agnostic — OpenAI, Anthropic, local models, whatever you already use.
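For instance, an Anthropic-backed judge is the same shape. A minimal sketch, assuming the standard anthropic SDK; the model name is illustrative:
from promptry import set_judge
from anthropic import Anthropic

client = Anthropic()

def claude_judge(prompt: str) -> str:
    # model name is illustrative — use whichever Claude model you already run
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

set_judge(claude_judge)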
JSON + schema
Most LLM pipelines return JSON. assert_json_valid handles the messy reality: markdown fences, trailing commas, leading prose.
from promptry import assert_json_valid, clean_json, assert_schema
from pydantic import BaseModel
class PricingModel(BaseModel):
    vendor: str
    total_value: float
    currency: str

response = my_pipeline(document)

# gate: parseable JSON at all?
assert_json_valid(response)

# get the cleaned, parsed object
data = clean_json(response)

# validate schema
assert_schema(data, PricingModel)
clean_json() is a standalone utility — use it anywhere you need to extract JSON from LLM output:
clean_json('{"key": "value"}')
clean_json('```json\n{"key": "value"}\n```')
clean_json('Here is the JSON: {"key": "value",}') # trailing comma fixed
Regex matching
assert_matches checks a response matches a pattern. Fullmatch by default (entire response must match), or set fullmatch=False for partial search.
# classification must be exactly one of these words
assert_matches(classify(doc), r"(tender|rfp|rfq|eoi)")
# response contains an email somewhere
assert_matches(response, r"[\w.+-]+@[\w-]+\.[\w.]+", fullmatch=False)
Factual grounding
assert_grounded uses an LLM judge to verify that facts in a response actually exist in the source document. It decomposes the response into claims and checks each one:
from promptry import assert_grounded
assert_grounded(
    response=extract_pricing(document),
    source=document,
    threshold=0.9,  # strict for financial data
)
On failure, the details show exactly what was fabricated:
AssertionError: Grounding score 0.500 < threshold 0.9.
Fabricated: 3 phases; 15,00,000 per phase
The result stores a claim-by-claim breakdown under details["claims"], which the dashboard renders as a coloured list. Requires a judge, configured via set_judge().
Tool-use traces
When the thing you're testing is an agent, you often care less about the final text and more about how it got there: which tools it called, in what order, with what arguments. Three assertions work on a trace — a list of tool calls.
from promptry import (
    assert_tool_called, assert_tool_sequence, assert_no_tool_called,
)

trace = [
    {"name": "search", "args": ["python tutorials"], "kwargs": {"limit": 10}},
    {"name": "summarize", "args": ["..."], "kwargs": {}},
    {"name": "rank", "args": [], "kwargs": {"top_k": 3}},
]

assert_tool_called(trace, "search", kwargs={"limit": 10})
assert_tool_sequence(trace, ["search", "rank"])  # subsequence, not adjacency
assert_no_tool_called(trace, "delete_database")  # safety invariant
The trace format is permissive — raw OpenAI tool_calls and Anthropic tool_use blocks are normalized automatically.
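So if your agent loop already holds the raw response, you can hand it over directly. A sketch assuming an OpenAI-style completion, where client, messages, and tools are whatever your agent already uses:
completion = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools,
)
raw_calls = completion.choices[0].message.tool_calls or []

# raw tool_calls objects are accepted as the trace
assert_tool_called(raw_calls, "search")
assert_no_tool_called(raw_calls, "delete_database")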
Conversations
For chatbots, copilots, and agents that hold state across turns, promptry has a first-class Conversation data model and conversation-level assertions.
from promptry import (
    Conversation,
    assert_conversation_length, assert_all_assistant_turns,
    assert_conversation_coherent, assert_no_repetition,
)

conv = Conversation()
conv.add("user", "Hi, what's the weather?")
conv.add("assistant", my_chatbot(conv))
conv.add("user", "And tomorrow?")
conv.add("assistant", my_chatbot(conv))

assert_conversation_length(conv, min_turns=2, max_turns=20)
assert_all_assistant_turns(conv, lambda t: assert_contains(t, ["weather"]))
assert_conversation_coherent(conv, threshold=0.4)
assert_no_repetition(conv, similarity_threshold=0.92)
If you already have a messages list from the SDK you use, drop it in directly — Conversation.from_openai(messages) or Conversation.from_anthropic(messages) normalize tool calls and multimodal content into turns.
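For example, replaying an existing OpenAI-style history through the same assertions (a minimal sketch; the messages are illustrative):
messages = [
    {"role": "user", "content": "Hi, what's the weather?"},
    {"role": "assistant", "content": "Cloudy with light rain this afternoon."},
    {"role": "user", "content": "And tomorrow?"},
    {"role": "assistant", "content": "Tomorrow looks dry and sunny."},
]

conv = Conversation.from_openai(messages)
assert_conversation_coherent(conv, threshold=0.4)
assert_no_repetition(conv, similarity_threshold=0.92)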
Chain with check_all
By default, assertions stop at the first failure. check_all() runs every check and gives you a complete report:
@suite("pricing-pipeline")
def test_pricing():
response = pipeline(document)
data = clean_json(response)
check_all(
lambda: assert_json_valid(response),
lambda: assert_schema(data, PricingModel),
lambda: assert_grounded(response, document),
lambda: assert_contains(response, ["total_value", "currency"]),
)
If 2 out of 4 fail, you get one error with everything — and the runner still sees every check for drift and comparison purposes.
Cost tracking
Pass token and cost metadata to track(). Promptry rolls it up by prompt name, model, and day.
response = llm.chat(system=prompt, ...)
track(prompt, "pricing-extract", metadata={
"tokens_in": response.usage.prompt_tokens,
"tokens_out": response.usage.completion_tokens,
"cached_tokens": response.usage.prompt_tokens_details.cached_tokens,
"model": "gpt-4o",
"cost": 0.003,
})
$ promptry cost-report --days 30
Cost report (last 30 days)
┌──────────────────┬───────┬───────────┬────────────┬─────────┐
│ Prompt │ Calls │ Tokens In │ Tokens Out │ Cost │
├──────────────────┼───────┼───────────┼────────────┼─────────┤
│ pricing-extract │ 847 │ 423,500 │ 84,700 │ $2.5410 │
│ doc-classify │ 1,203 │ 120,300 │ 1,203 │ $0.1203 │
├──────────────────┼───────┼───────────┼────────────┼─────────┤
│ Total │ 2,050 │ 543,800 │ 85,903 │ $2.6613 │
└──────────────────┴───────┴───────────┴────────────┴─────────┘
Cache awareness
LLM providers each expose prompt caching differently. Promptry reads the cache usage fields that each provider reports, calculates the right cost, and shows the hit rate in promptry cost-report and the dashboard.
- OpenAI (GPT-4o, 4.1, etc.): automatic caching for prompts > 1024 tokens, ~50% discount on cached reads. Reported as usage.prompt_tokens_details.cached_tokens.
- Anthropic (Claude): explicit opt-in via cache_control: {"type": "ephemeral"}. Cached reads are 10% of base rate (90% off); cache writes are 125%–200% depending on TTL. Reported as usage.cache_read_input_tokens and usage.cache_creation_input_tokens.
- Gemini: explicit via the cachedContents API, best for long documents queried repeatedly.
- Grok: like OpenAI, automatic for long prompts, ~25% discount.
Optimization tip. Put static content (system prompt, long docs) at the beginning of the prompt. Prefix matching means earlier content is reused across queries. Monitor cache hit rate via promptry cost-report — if < 30% for a frequently called prompt, there's room to restructure.
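A sketch of that ordering, reusing the illustrative llm.chat client from above — SYSTEM_PROMPT and reference_docs stand in for your own static content:
# static, cache-friendly prefix first: identical across requests
system = track(SYSTEM_PROMPT + "\n\n" + reference_docs, "rag-qa")

# dynamic part last, after the cacheable prefix
response = llm.chat(system=system, user=question)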
Model comparison
When you're evaluating a model upgrade, promptry compares the candidate against the full statistical distribution of your baseline model's history, not a single snapshot.
# already running evals with gpt-4o for weeks
$ promptry run rag-regression --module evals --model-version gpt-4o
# try claude-sonnet-4
$ promptry run rag-regression --module evals --model-version claude-sonnet-4
# compare candidate against baseline history
$ promptry compare rag-regression --candidate claude-sonnet-4
Model comparison: gpt-4o (47 runs) vs claude-sonnet-4 (1 runs)

                    gpt-4o            claude-sonnet-4
Overall score       0.887 +/- 0.031   0.921
                    [0.821, 0.943]    +0.034 (89th pctl)

By assertion type:
  json_valid        0.980 +/- 0.020   1.000   [+] better
  grounding         0.850 +/- 0.050   0.910   [+] better
  schema            0.970 +/- 0.030   0.940   [~] comparable
  semantic          0.860 +/- 0.040   0.900   [+] better

Cost:               $0.0050           $0.0030
                    Candidate is 40% cheaper

Verdict: SWITCH
  Candidate scores +0.034 higher (above 89th percentile of baseline).
  Also 40% cheaper. Watch: schema slightly lower.
You get mean, variance, percentiles, per-assertion trends, and cost efficiency — statistical confidence, not a single data point.
Baseline comparison
Tag whatever version you know works. Then check future runs against it.
$ promptry prompt tag rag-qa 3 prod
Tagged rag-qa v3 as prod
$ promptry run rag-regression --module my_evals --compare prod
PASS test_rag_quality (142ms)
semantic (0.891) ok
Overall: PASS score: 0.946
Comparing against prod baseline:
Overall score: 0.910 -> 0.946 ok
If scores dropped, promptry tells you what changed:
Overall score: 0.910 -> 0.720 REGRESSION
Probable cause:
-> Prompt changed (v3 -> v4)
Drift detection
See if scores are trending down over time. Drift detection runs on demand — no background job required.
$ promptry drift rag-regression --module my_evals
Suite: rag-regression
Window: 22/30 runs
Latest score: 0.820
Mean +/- stddev: 0.876 +/- 0.041
Latest z-score: -1.37
Slope: -0.0072
Significance (recent vs older half): p=0.018
Confidence: high
Status: DRIFTING (slope < -0.005)
Statistical rigor
Promptry computes three signals over the window (default 30 runs):
- OLS linear slope. Steep negative slope means sustained downward trend.
- Z-score of the latest run vs the window's mean and stddev — tells you how unusual the most recent score is.
- Mann-Whitney U p-value comparing the recent half of the window against the older half. Non-parametric rank-sum test; doesn't assume normality.
The confidence field combines all three into one label:
- insufficient — fewer than 10 runs in the window.
- low — scores stable.
- medium — slope trending down or recent half significantly lower, but not both.
- high — slope trending down and p < 0.05.
What it doesn't do. It's not a change-point detector (use CUSUM or Bayesian online CPD for that). It doesn't correct for multiple comparisons across suites — if you run drift on 50 suites with p < 0.05 you'll see ~2.5 false positives by chance, so apply Bonferroni (p < 0.05 / num_suites) manually if that matters. Ties in scores aren't corrected in the U statistic; with continuous LLM scores this rarely matters. Small samples are flagged — fewer than 16 runs returns p = None because the normal approximation needs ~8 per group.
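As a quick illustration of that correction:
# Bonferroni: divide the significance level by the number of suites you test
num_suites = 50
alpha = 0.05 / num_suites   # 0.001 — only treat drift p-values below this as significant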
Background monitor
Runs your evals on a schedule in a detached process. Dev-only convenience — for production use cron or a CI schedule instead.
$ promptry monitor start rag-regression --module my_evals --interval 60
Monitor started (PID 48291)
$ promptry monitor status
Suite: rag-regression
Interval: 60m
Started: 2026-03-04T14:30:00
Last run: 2026-03-04T15:30:00
Last score: 0.946
Drift: stable
$ promptry monitor stop
Spawns a subprocess (not a thread). On Unix it uses start_new_session to detach from the terminal; on Windows it uses CREATE_NO_WINDOW. PID and state live in ~/.promptry/. If the process crashes, monitor status detects the stale PID and cleans up.
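For production, a plain cron entry that runs the same suite on a schedule does the job — for example, daily at 06:00 (paths and suite name are illustrative):
0 6 * * * cd /srv/myapp && promptry run rag-regression --module evals --compare prod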
Safety templates
25+ built-in attack prompts to test how your pipeline handles adversarial inputs: prompt injection, jailbreaks, PII fishing, hallucination triggers, encoding tricks.
$ promptry templates list
$ promptry templates run --module my_app
PASS injection-001 Direct override (0.72)
PASS injection-002 Context injection (0.68)
FAIL jailbreak-001 DAN (Do Anything Now) (0.31)
Semantic similarity to expected safe behavior: 0.310
Results: 24 passed, 1 failed out of 25
Templates are graded by comparing the response against refusal anchors vs compliance anchors using semantic similarity. If you've configured a judge via set_judge(), it uses that instead for more accurate grading. Add your own in templates.toml.
Storage modes
By default track() writes to SQLite synchronously. For production you can change that in promptry.toml:
[storage]
mode = "async" # writes go to a background thread
# mode = "off" # disables writes entirely; track() passes through
# mode = "remote" # dual-write local + HTTP POST to a central endpoint
- sync — default, inline writes. Fine for dev and testing.
- async — background thread handles writes, track() returns immediately.
- remote — dual-writes to local SQLite + batched HTTP POST. Centralize telemetry from multiple services (see the config sketch after this list).
- off — no writes at all. Use this if you only manage prompts through the CLI.
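For remote mode you also point promptry at your ingest endpoint in the storage config. A sketch — the endpoint field name here is an assumption, so check the config reference for the exact key:
[storage]
mode = "remote"
# key name assumed for illustration
endpoint = "https://your-server.com/ingest"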
Watch mode
Like pytest --watch for prompts. promptry watch re-runs your suites every time a file in your eval module changes.
# watch the default module (evals.py)
promptry watch
# watch a single suite, compare against prod on every run
promptry watch rag-regression --compare prod
# tweak the debounce window if your editor fires many save events
promptry watch --debounce 300
On each change, it clears the screen, reloads the module fresh (clearing the suite registry so stale definitions don't linger), and runs again. Import errors and suite exceptions print inline — fix and save to retry.
CI + GitHub Action
Promptry ships a composite action that runs your suite on every PR, posts a comment with the diff, and fails the build on regression.
# .github/workflows/eval.yml
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: bihanikeshav/promptry@v0.6.0
        with:
          suite: rag-regression
          module: evals
          compare: prod
          pr-comment: "true"
The comment is edited in place on subsequent pushes, so PRs don't get spammed. Regressions are surfaced when an assertion score drops by more than 0.05 against the previous run, or when a previously-passing test starts failing. Under the hood the action runs promptry run ... --markdown <file> — you can invoke the same flag locally to preview what the bot will post.
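For example, to preview the report locally:
$ promptry run rag-regression --module evals --compare prod --markdown report.md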
Dashboard
A web UI for eval history, prompt diffs, model comparisons, and cost data. Data never leaves your machine — the hosted UI connects to your local server.
$ pip install 'promptry[dashboard]'
$ promptry dashboard # :8420, opens hosted UI
$ promptry dashboard --local # open localhost instead
$ promptry dashboard --port 9000 # custom port
Pages: Overview (all suites with pass/fail, sparklines, drift), Suite Detail (score history, assertion breakdown, root cause), Run Detail (per-assertion results with grounding claim breakdowns), Prompts (version history with git-diff style comparison), Models (statistical comparison + SWITCH/KEEP verdict), Cost (token + cost over time), and the Playground for live prompt iteration.
MCP server
Promptry includes a built-in MCP server so any LLM agent can manage prompts, run evals, compare models, check drift, and run safety audits through tool calls.
$ promptry mcp
This starts a stdio-based server. Add it to your editor:
Claude Code
pip install promptry
claude mcp add promptry -- promptry mcp
Claude Desktop
Edit claude_desktop_config.json (~/Library/Application Support/Claude/ on macOS, %APPDATA%\Claude\ on Windows):
{
  "mcpServers": {
    "promptry": {
      "command": "promptry",
      "args": ["mcp"]
    }
  }
}
Cursor / Windsurf / VS Code
Same config, different file: .cursor/mcp.json, ~/.codeium/windsurf/mcp_config.json, or .vscode/mcp.json (the VS Code one nests under "servers" instead of "mcpServers").
Virtualenv tip. promptry must be on the agent's PATH. Either use the full path (/path/to/venv/bin/promptry), or run without a global install via uvx:
claude mcp add promptry -- uvx promptry mcp
# Or in a JSON config:
"command": "uvx", "args": ["promptry", "mcp"]
Available tools: prompt_list, prompt_show, prompt_diff, prompt_save, prompt_tag, list_suites, run_eval, check_drift, compare_models, cost_report, list_templates, run_safety_audit, monitor_status. All return plain text so agents can reason about the results directly.
JavaScript client
promptry-js is a lightweight JS/TS client that ships prompt tracking events to the same ingest endpoint as the Python RemoteStorage backend. Zero runtime dependencies, ~5 KB minified, works in browsers and Node 18+.
import { init, track, trackContext, flush } from 'promptry-js';
init({ endpoint: 'https://your-server.com/ingest' });
// returns content unchanged, ships event in background
const prompt = track(systemPrompt, 'rag-qa');
const chunks = trackContext(retrievedChunks, 'rag-qa');
await flush();
The JS client only ships events. All heavy lifting (evals, drift, comparison) stays in Python — both clients POST to the same endpoint, so all telemetry lands in the same place.
CLI reference
Every command supports --help for full usage.
# scaffold a new project
promptry init
# prompts
promptry prompt save prompt.txt --name rag-qa --tag prod
promptry prompt list
promptry prompt show rag-qa
promptry prompt diff rag-qa 1 2
promptry prompt tag rag-qa 3 canary
# evals
promptry run <suite> --module <mod> [--compare prod]
promptry suites --module <mod>
promptry drift <suite> --module <mod>
promptry watch [suite] [--module <mod>] [--compare prod]
# cost + comparison
promptry cost-report [--days 7] [--name <prompt>] [--model <model>]
promptry compare <suite> --candidate <model> [--baseline <model>]
# monitoring
promptry monitor start <suite> --module <mod> [--interval 1440]
promptry monitor stop
promptry monitor status
# safety
promptry templates list [--category <cat>]
promptry templates run --module <mod> [--func <name>] [--category <cat>]
# integrations
promptry dashboard [--port 8420] [--no-open] [--local]
promptry mcp
Exit code 0 on success, 1 on regression — so promptry run drops straight into a CI step.
Config
Drop a promptry.toml in your project root:
[storage]
db_path = "~/.promptry/promptry.db"
mode = "sync"
[tracking]
sample_rate = 1.0
context_sample_rate = 0.1
[model]
embedding_model = "all-MiniLM-L6-v2"
semantic_threshold = 0.8
[monitor]
interval_minutes = 1440
threshold = 0.05
window = 30
[notifications]
webhook_url = "https://hooks.slack.com/services/..."
email = "alerts@example.com"
Override with env vars: PROMPTRY_DB, PROMPTRY_STORAGE_MODE, PROMPTRY_EMBEDDING_MODEL, PROMPTRY_SEMANTIC_THRESHOLD, PROMPTRY_WEBHOOK_URL, PROMPTRY_SMTP_PASSWORD.
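For example, a CI job might point at a throwaway database, or turn tracking writes off entirely:
export PROMPTRY_DB=/tmp/promptry-ci.db
# or skip writes altogether
export PROMPTRY_STORAGE_MODE=off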
Examples
Check the examples/ directory for working demos:
- basic_rag.py — self-contained RAG pipeline with tracking, eval suites, and safety testing. No API keys needed.
- llm_judge.py — wiring up assert_llm with OpenAI, Anthropic, or local models.
- assertion_pipeline.py — chaining assert_json_valid, assert_matches, assert_grounded, and check_all into validation pipelines for document extraction.
pip install -e .
python examples/basic_rag.py
python examples/assertion_pipeline.py
promptry run pricing-failfast --module examples.assertion_pipeline
promptry run doc-classify --module examples.assertion_pipeline