Architecture
How promptry is built, what each component does, and how to approach modifications.
Project structure
Data flow
The two main flows in promptry: tracking prompts and running evals.
track() flow
track() returns the exact same string you passed in. The versioning is invisible. In production with mode = "async", the storage is wrapped in AsyncWriter. Prompt saves are still synchronous (callers need version numbers), but lighter writes like eval results go through the background thread.
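A minimal sketch of what this looks like at a call site. The `track(name, content)` signature and the stub body are assumptions for illustration; the real function also hashes the content and records a new version as a side effect.

```python
def track(name: str, content: str) -> str:
    # Stand-in for promptry's track(): the real function hashes `content`,
    # records a new version when the hash is unseen, then returns the input.
    return content

def build_prompt(question: str) -> str:
    # The wrapped string is byte-for-byte what was passed in, so track()
    # can be dropped into existing prompt-building code without refactoring.
    template = track("qa-prompt", "Answer concisely.\n\nQ: {q}")
    return template.format(q=question)
```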
Eval suite flow
config.py
Loads configuration by layering sources, each overriding the last: built-in defaults, promptry.toml in the current directory, ~/.promptry/config.toml, then environment variables. Config is loaded once on first access via get_config() and cached for the process lifetime. Call reset_config() to force a reload.
Env vars: PROMPTRY_DB, PROMPTRY_STORAGE_MODE, PROMPTRY_EMBEDDING_MODEL, PROMPTRY_SEMANTIC_THRESHOLD, PROMPTRY_WEBHOOK_URL, PROMPTRY_SMTP_PASSWORD.
models.py
All data classes live here to avoid circular imports. Nothing in this file imports from other promptry modules. If you need to add a new data shape that gets passed between modules, put it here.
| Class | Used by | Purpose |
|---|---|---|
| PromptRecord | storage, registry | A saved prompt version with id, name, version, hash, tags |
| SuiteResult | runner, comparison, notifications | Complete result of running a suite |
| TestResult | runner | One test function's result (assertions + timing) |
| ComparisonResult | comparison | Single metric comparison (baseline vs current) |
| RootCauseHint | comparison | Why a regression probably happened |
| DriftReport | drift | Score trend analysis over a window of runs |
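As an example of the pattern, PromptRecord might look like the sketch below. The field list is inferred from the table above and the SQLite schema later in this document; the real definition may differ in detail.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    # Fields inferred from the table above and the `prompts` table schema;
    # illustrative only.
    id: int
    name: str
    version: int
    content: str
    hash: str
    tags: list[str] = field(default_factory=list)
```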
storage/
BaseStorage is an abstract class defining the interface. SQLiteStorage is the default implementation. The get_storage() factory checks the config mode and wraps the storage in AsyncWriter if mode = "async".
SQLite schema has four tables:
- prompts — versioned prompt content with SHA-256 dedup (UNIQUE(name, hash))
- prompt_tags — tags (like "prod", "canary") attached to prompt versions
- eval_runs — suite execution records with overall score and pass/fail
- eval_results — individual assertion results within a run
To add a custom backend, subclass BaseStorage and implement all abstract methods.
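A sketch of a custom backend honoring the dedup and auto-increment rules described in "How to modify" below. The abstract method name `save_prompt` appears in this document, but the simplified base class and record shape here are assumptions.

```python
from abc import ABC, abstractmethod

class BaseStorage(ABC):
    # Simplified stand-in for storage/base.py; the real interface has more methods.
    @abstractmethod
    def save_prompt(self, name: str, content: str, hash: str): ...

class InMemoryStorage(BaseStorage):
    """Example custom backend: dedup on (name, hash), auto-increment per name."""
    def __init__(self):
        self._by_hash = {}     # (name, hash) -> record
        self._versions = {}    # name -> latest version number

    def save_prompt(self, name, content, hash):
        key = (name, hash)
        if key in self._by_hash:            # same name + same hash: return existing
            return self._by_hash[key]
        version = self._versions.get(name, 0) + 1   # per-name auto-increment
        self._versions[name] = version
        record = {"name": name, "version": version, "content": content, "hash": hash}
        self._by_hash[key] = record
        return record
```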
registry.py
track() is the primary integration point. It hashes content with SHA-256, checks an in-memory cache (bounded at 10k entries), and writes to storage only on new content. The cache means repeated calls with the same prompt are zero-cost.
track_context() does the same for retrieval chunks, with optional sampling (sample_rate) to reduce write volume in production.
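The hash-then-cache check can be sketched as below. The eviction policy (drop oldest) is an assumption; the document only states that the cache is bounded at 10k entries.

```python
import hashlib

_CACHE_MAX = 10_000
_seen: dict[str, None] = {}   # insertion-ordered dict standing in for the bounded cache

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def should_write(name: str, content: str) -> bool:
    """Only unseen (name, hash) pairs reach storage; repeats are zero-cost."""
    key = f"{name}:{content_hash(content)}"
    if key in _seen:
        return False
    if len(_seen) >= _CACHE_MAX:       # keep memory bounded at ~10k entries
        _seen.pop(next(iter(_seen)))   # evict the oldest entry (assumed policy)
    _seen[key] = None
    return True
```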
evaluator.py
The @suite decorator registers test functions in a global dict. run_context() sets up a thread-local list where assertions append their results. This keeps the user API clean: just call assert_semantic() inside your function, and the runner collects everything automatically. Duplicate suite names emit a warning.
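The thread-local collection mechanism might look like this sketch (exact names and cleanup behavior are assumptions):

```python
import threading
from contextlib import contextmanager

_local = threading.local()

@contextmanager
def run_context():
    # Each thread gets its own result list, so concurrent runs don't interleave.
    _local.results = []
    try:
        yield _local.results
    finally:
        _local.results = []  # fresh list; the yielded one keeps its contents

def append_result(result: dict) -> None:
    # Assertions call this; nothing needs to be passed through user code.
    _local.results.append(result)
```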
assertions.py
Each assertion function follows the same pattern: evaluate the condition, append an AssertionResult to the current run context, raise AssertionError on failure, return the score.
| Assertion | What it checks | Cost |
|---|---|---|
| assert_schema | Data matches a Pydantic model | Free |
| assert_semantic | Cosine similarity of embeddings | ~50ms |
| assert_llm | LLM grades response against criteria | ~1-5s (one LLM call per assertion) |
assert_semantic lazy-loads the embedding model on first use. The model name and similarity threshold are read from config, overridable via set_model().
assert_llm requires a judge function set via set_judge(fn). The function takes a string and returns a string. Works with any LLM provider.
runner.py
Executes a suite by calling the registered function inside a run_context(), catches assertion failures, computes the overall score (mean of assertion scores), and stores everything to the database.
comparison.py
Compares a current run against a baseline. First tries to find a run matching the baseline tag (e.g., "prod"), falls back to the most recent previous run. Generates root cause hints:
- Prompt version changed → "Prompt changed (v3 → v4)"
- Model version changed → "Model changed (gpt-4 → gpt-4o)"
- Neither changed but scores dropped → "Possible retrieval drift"
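The hint rules above can be sketched as a small function; the dict keys and return type here are illustrative stand-ins for the real RootCauseHint objects.

```python
def root_cause_hints(baseline: dict, current: dict) -> list[str]:
    """Apply the three hint rules in order; keys are illustrative."""
    hints = []
    if baseline["prompt_version"] != current["prompt_version"]:
        hints.append(f"Prompt changed (v{baseline['prompt_version']} → v{current['prompt_version']})")
    if baseline["model_version"] != current["model_version"]:
        hints.append(f"Model changed ({baseline['model_version']} → {current['model_version']})")
    if not hints and current["score"] < baseline["score"]:
        hints.append("Possible retrieval drift")  # neither input changed, yet scores fell
    return hints
```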
drift.py
Fits a linear regression (OLS, no numpy) to the last N scores. If the slope is steeper than -threshold, the suite is flagged as drifting. Catches slow degradation that single-run comparisons miss.
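A numpy-free OLS slope over run index is a few lines of stdlib Python. This is a sketch of the technique described, not the module's actual code; the default threshold is illustrative.

```python
def ols_slope(scores: list[float]) -> float:
    """Least-squares slope of scores regressed on run index (0, 1, 2, ...)."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0

def is_drifting(scores: list[float], threshold: float = 0.01) -> bool:
    # Flag when scores decline faster than `threshold` per run.
    return ols_slope(scores) < -threshold
```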
scheduler.py
Spawns a subprocess that loops: import module, run suite, check drift, notify on regression, sleep, repeat. State is managed via PID files in ~/.promptry/.
Cross-platform: on Windows uses CREATE_NO_WINDOW and ctypes.OpenProcess for PID checking. On Unix uses start_new_session and os.kill(pid, 0).
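The liveness check might be structured like this sketch; the Windows access-rights constant and error handling are assumptions beyond what the document states.

```python
import os
import sys

def pid_alive(pid: int) -> bool:
    """Cross-platform check that a PID refers to a running process."""
    if sys.platform == "win32":
        import ctypes
        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000  # assumed access right
        handle = ctypes.windll.kernel32.OpenProcess(
            PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if handle:
            ctypes.windll.kernel32.CloseHandle(handle)
            return True
        return False
    try:
        os.kill(pid, 0)   # signal 0 checks existence without sending a signal
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # process exists but belongs to another user
```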
notifications.py
Sends alerts when the monitor detects a regression or drift. Two channels:
- Webhooks: POSTs JSON to any URL (Slack, Discord, custom). URL validated to be http/https.
- Email: SMTP with TLS. Credentials should use env vars (PROMPTRY_SMTP_PASSWORD).
Failures are logged but never crash the monitor.
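A sketch of the webhook path using only the stdlib; the payload fields are illustrative and the real module presumably logs rather than silently passing.

```python
import json
import urllib.request
from urllib.parse import urlparse

def build_alert(suite: str, score: float, baseline: float) -> bytes:
    # Payload shape is an assumption; real field names may differ.
    return json.dumps({"suite": suite, "score": score, "baseline": baseline}).encode()

def post_webhook(url: str, payload: bytes) -> None:
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("webhook URL must be http or https")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # failures are logged in the real module, never crash the monitor
```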
templates.py
25+ built-in attack prompts across 6 categories: prompt injection, jailbreak, PII leakage, hallucination triggers, context boundary, encoding/obfuscation. Custom templates loaded from templates.toml or [[custom_templates]] in promptry.toml. Merged with built-ins automatically.
writer.py
AsyncWriter wraps any BaseStorage with a background thread. save_prompt and save_eval_run are synchronous (callers need the returned IDs). Lighter writes (save_eval_result, tag_prompt) are queued. Uses queue.Queue with a configurable max size (default 10k).
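The split between synchronous and queued writes can be sketched as below; the queue item shape and drain loop are assumptions, and the real class wraps more methods.

```python
import queue
import threading

class AsyncWriter:
    """Sketch: heavy writes stay synchronous, light writes are queued."""
    def __init__(self, storage, max_queue: int = 10_000):
        self._storage = storage
        self._queue: queue.Queue = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def save_prompt(self, *args, **kwargs):
        # Synchronous: callers need the returned version/ID immediately.
        return self._storage.save_prompt(*args, **kwargs)

    def save_eval_result(self, *args, **kwargs):
        # Fire-and-forget: executed later on the background thread.
        self._queue.put(("save_eval_result", args, kwargs))

    def _drain(self):
        while True:
            method, args, kwargs = self._queue.get()
            getattr(self._storage, method)(*args, **kwargs)
            self._queue.task_done()
```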
cli.py
Built with Typer. All commands return exit code 0 on success, 1 on regression/failure. CI-friendly: promptry run my-suite --module evals fails the pipeline if scores regressed.
How to modify
Adding a new assertion type
- Add your function to assertions.py. Follow the existing pattern: evaluate, call append_result(), raise on failure, return the score.
- Export it from __init__.py.
- Add the assertion type string to the loop in comparison.py so baseline comparisons track it.
- Write tests in test_assertions.py.
Adding a new storage backend
- Subclass BaseStorage from storage/base.py.
- Your save_prompt must handle dedup (same name + same hash = return existing record).
- Your save_prompt must auto-increment version numbers per prompt name.
Adding a new safety template category
- Add SafetyTemplate instances to _TEMPLATES in templates.py. Follow the ID convention: category-NNN.
- Or add them to templates.toml in your project.
Adding a new CLI command
- Use @app.command() for top-level commands, or create a sub-app with typer.Typer().
- Use raise typer.Exit(code=1) for failures so CI gets the right exit code.
Adding a new config section
- Add a dataclass to config.py.
- Add it as a field on the Config class.
- Handle it in _apply_toml() and optionally _apply_env_overrides().
Design decisions
| Decision | Choice | Why |
|---|---|---|
| Primary API | track() | One-line integration, zero refactoring |
| Storage | Raw sqlite3 | Simple schema, stdlib, no deps |
| Config | TOML | Python standard (tomllib in 3.11+) |
| Data models | dataclasses | Pydantic only for user-facing assert_schema |
| Assertion collection | Thread-local context | Clean user API, no wiring needed |
| Embedding model | Lazy-loaded on first use | No startup cost, downloads once (~80MB) |
| Trend detection | Manual OLS regression | Avoids numpy dependency |
| Background daemon | subprocess + PID file | Cross-platform, no extra deps |
| LLM judge | User-provided callable | Provider-agnostic, no openai dep |
| Safety templates | Data-driven list | Easy to extend, data not logic |
SQLite schema
```sql
CREATE TABLE prompts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    metadata TEXT, -- JSON
    created_at TEXT DEFAULT (datetime('now')),
    UNIQUE(name, version),
    UNIQUE(name, hash) -- dedup
);
CREATE TABLE prompt_tags (
    prompt_id INTEGER REFERENCES prompts(id),
    tag TEXT NOT NULL,
    UNIQUE(prompt_id, tag)
);
CREATE TABLE eval_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    suite_name TEXT NOT NULL,
    prompt_name TEXT,
    prompt_version INTEGER,
    model_version TEXT,
    timestamp TEXT DEFAULT (datetime('now')),
    overall_pass INTEGER DEFAULT 1,
    overall_score REAL
);
CREATE TABLE eval_results (
    run_id INTEGER REFERENCES eval_runs(id),
    test_name TEXT NOT NULL,
    assertion_type TEXT NOT NULL,
    passed INTEGER NOT NULL,
    score REAL,
    details TEXT, -- JSON
    latency_ms REAL
);
```