~/architecture/v07

One binary, one SQLite file, zero services.

Promptry is a library with a CLI, a dashboard, and an MCP server stapled to the same local store. Every component reads and writes the same SQLite database — no queues, no daemons, no cloud.

~/system

The whole stack on one page.

Your code talks to the SDK. The SDK talks to SQLite. Everything else reads from SQLite.

Architecture diagram: your code, promptry core, and consumers all read and write the same SQLite store.

~/data-model

Six tables. That's it.

The entire product is a thin layer over this schema. Open the database with sqlite3 and every read the dashboard makes is reproducible on the command line.
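
For instance, the dashboard's "latest run per suite" view is one query away, using nothing but the standard library. A minimal sketch (the exact SQL is illustrative, not the dashboard's own):

```python
import sqlite3

con = sqlite3.connect("store.db")
con.row_factory = sqlite3.Row

# The kind of read the dashboard makes: most recent run per suite.
# In SQLite, bare columns alongside MAX() come from the max row.
for r in con.execute("""
    SELECT suite_name, MAX(timestamp) AS last_run, overall_score
    FROM eval_runs
    GROUP BY suite_name
    ORDER BY last_run DESC
"""):
    print(r["suite_name"], r["last_run"], r["overall_score"])
```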

TABLE · prompts

  • id · integer pk
  • name · text
  • version · integer
  • content · text
  • hash · text
  • metadata · json
  • created_at · text

One row per unique (name, hash). Dedup + auto-incremented version per name.
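
A sketch of the resulting read path, assuming the schema above (the prompt name is hypothetical):

```python
import sqlite3

con = sqlite3.connect("store.db")
# Latest tracked version of one prompt ("summarize_ticket" is an
# illustrative name, not a real default).
row = con.execute("""
    SELECT version, content, hash
    FROM prompts
    WHERE name = ?
    ORDER BY version DESC
    LIMIT 1
""", ("summarize_ticket",)).fetchone()
```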

TABLE · prompt_tags

  • id · integer pk
  • prompt_id · fk → prompts
  • tag · text

Named labels like prod, canary. Any version can be compared against a tag.
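
Resolving a tag to its prompt row is a single join. Illustrative query, schema as above:

```python
import sqlite3

con = sqlite3.connect("store.db")
# Resolve 'prod' to the exact prompt version it points at, so any
# other version can be diffed against it.
baseline = con.execute("""
    SELECT p.name, p.version, p.content
    FROM prompt_tags t
    JOIN prompts p ON p.id = t.prompt_id
    WHERE t.tag = ?
""", ("prod",)).fetchone()
```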

TABLE · eval_runs

  • id · integer pk
  • suite_name · text
  • prompt_name · text
  • prompt_version · integer
  • model_version · text
  • timestamp · text
  • overall_pass · integer
  • overall_score · real

One row per suite execution. Drift and comparison queries read from here.
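
The drift window described under ~/lifecycle is just a LIMIT query here. A sketch (window size from the text; the prompt name is hypothetical):

```python
import sqlite3

con = sqlite3.connect("store.db")
# Last 30 overall scores for one prompt, newest first: the rolling
# window that drift detection reads.
scores = [s for (s,) in con.execute("""
    SELECT overall_score FROM eval_runs
    WHERE prompt_name = ?
    ORDER BY timestamp DESC
    LIMIT 30
""", ("summarize_ticket",))]
```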

TABLE · eval_results

  • id · integer pk
  • run_id · fk → eval_runs
  • test_name · text
  • assertion_type · text
  • passed · integer
  • score · real
  • details · json
  • latency_ms · real

Every assertion, every run. Semantic, judge, JSON, regex, grounding, tool-use, conversation.
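
Which makes a regression diagnosable with one query. Illustrative sketch (the run id is hypothetical):

```python
import json, sqlite3

con = sqlite3.connect("store.db")
# Every failed assertion in run 42 (hypothetical id), details decoded.
for name, atype, details in con.execute("""
    SELECT test_name, assertion_type, details
    FROM eval_results
    WHERE run_id = ? AND passed = 0
""", (42,)):
    print(name, atype, json.loads(details))
```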

TABLE · votes

  • id · integer pk
  • prompt_name · text
  • prompt_version · integer
  • response · text
  • score · integer
  • message · text
  • metadata · json
  • created_at · text

Thumbs up/down from users. Closes the loop from production back to the eval suite.
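
Closing the loop can be as plain as reading the worst-voted responses. A sketch (query and prompt name are illustrative):

```python
import sqlite3

con = sqlite3.connect("store.db")
# Lowest-scored responses first: candidates for new eval cases.
worst = con.execute("""
    SELECT response, score, message
    FROM votes
    WHERE prompt_name = ?
    ORDER BY score ASC, created_at DESC
    LIMIT 10
""", ("summarize_ticket",)).fetchall()
```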

TABLE · datasets

  • id · integer pk
  • name · text
  • version · integer
  • items · json
  • metadata · json
  • created_at · text

Versioned test data. Pin a suite to a dataset version for reproducible runs.
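
Pinning in practice, as a sketch (dataset name and version are hypothetical):

```python
import json, sqlite3

con = sqlite3.connect("store.db")
# Load one exact dataset version so the run is reproducible.
(items_json,) = con.execute(
    "SELECT items FROM datasets WHERE name = ? AND version = ?",
    ("support_tickets", 3),
).fetchone()
items = json.loads(items_json)
```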

~/lifecycle

What happens when you run one suite.

End-to-end on a typical pipeline. Promptry's own overhead is dwarfed by the LLM calls themselves.

0ms · track() call
Prompt body hashed
If the (name, hash) pair already exists, we return the existing row. Otherwise a new version is inserted (sketched in code after this timeline).
+2ms · pipeline
Your LLM call runs
The prompt string flows through your pipeline unchanged. Promptry is out of the hot path.
run · suite starts
Runner loads suites
Registered @suite functions are collected. A row is inserted in eval_runs with the current prompt + model version.
assert · scoring
Assertions resolve
Each assert_* call emits a row in eval_results with type, score, pass/fail, and details (also sketched after the timeline).
compare · baseline
Diff against baseline
If you passed --compare prod, the runner reads the tagged prompt's history and produces a score delta + per-assertion deltas.
after · drift
Drift recompute
Rolling window (default 30 runs) recomputes slope, z-score, and Mann-Whitney U on demand (sketched after the timeline). No background jobs.
exit · CI
Exit code
0 on pass, 1 on regression. The GitHub Action turns that into a blocking PR status + comment.
read · consumers
Everything else reads
Dashboard, MCP server, vote inbox — all read the same store.db. No second source of truth.
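
Three of those steps, sketched in code. These are illustrations against the schema in ~/data-model, not the SDK's internals; every function name below is hypothetical. First, the track() dedup:

```python
import hashlib, sqlite3

def track_sketch(con: sqlite3.Connection, name: str, content: str) -> int:
    # Hash the body; return the existing row on a (name, hash) hit,
    # otherwise insert the next version for this name.
    h = hashlib.sha256(content.encode()).hexdigest()
    row = con.execute(
        "SELECT id FROM prompts WHERE name = ? AND hash = ?", (name, h)
    ).fetchone()
    if row:
        return row[0]
    version = con.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM prompts WHERE name = ?",
        (name,),
    ).fetchone()[0]
    cur = con.execute(
        "INSERT INTO prompts (name, version, content, hash, created_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        (name, version, content, h),
    )
    con.commit()
    return cur.lastrowid
```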
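
Second, what "assertions resolve" means at the storage layer: one eval_results row per assert_* call. The column set follows the schema; the helper itself is illustrative:

```python
import json, sqlite3, time

def record_assertion(con, run_id, test_name, assertion_type,
                     passed, score, details, started_at):
    # One row per assertion, matching the eval_results columns.
    con.execute(
        "INSERT INTO eval_results "
        "(run_id, test_name, assertion_type, passed, score, details, latency_ms) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, test_name, assertion_type, int(passed), score,
         json.dumps(details), (time.perf_counter() - started_at) * 1000),
    )
```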
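
Third, the drift recompute over the rolling window of overall_score values. The statistics are the ones named in the text; exact formulas and thresholds are assumptions:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def drift_sketch(scores: list[float]):
    # scores: rolling window (default 30 runs), oldest first.
    x = np.arange(len(scores))
    slope = np.polyfit(x, scores, 1)[0]                   # linear trend
    z = (scores[-1] - np.mean(scores[:-1])) / (np.std(scores[:-1]) + 1e-9)
    older, newer = scores[: len(scores) // 2], scores[len(scores) // 2:]
    _, p = mannwhitneyu(older, newer)                     # distribution shift
    return slope, z, p
```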

~/decisions

Why it's built this way.

Every architectural choice traces back to the same constraint: no service to run, no vendor to trust.

Why SQLite?
Every developer machine has it. No server to run, no port to expose, no migration tooling. The write-throughput ceiling is far above what prompt evals produce, and WAL mode gives us safe concurrent reads while a suite is running.
Why content-hash prompts?
A prompt is its content. Two callers using the same system prompt should share a single history. Automatic dedup means you never have to remember to bump a version — edit the string, save the file, track() does the rest.
Why local-first?
Prompts leak business logic. Suites leak user data. Shipping any of it to a third-party vendor is a legal review you shouldn't have to ask for. Remote mode is opt-in and dual-writes, so local is still the source of truth.
Why no daemon?
Background processes die. Cron jobs fail silently. Running on every git push, every CI build, every manual promptry run is more reliable than any schedule. The optional monitor is a convenience, not the primary path.
Why MCP?
Once regressions are diagnosable from structured data, an agent can do the diagnosis for you. MCP is the cleanest way to expose that data to Claude, Cursor, Windsurf, and anything else that speaks the protocol — over plain stdio, no port to open (sketched at the end of this page).
Why React for the dashboard?
The dashboard is a React + Vite SPA served as static files. It connects to a tiny local FastAPI server that just proxies SQLite reads. No backend state, no session, no secrets — the whole UI is built once and deployed alongside the Python package.
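
A minimal sketch of that proxy shape; the route and query are assumptions, not the real endpoints:

```python
import sqlite3
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/runs")
def runs(suite_name: str):
    # Stateless read proxy: open, query, return. No sessions, no writes.
    con = sqlite3.connect("store.db")
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT * FROM eval_runs WHERE suite_name = ? "
        "ORDER BY timestamp DESC LIMIT 50",
        (suite_name,),
    ).fetchall()
    return [dict(r) for r in rows]
```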
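
And the stdio MCP sketch referenced above, built on the reference Python SDK (package `mcp`). The tool name and query are illustrative, not Promptry's actual server:

```python
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("promptry-sketch")

@mcp.tool()
def latest_runs(suite_name: str, limit: int = 5) -> list[dict]:
    """Recent eval runs for one suite, read straight from store.db."""
    con = sqlite3.connect("store.db")
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT timestamp, prompt_version, overall_pass, overall_score "
        "FROM eval_runs WHERE suite_name = ? "
        "ORDER BY timestamp DESC LIMIT ?",
        (suite_name, limit),
    ).fetchall()
    return [dict(r) for r in rows]

if __name__ == "__main__":
    mcp.run()  # stdio by default: no port to open
```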