~/architecture/v07

One binary, one SQLite file, zero services.

Promptry is a library with a CLI, a FastAPI dashboard, a GitHub Action, and an MCP server stapled to the same local store. Your code calls track() and track_invocation(); everything else — evals, cost drill-down, the live prompt CMS, budgets, traces — reads and writes the same SQLite file. No queues, no daemons, no cloud.

~/system

The whole stack on one page.

Your code talks to the SDK. The SDK writes two ledgers to SQLite — versioned prompt templates and per-call invocations. Everything else reads from SQLite.

Architecture diagram: your code, promptry core, and consumers all read and write the same SQLite store.
~/data-model

Nine tables. That's it.

The entire product is a thin layer over this schema, applied through numbered migrations on a schema_version table. Open the database with sqlite3 and every read the dashboard makes is reproducible on the command line.

TABLE · prompts

  • idinteger pk
  • nametext
  • versioninteger
  • contenttext
  • hashtext
  • metadatajson
  • created_attext

One row per unique (name, hash). Dedup + auto-incremented version per name.

TABLE · prompt_tags

  • idinteger pk
  • prompt_idfk → prompts
  • tagtext

Named labels like prod, staging, dev. Promotion moves an env tag to one version, so render_prompt(env=…) serves exactly that one.

TABLE · eval_runs

  • idinteger pk
  • suite_nametext
  • prompt_nametext
  • prompt_versioninteger
  • model_versiontext
  • timestamptext
  • overall_passinteger
  • overall_scorereal

One row per suite execution. Drift and comparison queries read from here.

TABLE · eval_results

  • idinteger pk
  • run_idfk → eval_runs
  • test_nametext
  • assertion_typetext
  • passedinteger
  • scorereal
  • detailsjson
  • latency_msreal

Every assertion, every run. Semantic, judge, JSON, regex, grounding, tool-use, conversation.

TABLE · votes

  • idinteger pk
  • prompt_nametext
  • prompt_versioninteger
  • responsetext
  • scoreinteger
  • messagetext
  • metadatajson
  • created_attext

Thumbs up/down from users. Closes the loop from production back to the eval suite.

TABLE · datasets

  • idinteger pk
  • nametext
  • versioninteger
  • itemsjson
  • metadatajson
  • created_attext

Versioned test data. Pin a suite to a dataset version for reproducible runs.

TABLE · invocations

  • idinteger pk
  • prompt_nametext
  • prompt_versioninteger
  • metadatajson
  • input_texttext · opt
  • output_texttext · opt
  • request_idtext
  • created_attext

The per-call ledger written by track_invocation(). One row per LLM call — no dedup. Tokens, cost, latency and model live in metadata; request/response text is captured only when you opt in.

TABLE · feedback

  • idinteger pk
  • request_idtext
  • prompt_nametext
  • ratingreal
  • commenttext
  • sourcetext
  • created_attext

End-user ratings ingested via POST /api/feedback, correlated to the exact invocation by request_id. Joins back to a call's trace.

TABLE · budgets

  • idinteger pk
  • scopeglobal·module·prompt
  • targettext
  • perioddaily·monthly
  • limit_usdreal
  • created_attext

Spend caps. Current-period spend is summed from the invocations ledger on read; breaches surface in the dashboard.

~/lifecycle

What happens around one call & one suite.

End-to-end on a typical pipeline. Promptry's own overhead is dwarfed by the LLM calls themselves, and it never sits in the request's critical path.

0ms · track() call
Prompt body hashed
If the (name, hash) pair already exists, we return the existing row. Otherwise a new version is inserted.
+2ms · pipeline
Your LLM call runs
The prompt string flows through your pipeline unchanged. Promptry is out of the hot path.
after · track_invocation()
Per-call row appended
One row lands in invocations with tokens, model, latency. Cost is auto-computed from the rate table. No dedup — every call counts.
run · suite starts
Runner loads suites
Registered @suite functions are collected. A row is inserted in eval_runs with the current prompt + model version.
assert · scoring
Assertions resolve
Each assert_* call emits a row in eval_results with type, score, pass/fail, and details.
compare · baseline
Diff against baseline
If you passed --compare prod, the runner reads the tagged prompt's history and produces a score delta + per-assertion deltas.
after · drift
Drift recompute
Rolling window (default 30 runs) recomputes slope, z-score, and Mann-Whitney U on demand. No background jobs.
exit · CI
Exit code
0 on pass, 1 on regression. The GitHub Action turns that into a blocking PR status + comment.
prod · feedback
Ratings flow back
Your app POSTs an end-user rating to /api/feedback with the call's request_id; it lands in feedback, tied to that invocation.
budget · spend
Caps & coverage
Budget status sums spend from the ledger per period; the coverage check flags any model with no rate (silently $0).
debug · bisect
Regression bisect
Walk a suite's run history to the first passing→failing boundary, with the prompt/model delta that broke it.
read · consumers
Everything else reads
Dashboard, CLI, MCP server, the GitHub Action — all read the same store.db. No second source of truth.
~/decisions

Why it's built this way.

Every architectural choice traces back to the same constraint: no service to run, no vendor to trust.

Why SQLite?
Every developer machine has it. No server to run, no port to expose, no migration tooling. The write-throughput ceiling is far above what prompt evals produce, and WAL mode gives us safe concurrent reads while a suite is running.
Why content-hash prompts?
A prompt is its content. Two callers using the same system prompt should share a single history. Automatic dedup means you never have to remember to bump a version — edit the string, save the file, track() does the rest.
Why two ledgers instead of one?
track() versions a prompt's content and dedups by hash — the right shape for a registry. But a cost dashboard needs every call, even byte-identical ones, so track_invocation() appends to a separate invocations table with no dedup. Mixing the two would either double-count templates or silently drop repeat calls. The CMS template lives in prompts; the per-call payload and telemetry live in invocations.
Why a live prompt CMS?
render_prompt() serves a dashboard-edited $placeholder template (cached ~60s), falling back to the in-code default on any miss — a bad edit can't crash a request. Environment tags gate the rollout: edit freely, then promote a version to prod when it's ready. string.Template is used over str.format so literal braces in JSON examples never break substitution.
Why local-first?
Prompts leak business logic. Suites leak user data. Shipping any of it to a third-party vendor is a legal review you shouldn't have to ask for. Remote mode is opt-in and dual-writes, so local is still the source of truth.
Why no daemon?
Background processes die. Cron jobs fail silently. Running on every git push, every CI build, every manual promptry run is more reliable than any schedule. The optional monitor is a convenience, not the primary path.
Why MCP?
Once regressions are diagnosable from structured data, an agent can do the diagnosis for you. MCP is the cleanest way to expose that data to Claude, Cursor, Windsurf, and anything else that speaks the protocol — over plain stdio, no port to open.
Why React for the dashboard?
The dashboard is a React + Vite SPA served as static files by the same local FastAPI server that backs the JSON API. Reads (evals, cost, traces) are GETs; the writes are scoped and local — saving a prompt edit, promoting a version, setting a budget, ingesting feedback, or running the playground against a live model. No backend session, no account, no secrets in the database; API keys stay in your environment.