Architecture

How promptry is built, what each component does, and how to approach modifications.

Project structure

promptry/
    __init__.py        # public API exports
    config.py          # TOML config loading, env var overrides
    models.py          # all dataclasses (PromptRecord, SuiteResult, DriftReport, etc.)
    registry.py        # prompt versioning, track(), track_context()
    evaluator.py       # @suite decorator, thread-local result collection
    assertions.py      # assert_semantic, assert_schema, assert_llm
    runner.py          # suite execution, stores results to DB
    comparison.py      # baseline comparison, root cause analysis
    drift.py           # score trend detection via linear regression
    scheduler.py       # background daemon (PID file, subprocess)
    notifications.py   # webhook + email alerts on regression
    templates.py       # 25+ safety/jailbreak test templates
    writer.py          # async write queue for production use
    cli.py             # typer CLI (all commands)
    storage/
        __init__.py    # get_storage() factory, mode-aware
        base.py        # BaseStorage ABC (interface for custom backends)
        sqlite.py      # SQLite implementation (WAL mode, 4 tables)
tests/
    test_assertions.py      # 18 tests
    test_cli.py             # 12 tests
    test_drift.py           # 8 tests
    test_evaluator.py       # 5 tests
    test_notifications.py   # 7 tests
    test_registry.py        # 10 tests
    test_runner.py          # 9 tests
    test_scheduler.py       # 6 tests
    test_storage.py         # 16 tests
    test_templates.py       # 8 tests
    test_writer.py          # 2 tests

Data flow

The two main flows in promptry: tracking prompts and running evals.

track() flow

your code calls: track("You are a helpful...", "rag-qa")
            |
     check storage mode
      /     |      \
   "off"  "sync"  "async"
     |      |        |
  return    |        |
 unchanged  v        v
      SHA-256 hash content
            |
    in-memory cache hit?
         /      \
       yes       no
        |         |
     return       v
    unchanged   SQLiteStorage.save_prompt()
                (dedup: UNIQUE(name, hash))
                  |
           return string unchanged

track() returns the exact same string you passed in. The versioning is invisible. In production with mode = "async", the storage is wrapped in AsyncWriter. Prompt saves are still synchronous (callers need version numbers), but lighter writes like eval results go through the background thread.

Eval suite flow

promptry run my-suite --module evals --compare prod
    |
importlib.import_module("evals")
    (triggers @suite decorator registration)
    |
run_suite("my-suite")
    |
_execute_test(suite_fn)
    with run_context() as results:
        suite_fn()   # calls assert_semantic, assert_llm, etc.
    each assertion appends to thread-local results
    |
compute overall_score = mean(assertion scores)
store run + results to SQLite
    |
compare_with_baseline("prod")
    find baseline run, diff scores, generate root cause hints
    |
exit 0 (pass) or exit 1 (regression)

config.py

Loads configuration from multiple sources in priority order: built-in defaults, promptry.toml in the current directory, ~/.promptry/config.toml, then environment variables. Config is loaded once on first access via get_config() and cached for the process lifetime. Call reset_config() to force a reload.

Env vars: PROMPTRY_DB, PROMPTRY_STORAGE_MODE, PROMPTRY_EMBEDDING_MODEL, PROMPTRY_SEMANTIC_THRESHOLD, PROMPTRY_WEBHOOK_URL, PROMPTRY_SMTP_PASSWORD.
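The priority order can be sketched as a layered merge where later sources override earlier ones. The keys, defaults, and TOML values below are illustrative, not promptry's actual config schema:

```python
# Sketch of the config priority order: defaults < promptry.toml < env vars.
DEFAULTS = {"storage_mode": "sync", "db": "promptry.db"}

def load_config(toml_values: dict, env: dict) -> dict:
    cfg = dict(DEFAULTS)
    cfg.update(toml_values)                    # TOML overrides defaults
    if "PROMPTRY_STORAGE_MODE" in env:         # env vars override everything
        cfg["storage_mode"] = env["PROMPTRY_STORAGE_MODE"]
    if "PROMPTRY_DB" in env:
        cfg["db"] = env["PROMPTRY_DB"]
    return cfg

cfg = load_config({"storage_mode": "async"}, {"PROMPTRY_DB": "/tmp/evals.db"})
```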

models.py

All data classes live here to avoid circular imports. Nothing in this file imports from other promptry modules. If you need to add a new data shape that gets passed between modules, put it here.

| Class | Used by | Purpose |
| --- | --- | --- |
| PromptRecord | storage, registry | A saved prompt version with id, name, version, hash, tags |
| SuiteResult | runner, comparison, notifications | Complete result of running a suite |
| TestResult | runner | One test function's result (assertions + timing) |
| ComparisonResult | comparison | Single metric comparison (baseline vs current) |
| RootCauseHint | comparison | Why a regression probably happened |
| DriftReport | drift | Score trend analysis over a window of runs |

storage/

BaseStorage is an abstract class defining the interface. SQLiteStorage is the default implementation. The get_storage() factory checks the config mode and wraps the storage in AsyncWriter if mode = "async".

The SQLite schema has four tables: prompts, prompt_tags, eval_runs, and eval_results (full DDL in the SQLite schema section at the end of this document).

To add a custom backend, subclass BaseStorage and implement all abstract methods.

registry.py

track() is the primary integration point. It hashes content with SHA-256, checks an in-memory cache (bounded at 10k entries), and writes to storage only on new content. The cache means repeated calls with the same prompt are zero-cost.

track_context() does the same for retrieval chunks, with optional sampling (sample_rate) to reduce write volume in production.
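The sampling idea is simple: persist only a random fraction of chunks. This sketch shows the mechanism; the exact semantics of sample_rate in promptry are assumed, not documented here:

```python
import random

# Illustrative write sampling: persist a chunk only with probability
# sample_rate. A seeded Random makes the demo deterministic.
def should_write(sample_rate: float, rng: random.Random) -> bool:
    return rng.random() < sample_rate

rng = random.Random(0)
writes = sum(should_write(0.1, rng) for _ in range(10_000))  # ~1,000 of 10,000
```

At sample_rate=0.1 roughly 90% of chunk writes are skipped, which keeps production write volume proportional to the sampling rate rather than to traffic.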

evaluator.py

The @suite decorator registers test functions in a global dict. run_context() sets up a thread-local list where assertions append their results. This keeps the user API clean: just call assert_semantic() inside your function, and the runner collects everything automatically. Duplicate suite names emit a warning.
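A minimal model of the thread-local collection pattern makes the mechanism concrete. This mirrors the names in the text (run_context, results) but is a sketch, not promptry's code; assert_min_length is an invented assertion:

```python
import threading
from contextlib import contextmanager

_local = threading.local()

@contextmanager
def run_context():
    # Install a fresh per-thread result list; assertions find it implicitly.
    _local.results = []
    try:
        yield _local.results
    finally:
        del _local.results

def assert_min_length(text: str, n: int) -> float:
    score = 1.0 if len(text) >= n else 0.0
    _local.results.append(("min_length", score))   # collected automatically
    if score == 0.0:
        raise AssertionError(f"expected >= {n} chars")
    return score

with run_context() as results:
    assert_min_length("hello world", 5)
```

Because the list lives in thread-local storage, the user never passes a collector around, and suites running on different threads cannot clobber each other's results.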

assertions.py

Each assertion function follows the same pattern: evaluate the condition, append an AssertionResult to the current run context, raise AssertionError on failure, return the score.

| Assertion | What it checks | Cost |
| --- | --- | --- |
| assert_schema | Data matches a Pydantic model | Free |
| assert_semantic | Cosine similarity of embeddings | ~50ms |
| assert_llm | LLM grades response against criteria | ~1-5s (one LLM call per assertion) |

assert_semantic lazy-loads the embedding model on first use. The model name and similarity threshold are read from config, overridable via set_model().

assert_llm requires a judge function set via set_judge(fn). The function takes a string and returns a string. Works with any LLM provider.
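Since the contract is just "string in, string out", wiring in a judge is a one-liner. The grading prompt and verdict format below are invented for illustration; only the set_judge(fn) hook comes from the text:

```python
# A hypothetical judge. In real use the body would call your LLM provider
# (OpenAI, Anthropic, a local model, ...); here it is stubbed.
def my_judge(prompt: str) -> str:
    return "PASS: response satisfies the criteria"

# promptry.assertions.set_judge(my_judge)   # hook it in, per the docs above
verdict = my_judge("Grade this response against: answers in French")
```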

runner.py

Executes a suite by calling the registered function inside a run_context(), catches assertion failures, computes the overall score (mean of assertion scores), and stores everything to the database.

comparison.py

Compares a current run against a baseline. It first looks for a run matching the baseline tag (e.g., "prod") and falls back to the most recent previous run. It then diffs the per-metric scores and generates root cause hints (RootCauseHint) describing why a regression probably happened.

drift.py

Fits a linear regression (OLS, no numpy) to the last N scores. If the slope is steeper than -threshold, the suite is flagged as drifting. Catches slow degradation that single-run comparisons miss.
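The dependency-free OLS fit is a few lines of arithmetic. This sketch implements the standard slope formula; the threshold value is illustrative, not promptry's default:

```python
# Pure-Python OLS slope over the last N scores, matching the "no numpy"
# approach described above. x is just the run index 0..n-1.
def slope(scores: list[float]) -> float:
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

scores = [0.92, 0.90, 0.89, 0.87, 0.85]   # slow degradation across 5 runs
drifting = slope(scores) < -0.01           # flag if steeper than -threshold
```

Each run here loses only a point or two, so no single baseline comparison would fire, but the fitted slope of about -0.017 per run crosses the threshold.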

scheduler.py

Spawns a subprocess that loops: import module, run suite, check drift, notify on regression, sleep, repeat. State is managed via PID files in ~/.promptry/.

Cross-platform: on Windows uses CREATE_NO_WINDOW and ctypes.OpenProcess for PID checking. On Unix uses start_new_session and os.kill(pid, 0).
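The Unix side of the liveness check can be sketched directly; signal 0 delivers nothing but raises if the PID is gone. (The Windows branch with ctypes.OpenProcess is omitted here.)

```python
import os

# Unix PID liveness check: os.kill(pid, 0) performs permission/existence
# checks without sending a signal.
def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # exists, but owned by another user
    return True

alive = pid_alive(os.getpid())   # our own PID is always alive
```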

notifications.py

Sends alerts when the monitor detects a regression or drift. Two channels: webhooks (configured via PROMPTRY_WEBHOOK_URL) and email over SMTP (password via PROMPTRY_SMTP_PASSWORD).

Failures are logged but never crash the monitor.

templates.py

25+ built-in attack prompts across 6 categories: prompt injection, jailbreak, PII leakage, hallucination triggers, context boundary, encoding/obfuscation. Custom templates loaded from templates.toml or [[custom_templates]] in promptry.toml. Merged with built-ins automatically.

writer.py

AsyncWriter wraps any BaseStorage with a background thread. save_prompt and save_eval_run are synchronous (callers need the returned IDs). Lighter writes (save_eval_result, tag_prompt) are queued. Uses queue.Queue with a configurable max size (default 10k).
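The pattern is a single consumer thread draining a bounded queue. This is a condensed sketch of that design, not promptry's code; ListStorage is a stand-in backend and close() is an invented shutdown hook:

```python
import queue
import threading

class AsyncWriter:
    """Wraps a storage backend; light writes are queued, drained by one thread."""

    def __init__(self, storage, max_size: int = 10_000):   # 10k = stated default
        self._storage = storage
        self._q: queue.Queue = queue.Queue(maxsize=max_size)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def save_eval_result(self, row: dict) -> None:
        self._q.put((self._storage.save_eval_result, row))  # enqueue, don't block on I/O

    def close(self) -> None:
        self._q.put(None)            # sentinel stops the worker
        self._worker.join()

    def _drain(self) -> None:
        while (item := self._q.get()) is not None:
            fn, arg = item
            fn(arg)                  # real write happens off the caller's thread

class ListStorage:                   # stand-in backend for the demo
    def __init__(self):
        self.rows = []
    def save_eval_result(self, row):
        self.rows.append(row)

backend = ListStorage()
writer = AsyncWriter(backend)
writer.save_eval_result({"test_name": "t1", "score": 0.9})
writer.close()                       # flushes the queue before returning
```

The bounded queue gives natural backpressure: if the writer thread falls behind, put() blocks rather than letting memory grow without limit.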

cli.py

Built with Typer. All commands return exit code 0 on success, 1 on regression/failure. CI-friendly: promptry run my-suite --module evals fails the pipeline if scores regressed.

How to modify

Adding a new assertion type

  1. Add your function to assertions.py. Follow the existing pattern: evaluate, call append_result(), raise on failure, return the score.
  2. Export it from __init__.py.
  3. Add the assertion type string to the loop in comparison.py so baseline comparisons track it.
  4. Write tests in test_assertions.py.

Adding a new storage backend

  1. Subclass BaseStorage from storage/base.py.
  2. Your save_prompt must handle dedup (same name + same hash = return existing record).
  3. Your save_prompt must auto-increment version numbers per prompt name.
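The two save_prompt contracts above can be sketched with an in-memory backend. The record shape and method signature here are assumptions for illustration, not the BaseStorage interface:

```python
import hashlib

class MemoryStorage:
    """Toy backend showing dedup on (name, hash) and per-name version increments."""

    def __init__(self):
        self._by_key: dict[tuple[str, str], dict] = {}
        self._versions: dict[str, int] = {}

    def save_prompt(self, name: str, content: str) -> dict:
        digest = hashlib.sha256(content.encode()).hexdigest()
        key = (name, digest)
        if key in self._by_key:            # same name + same hash
            return self._by_key[key]       # -> return the existing record
        version = self._versions.get(name, 0) + 1   # auto-increment per name
        self._versions[name] = version
        record = {"name": name, "version": version, "hash": digest}
        self._by_key[key] = record
        return record

s = MemoryStorage()
v1 = s.save_prompt("rag-qa", "You are a helpful assistant.")
again = s.save_prompt("rag-qa", "You are a helpful assistant.")  # dedup hit
v2 = s.save_prompt("rag-qa", "You are a terse assistant.")       # new version
```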

Adding a new safety template category

  1. Add SafetyTemplate instances to _TEMPLATES in templates.py. Follow the ID convention: category-NNN.
  2. Or add them to templates.toml in your project.

Adding a new CLI command

  1. Use @app.command() for top-level, or create a sub-app with typer.Typer().
  2. Use raise typer.Exit(code=1) for failures so CI gets the right exit code.

Adding a new config section

  1. Add a dataclass to config.py.
  2. Add it as a field on the Config class.
  3. Handle it in _apply_toml() and optionally _apply_env_overrides().
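The three steps might come together like this for a hypothetical [retry] section. The section name, fields, and env var are invented; _apply_toml/_apply_env_overrides are simplified stand-ins for the real functions:

```python
from dataclasses import dataclass, field

@dataclass
class RetryConfig:                       # step 1: the new section's dataclass
    max_attempts: int = 3
    backoff_seconds: float = 1.0

@dataclass
class Config:
    retry: RetryConfig = field(default_factory=RetryConfig)   # step 2: new field

def _apply_toml(cfg: Config, data: dict) -> None:             # step 3: TOML hook
    for k, v in data.get("retry", {}).items():
        setattr(cfg.retry, k, v)

def _apply_env_overrides(cfg: Config, env: dict) -> None:     # step 3: env hook
    if "PROMPTRY_RETRY_MAX_ATTEMPTS" in env:
        cfg.retry.max_attempts = int(env["PROMPTRY_RETRY_MAX_ATTEMPTS"])

cfg = Config()
_apply_toml(cfg, {"retry": {"backoff_seconds": 2.5}})
_apply_env_overrides(cfg, {"PROMPTRY_RETRY_MAX_ATTEMPTS": "5"})
```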

Design decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Primary API | track() | One-line integration, zero refactoring |
| Storage | Raw sqlite3 | Simple schema, stdlib, no deps |
| Config | TOML | Python standard (tomllib in 3.11+) |
| Data models | dataclasses | Pydantic only for user-facing assert_schema |
| Assertion collection | Thread-local context | Clean user API, no wiring needed |
| Embedding model | Lazy-loaded on first use | No startup cost, downloads once (~80MB) |
| Trend detection | Manual OLS regression | Avoids numpy dependency |
| Background daemon | subprocess + PID file | Cross-platform, no extra deps |
| LLM judge | User-provided callable | Provider-agnostic, no openai dep |
| Safety templates | Data-driven list | Easy to extend, data not logic |

SQLite schema

CREATE TABLE prompts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    metadata TEXT,                              -- JSON
    created_at TEXT DEFAULT (datetime('now')),
    UNIQUE(name, version),
    UNIQUE(name, hash)                         -- dedup
);

CREATE TABLE prompt_tags (
    prompt_id INTEGER REFERENCES prompts(id),
    tag TEXT NOT NULL,
    UNIQUE(prompt_id, tag)
);

CREATE TABLE eval_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    suite_name TEXT NOT NULL,
    prompt_name TEXT,
    prompt_version INTEGER,
    model_version TEXT,
    timestamp TEXT DEFAULT (datetime('now')),
    overall_pass INTEGER DEFAULT 1,
    overall_score REAL
);

CREATE TABLE eval_results (
    run_id INTEGER REFERENCES eval_runs(id),
    test_name TEXT NOT NULL,
    assertion_type TEXT NOT NULL,
    passed INTEGER NOT NULL,
    score REAL,
    details TEXT,                               -- JSON
    latency_ms REAL
);