Architecture
How promptry is built, what each component does, and how to approach modifications.
Project structure
Data flow
The two main flows in promptry: tracking prompts and running evals.
track() flow
track() returns the exact same string you passed in. The versioning is invisible. In production with mode = "async", the storage is wrapped in AsyncWriter. Prompt saves are still synchronous (callers need version numbers), but lighter writes like eval results go through the background thread.
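A minimal sketch of what this looks like at a call site. The `track(name, content)` signature and the stub body are assumptions for illustration; the real function also hashes the content and records a new version as a side effect.

```python
def track(name: str, content: str) -> str:
    # Stand-in for promptry's track(): the real function hashes `content`,
    # records a new version when the hash is unseen, then returns the input.
    return content

def build_prompt(question: str) -> str:
    # The wrapped string is byte-for-byte what was passed in, so track()
    # can be dropped into existing prompt-building code without refactoring.
    template = track("qa-prompt", "Answer concisely.\n\nQ: {q}")
    return template.format(q=question)
```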
Eval suite flow
config.py
Loads configuration by layering sources, each overriding the last: built-in defaults, promptry.toml in the current directory, ~/.promptry/config.toml, then environment variables. Config is loaded once on first access via get_config() and cached for the process lifetime. Call reset_config() to force a reload.
Env vars: PROMPTRY_DB, PROMPTRY_STORAGE_MODE, PROMPTRY_EMBEDDING_MODEL, PROMPTRY_SEMANTIC_THRESHOLD, PROMPTRY_WEBHOOK_URL, PROMPTRY_SMTP_PASSWORD.
models.py
All data classes live here to avoid circular imports. Nothing in this file imports from other promptry modules. If you need to add a new data shape that gets passed between modules, put it here.
| Class | Used by | Purpose |
|---|---|---|
| PromptRecord | storage, registry | A saved prompt version with id, name, version, hash, tags |
| SuiteResult | runner, comparison, notifications | Complete result of running a suite |
| TestResult | runner | One test function's result (assertions + timing) |
| ComparisonResult | comparison | Single metric comparison (baseline vs current) |
| RootCauseHint | comparison | Why a regression probably happened |
| DriftReport | drift | Score trend analysis over a window of runs |
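As an example of the pattern, PromptRecord might look like the sketch below. The field list is inferred from the table above and the SQLite schema later in this document; the real definition may differ in detail.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    # Fields inferred from the table above and the `prompts` table schema;
    # illustrative only.
    id: int
    name: str
    version: int
    content: str
    hash: str
    tags: list[str] = field(default_factory=list)
```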
storage/
BaseStorage is an abstract class defining the interface. SQLiteStorage is the default implementation. The get_storage() factory checks the config mode and wraps the storage in AsyncWriter if mode = "async".
SQLite schema has four tables:
- prompts — versioned prompt content with SHA-256 dedup (UNIQUE(name, hash))
- prompt_tags — tags (like "prod", "canary") attached to prompt versions
- eval_runs — suite execution records with overall score and pass/fail
- eval_results — individual assertion results within a run
To add a custom backend, subclass BaseStorage and implement all abstract methods.
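A sketch of a custom backend honoring the dedup and auto-increment rules described in "How to modify" below. The abstract method name `save_prompt` appears in this document, but the simplified base class and record shape here are assumptions.

```python
from abc import ABC, abstractmethod

class BaseStorage(ABC):
    # Simplified stand-in for storage/base.py; the real interface has more methods.
    @abstractmethod
    def save_prompt(self, name: str, content: str, hash: str): ...

class InMemoryStorage(BaseStorage):
    """Example custom backend: dedup on (name, hash), auto-increment per name."""
    def __init__(self):
        self._by_hash = {}     # (name, hash) -> record
        self._versions = {}    # name -> latest version number

    def save_prompt(self, name, content, hash):
        key = (name, hash)
        if key in self._by_hash:            # same name + same hash: return existing
            return self._by_hash[key]
        version = self._versions.get(name, 0) + 1   # per-name auto-increment
        self._versions[name] = version
        record = {"name": name, "version": version, "content": content, "hash": hash}
        self._by_hash[key] = record
        return record
```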
registry.py
track() is the primary integration point. It hashes content with SHA-256, checks an in-memory cache (bounded at 10k entries), and writes to storage only on new content. The cache means repeated calls with the same prompt are zero-cost.
track_context() does the same for retrieval chunks, with optional sampling (sample_rate) to reduce write volume in production.
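The hash-then-cache check can be sketched as below. The eviction policy (drop oldest) is an assumption; the document only states that the cache is bounded at 10k entries.

```python
import hashlib

_CACHE_MAX = 10_000
_seen: dict[str, None] = {}   # insertion-ordered dict standing in for the bounded cache

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def should_write(name: str, content: str) -> bool:
    """Only unseen (name, hash) pairs reach storage; repeats are zero-cost."""
    key = f"{name}:{content_hash(content)}"
    if key in _seen:
        return False
    if len(_seen) >= _CACHE_MAX:       # keep memory bounded at ~10k entries
        _seen.pop(next(iter(_seen)))   # evict the oldest entry (assumed policy)
    _seen[key] = None
    return True
```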
evaluator.py
The @suite decorator registers test functions in a global dict. run_context() sets up a thread-local list where assertions append their results. This keeps the user API clean: just call assert_semantic() inside your function, and the runner collects everything automatically. Duplicate suite names emit a warning.
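The thread-local collection mechanism might look like this sketch (exact names and cleanup behavior are assumptions):

```python
import threading
from contextlib import contextmanager

_local = threading.local()

@contextmanager
def run_context():
    # Each thread gets its own result list, so concurrent runs don't interleave.
    _local.results = []
    try:
        yield _local.results
    finally:
        _local.results = []  # fresh list; the yielded one keeps its contents

def append_result(result: dict) -> None:
    # Assertions call this; nothing needs to be passed through user code.
    _local.results.append(result)
```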
assertions.py
Each assertion function follows the same pattern: evaluate the condition, append an AssertionResult to the current run context, raise AssertionError on failure, return the score.
| Assertion | What it checks | Cost |
|---|---|---|
| assert_schema | Data matches a Pydantic model | Free |
| assert_semantic | Cosine similarity of embeddings | ~50ms |
| assert_llm | LLM grades response against criteria | ~1-5s (one LLM call per assertion) |
assert_semantic lazy-loads the embedding model on first use. The model name and similarity threshold are read from config, overridable via set_model().
assert_llm requires a judge function set via set_judge(fn). The function takes a string and returns a string. Works with any LLM provider.
runner.py
Executes a suite by calling the registered function inside a run_context(), catches assertion failures, computes the overall score (mean of assertion scores), and stores everything to the database.
comparison.py
Compares a current run against a baseline. First tries to find a run matching the baseline tag (e.g., "prod"), falls back to the most recent previous run. Generates root cause hints:
- Prompt version changed → "Prompt changed (v3 → v4)"
- Model version changed → "Model changed (gpt-4 → gpt-4o)"
- Neither changed but scores dropped → "Possible retrieval drift"
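The hint rules above can be sketched as a small function; the dict keys and return type here are illustrative stand-ins for the real RootCauseHint objects.

```python
def root_cause_hints(baseline: dict, current: dict) -> list[str]:
    """Apply the three hint rules in order; keys are illustrative."""
    hints = []
    if baseline["prompt_version"] != current["prompt_version"]:
        hints.append(f"Prompt changed (v{baseline['prompt_version']} → v{current['prompt_version']})")
    if baseline["model_version"] != current["model_version"]:
        hints.append(f"Model changed ({baseline['model_version']} → {current['model_version']})")
    if not hints and current["score"] < baseline["score"]:
        hints.append("Possible retrieval drift")  # neither input changed, yet scores fell
    return hints
```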
drift.py
Fits a linear regression (OLS, no numpy) to the last N scores. If the slope is steeper than -threshold, the suite is flagged as drifting. Catches slow degradation that single-run comparisons miss.
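A numpy-free OLS slope over run index is a few lines of stdlib Python. This is a sketch of the technique described, not the module's actual code; the default threshold is illustrative.

```python
def ols_slope(scores: list[float]) -> float:
    """Least-squares slope of scores regressed on run index (0, 1, 2, ...)."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0

def is_drifting(scores: list[float], threshold: float = 0.01) -> bool:
    # Flag when scores decline faster than `threshold` per run.
    return ols_slope(scores) < -threshold
```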
scheduler.py
Spawns a subprocess that loops: import module, run suite, check drift, notify on regression, sleep, repeat. State is managed via PID files in ~/.promptry/.
Cross-platform: on Windows uses CREATE_NO_WINDOW and ctypes.OpenProcess for PID checking. On Unix uses start_new_session and os.kill(pid, 0).
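The liveness check might be structured like this sketch; the Windows access-rights constant and error handling are assumptions beyond what the document states.

```python
import os
import sys

def pid_alive(pid: int) -> bool:
    """Cross-platform check that a PID refers to a running process."""
    if sys.platform == "win32":
        import ctypes
        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000  # assumed access right
        handle = ctypes.windll.kernel32.OpenProcess(
            PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if handle:
            ctypes.windll.kernel32.CloseHandle(handle)
            return True
        return False
    try:
        os.kill(pid, 0)   # signal 0 checks existence without sending a signal
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # process exists but belongs to another user
```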
notifications.py
Sends alerts when the monitor detects a regression or drift. Two channels:
- Webhooks: POSTs JSON to any URL (Slack, Discord, custom). URL validated to be http/https.
- Email: SMTP with TLS. Credentials should use env vars (PROMPTRY_SMTP_PASSWORD).
Failures are logged but never crash the monitor.
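A sketch of the webhook path using only the stdlib; the payload fields are illustrative and the real module presumably logs rather than silently passing.

```python
import json
import urllib.request
from urllib.parse import urlparse

def build_alert(suite: str, score: float, baseline: float) -> bytes:
    # Payload shape is an assumption; real field names may differ.
    return json.dumps({"suite": suite, "score": score, "baseline": baseline}).encode()

def post_webhook(url: str, payload: bytes) -> None:
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("webhook URL must be http or https")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError:
        pass  # failures are logged in the real module, never crash the monitor
```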
templates.py
25+ built-in attack prompts across 6 categories: prompt injection, jailbreak, PII leakage, hallucination triggers, context boundary, encoding/obfuscation. Custom templates loaded from templates.toml or [[custom_templates]] in promptry.toml. Merged with built-ins automatically.
writer.py
AsyncWriter wraps any BaseStorage with a background thread. save_prompt and save_eval_run are synchronous (callers need the returned IDs). Lighter writes (save_eval_result, tag_prompt) are queued. Uses queue.Queue with a configurable max size (default 10k).
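The split between synchronous and queued writes can be sketched as below; the queue item shape and drain loop are assumptions, and the real class wraps more methods.

```python
import queue
import threading

class AsyncWriter:
    """Sketch: heavy writes stay synchronous, light writes are queued."""
    def __init__(self, storage, max_queue: int = 10_000):
        self._storage = storage
        self._queue: queue.Queue = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def save_prompt(self, *args, **kwargs):
        # Synchronous: callers need the returned version/ID immediately.
        return self._storage.save_prompt(*args, **kwargs)

    def save_eval_result(self, *args, **kwargs):
        # Fire-and-forget: executed later on the background thread.
        self._queue.put(("save_eval_result", args, kwargs))

    def _drain(self):
        while True:
            method, args, kwargs = self._queue.get()
            getattr(self._storage, method)(*args, **kwargs)
            self._queue.task_done()
```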
cli.py
Built with Typer. All commands return exit code 0 on success, 1 on regression/failure. CI-friendly: promptry run my-suite --module evals fails the pipeline if scores regressed.
How to modify
Adding a new assertion type
- Add your function to assertions.py. Follow the existing pattern: evaluate, call append_result(), raise on failure, return the score.
- Export it from __init__.py.
- Add the assertion type string to the loop in comparison.py so baseline comparisons track it.
- Write tests in test_assertions.py.
Adding a new storage backend
- Subclass BaseStorage from storage/base.py.
- Your save_prompt must handle dedup (same name + same hash = return existing record).
- Your save_prompt must auto-increment version numbers per prompt name.
Adding a new safety template category
- Add SafetyTemplate instances to _TEMPLATES in templates.py. Follow the ID convention: category-NNN.
- Or add them to templates.toml in your project.
Adding a new CLI command
- Use @app.command() for top-level commands, or create a sub-app with typer.Typer().
- Use raise typer.Exit(code=1) for failures so CI gets the right exit code.
Adding a new config section
- Add a dataclass to config.py.
- Add it as a field on the Config class.
- Handle it in _apply_toml() and optionally _apply_env_overrides().
Design decisions
| Decision | Choice | Why |
|---|---|---|
| Primary API | track() | One-line integration, zero refactoring |
| Storage | Raw sqlite3 | Simple schema, stdlib, no deps |
| Config | TOML | Python standard (tomllib in 3.11+) |
| Data models | dataclasses | Pydantic only for user-facing assert_schema |
| Assertion collection | Thread-local context | Clean user API, no wiring needed |
| Embedding model | Lazy-loaded on first use | No startup cost, downloads once (~80MB) |
| Trend detection | Manual OLS regression | Avoids numpy dependency |
| Background daemon | subprocess + PID file | Cross-platform, no extra deps |
| LLM judge | User-provided callable | Provider-agnostic, no openai dep |
| Safety templates | Data-driven list | Easy to extend, data not logic |
SQLite schema
```sql
CREATE TABLE prompts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    hash TEXT NOT NULL,
    metadata TEXT, -- JSON
    created_at TEXT DEFAULT (datetime('now')),
    UNIQUE(name, version),
    UNIQUE(name, hash) -- dedup
);
CREATE TABLE prompt_tags (
    prompt_id INTEGER REFERENCES prompts(id),
    tag TEXT NOT NULL,
    UNIQUE(prompt_id, tag)
);
CREATE TABLE eval_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    suite_name TEXT NOT NULL,
    prompt_name TEXT,
    prompt_version INTEGER,
    model_version TEXT,
    timestamp TEXT DEFAULT (datetime('now')),
    overall_pass INTEGER DEFAULT 1,
    overall_score REAL
);
CREATE TABLE eval_results (
    run_id INTEGER REFERENCES eval_runs(id),
    test_name TEXT NOT NULL,
    assertion_type TEXT NOT NULL,
    passed INTEGER NOT NULL,
    score REAL,
    details TEXT, -- JSON
    latency_ms REAL
);
```