ENGINEERING STANDARDS
> cat ./standards/engineering_philosophy.md
This is the living document that governs how I build software. Every project I work on starts with these principles. They aren't theoretical — they're the same standards applied to this site, enforced in CI, and visible in every commit.
- > API contracts are sacred. Define them first, implement second.
- > Simplicity wins. If it's hard to explain, it's probably wrong.
- > Separation of concerns is non-negotiable. Business logic never leaks into transport layers. Data models never bleed across bounded contexts.
- > Documentation is part of the deliverable, not an afterthought.
- > Design for the interface, not the implementation.
- > Tests come first. Always. No exceptions.
- > If it isn't tested, it doesn't exist.
- > Tests are first-class citizens. Same review standard, same CI gate as production code.
WHAT I ACTUALLY TEST:
- > Unit against port traits with in-memory fakes — fast, deterministic, runs on every save.
- > Integration against real databases, not mocks. Mock/prod divergence hides real bugs.
- > Contract at API boundaries — provider and consumer agree, explicitly.
- > E2E for critical user flows only — not every path. Expensive tests are a tax.
- > Data validation tests are a separate discipline — they catch different bugs than code tests, and I count them separately. See [14].
> see also: [14] Data Quality · [18] Code Review
- > Every piece of knowledge has one authoritative source.
- > DRY applies to: code, configuration, documentation, test fixtures, data schemas, and API definitions.
- > Caveat: DRY does not mean premature abstraction. Two instances may be coincidence. Three is a pattern. Abstract on three.
- > Access pattern drives schema design. Write-heavy: normalize (3NF/BCNF). Read-heavy: denormalize (Kimball).
- > Kimball Dimensional Modeling is the default for analytical models. Star schema baseline, snowflake when hierarchy depth justifies join cost.
- > OLTP and OLAP are separate concerns. Separate schemas, separate roles, separate access patterns.
- > Schema changes are versioned migrations, never manual edits. Data contracts apply — downstream consumers must not break from upstream changes.
DBT CONVENTIONS:
- > Layered modeling:
staging→intermediate/marts→analytics. Each layer has a contract with the next. - > Naming:
stg_,int_,fct_,dim_,mart_prefixes. Prefix tells you where it lives and what it is. - > Minimum test coverage per model: not_null + unique on primary keys, relationships on foreign keys, accepted_values on categoricals. Anything less means the model isn't finished.
- > Custom macros for repeated logic. Examples from this project:
hash_pii(),suppress_small_groups(). Write it once. - > dbt docs generated after every schema change. Lineage graph is institutional memory.
> see also: [13] SQL Style · [14] Data Quality
- > Security is phase 0, not phase 2.
- > Post-Quantum Cryptography readiness is a first-class concern. Target CRYSTALS-Kyber (ML-KEM), CRYSTALS-Dilithium (ML-DSA), SHA-3/SHAKE.
- > Crypto agility is mandatory. No algorithm hardcoded. Cipher suites, key sizes, and identifiers must be configurable and swappable.
- > Security scanning is infrastructure: SAST, SCA, secret scanning, container scanning — on every commit.
CVE RESPONSE SLA:
- > Pin versions exactly. No floating specifiers. Lock files are committed and treated as source artifacts.
- > Every new dependency is a liability. Justify it. Document it. Audit regularly for staleness, abandonment, and CVEs.
- > Circular dependencies are bugs. Detect in CI, fail the build.
- > Open-source, self-hostable solutions are the default.
- > No runtime dependency on big-tech hosted services. Their OSS projects evaluated on technical merit only.
- > Self-deployment capability is mandatory. Docker Compose for local dev, container-orchestration-ready for production.
CURRENT DEFAULTS (snapshot, not a law — these evolve):
- > Infra: Docker Compose · Traefik · Cloudflare DNS/DDoS · Hetzner VPS
- > Data: PostgreSQL · dbt Core · Metabase OSS · Streamlit
- > Languages: Rust by default; Python when the ecosystem demands it; TypeScript for frontend
- > Auth: self-hosted Keycloak (not Auth0/Clerk/etc.)
- > Rendering: Astro (static) · self-hosted fonts (no Google CDN)
- > Algorithmic complexity is a design concern, not an optimization concern. Always know the complexity.
- > O(1) > O(log n) > O(n) > O(n log n) > O(n²). Verify no better alternative exists before accepting worse complexity.
- > Premature optimization is still a sin. Design before profiling. But flag O(n²) or worse.
- > Readable by a stranger in 6 months. Functions do one thing.
- > No magic numbers or strings — named constants with clear origin.
- > Error handling is explicit. Silent failures are bugs.
INLINE ANNOTATION TAGS:
Write the subject line in the imperative mood. Imagine the phrase completes: "If applied, this commit will..."
EXAMPLES:
Backend: feat(api): add patient redaction endpoint
Data: fix(etl): resolve timestamp anomaly in incremental load
Frontend: refactor(ui): extract vaporwave button into reusable component
- > READMEs evolve with the code, not written after. Required: quickstart, config reference, architecture overview, testing, deployment, troubleshooting.
- > The Minimal Working Example is non-negotiable. Runnable in under 10 minutes using only the README.
- > Bugs that consume 1+ hour of debugging get documented. Format: Date, Environment, Symptoms, Root Cause, Fix, Prevention.
- > Three or more similar issues = architectural smell. Escalate to an Architecture Decision Record.
FUNDAMENTALS:
PATTERNS I USE REPEATEDLY (from my own projects):
- > Dual storage backends — identical port trait, swap at startup. Example: Postgres ↔ flat JSON in ResumeForge. Lets the same code ship in a database-present and database-absent form.
- > Crypto-agile provider interfaces — a single
CryptoProvidertrait, multiple implementations. Example: Signal-Lens, so adding post-quantum primitives is a new implementation, not a rewrite. - > Independent crate/module boundaries with clean API contracts. Example: Nexus's 9-crate architecture — each crate does one thing, contract is its public types.
- > Monorepo for development, independent publication for consumption. Example: CivicLens/Sentinel — develop together for velocity, ship separately so each crate can be consumed without the rest.
- > All lowercase keywords. SQL reads as prose, not shouting.
- > Comma-first field lists. Enables running queries from the end without selecting all — cursor-friendly debugging.
- > No blank lines between CTEs. Lets cursor-based execution target exactly one CTE at a time.
- > CTEs over nested subqueries. Every named CTE is a unit of thought. Easier to read, easier to refactor, easier to debug.
- > Explicit column lists. Never
SELECT *in production — schema drift silently breaks downstream. - > snake_case for all identifiers. No ambiguity, no case-sensitivity surprises.
- > Trailing commas on the last field. Cleaner diffs when adding or removing columns.
- > Table aliases always explicit, 1-3 chars. Every column reference names its table.
- > Explicit JOIN type (
INNER JOIN,LEFT JOIN), never bareJOIN— forces semantic intent.
> see also: [04] Data Modeling · [14] Data Quality
- > Pipelines get validation tests, not just code tests. A pipeline that runs green but produces wrong data is still broken.
- > Edge cases and failure paths tested explicitly. What happens on empty input? Malformed rows? Duplicate keys? Late-arriving data?
- > Data contracts between upstream sources and downstream consumers. Breaks are detected at the boundary before they propagate.
- > Monitoring and alerting for pipeline health. Run duration, row counts, freshness, error rates — all tracked.
- > Test counts are minimum coverage, not a target. "900+ tests" isn't the brag — "every boundary and edge case covered" is.
- > Data tests and code tests counted separately. They catch different bugs and belong to different disciplines. Don't blend the numbers.
- > Test at the layer boundary: staging gets source-freshness and row-count, marts get business-logic, analytics gets k-anonymity and small-group suppression.
> see also: [02] TDD · [04] Data Modeling · [17] Observability
- > PII stripped before any outbound API call. If it's personal, it doesn't leave the machine without consent and without transformation.
- > User sees the exact payload in a consent dialog before send. Nothing goes over the wire that the user didn't see first.
- > Every call logged: model, prompt, response, tokens, cost, timestamp. No silent spend, full audit trail.
- > Multi-model evaluation. Don't trust a single vendor. Test the same prompt across models before baking the choice into a system.
- > Determinism boundaries are marked. Schema-validated outputs are one thing; probabilistic outputs are another. Readers of the code always know which is which.
- > Cost awareness is first-class. Per-conversation cost tracking, split input/output pricing per model, alerts on anomalous spend.
- > Fallback behavior defined for refusal, parse failure, and timeout. Real systems break at these boundaries; an LLM without a fallback is a production outage waiting to happen.
- > AI-generated content tracked with provenance. Which model, which prompt, which version, which human approved it before it shipped.
> see also: [05] Security · [16] Privacy · [17] Observability
- > Privacy is a design constraint, not a compliance checkbox. If you bolt it on at the end, you've already lost.
- > Data minimization by default. Collect the least you need to do the job. Every field is a future liability.
- > PII classified at the column level. Column-level tags drive tooling: masking, access control, auditing, retention.
- > Jurisdiction-aware consent tracking where applicable. Consent granted in one jurisdiction doesn't grant access in another.
- > Retention policies defined at schema level, not buried in application code. The schema knows when a record expires.
- > Pseudonymization over anonymization where re-linkage may be needed.
hash_pii()with a stable salt preserves joinability without exposing the original. - > Right-to-be-forgotten workflows: cascading deletes, tombstones where needed, and an audit trail of the deletion itself.
> see also: [05] Security · [15] AI/LLM
- > Structured logs with consistent fields (timestamp, level, request_id, user_id where safe). Logs are a queryable stream, not a stack of strings.
- > Log levels used intentionally. DEBUG / INFO / WARN / ERROR / FATAL each mean a specific thing. Don't INFO everything.
- > Health endpoints on every long-running service. Orchestrator needs to know if the thing is alive AND ready.
- > Metrics over logs for anything counted — request counts, latencies, error rates. Log lines for things and events; metrics for aggregates.
- > Dashboards answer questions, not collect metrics. Show the operator what they need to decide; hide everything else.
- > Alerting thresholds tuned to avoid fatigue. Every page must be actionable. A page that can't be acted on is noise training people to ignore the real ones.
- > PII never in logs. Cross-reference [16] — logs are long-lived and widely readable.
> see also: [05] Security · [14] Data Quality · [16] Privacy
- > Every non-trivial change gets review. Solo projects included — review future-you by leaving a self-review checklist in the PR.
- > Reviewers check: does it do what the commit message says? Are tests present and correct? Is error handling explicit? Does it regress anything obvious?
- > Reviewers don't check style (the linter does that) or personal preference (call it out as preference, not a block).
- > Small PRs get faster review. Batch by concern, not by day. A huge multi-concern PR is a bad PR before anyone reads it.
- > Authors respond to every comment, even if just to acknowledge. Silent dismissal is disrespectful and builds debt.
- > Approvals require all conversations resolved. An open thread is an open question; ship the answer, not the question.
- > No self-approval on shared codebases. Self-merge only after a cooling-off period and only with documented justification.
> see also: [02] TDD · [10] Conventional Commits