ENGINEERING STANDARDS

> cat ./standards/engineering_philosophy.md

This is the living document that governs how I build software. Every project I work on starts with these principles. They aren't theoretical — they're the same standards applied to this site, enforced in CI, and visible in every commit.

01 // CLEAN ARCHITECTURE & DESIGN

> API contracts are sacred. Define them first, implement second.
> Simplicity wins. If it's hard to explain, it's probably wrong.
> Separation of concerns is non-negotiable. Business logic never leaks into transport layers. Data models never bleed across bounded contexts.
> Documentation is part of the deliverable, not an afterthought.
> Design for the interface, not the implementation.

02 // TEST-DRIVEN DEVELOPMENT

> Tests come first. Always. No exceptions.
> If it isn't tested, it doesn't exist.
> Tests are first-class citizens. Same review standard, same CI gate as production code.

WHAT I ACTUALLY TEST:

> Unit against port traits with in-memory fakes — fast, deterministic, runs on every save.
> Integration against real databases, not mocks. Mock/prod divergence hides real bugs.
> Contract at API boundaries — provider and consumer agree, explicitly.
> E2E for critical user flows only — not every path. Expensive tests are a tax.
> Data validation tests are a separate discipline — they catch different bugs than code tests, and I count them separately. See [14].
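
A minimal sketch of the first bullet, unit tests against a port trait with an in-memory fake. `UserRepo`, `User`, `register`, and `InMemoryUserRepo` are hypothetical names chosen for illustration, not code from any of my projects.

```rust
use std::collections::HashMap;

/// Port: the only thing the business logic knows about storage.
pub trait UserRepo {
    fn find_by_email(&self, email: &str) -> Option<User>;
    fn insert(&mut self, user: User);
}

#[derive(Clone, Debug, PartialEq)]
pub struct User {
    pub email: String,
}

#[derive(Debug, PartialEq)]
pub enum RegisterError {
    AlreadyExists,
}

/// Business logic written against the port, not a concrete database.
pub fn register(repo: &mut dyn UserRepo, email: &str) -> Result<User, RegisterError> {
    if repo.find_by_email(email).is_some() {
        return Err(RegisterError::AlreadyExists);
    }
    let user = User { email: email.to_string() };
    repo.insert(user.clone());
    Ok(user)
}

/// In-memory fake: deterministic, no I/O, cheap enough to run on every save.
#[derive(Default)]
pub struct InMemoryUserRepo {
    users: HashMap<String, User>,
}

impl UserRepo for InMemoryUserRepo {
    fn find_by_email(&self, email: &str) -> Option<User> {
        self.users.get(email).cloned()
    }

    fn insert(&mut self, user: User) {
        self.users.insert(user.email.clone(), user);
    }
}
```

A unit test builds `InMemoryUserRepo::default()`, calls `register` twice with the same email, and expects `Ok` then `Err(RegisterError::AlreadyExists)`. The integration suite would run the same assertions against the real database adapter.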

> see also: [14] Data Quality · [18] Code Review

03 // DON'T REPEAT YOURSELF

> Every piece of knowledge has one authoritative source.
> DRY applies to: code, configuration, documentation, test fixtures, data schemas, and API definitions.
> Caveat: DRY does not mean premature abstraction. Two instances may be coincidence. Three is a pattern. Abstract on three.

04 // DATA MODELING & ARCHITECTURE

> Access pattern drives schema design. Write-heavy: normalize (3NF/BCNF). Read-heavy: denormalize (Kimball).
> Kimball Dimensional Modeling is the default for analytical models. Star schema baseline, snowflake when hierarchy depth justifies join cost.
> OLTP and OLAP are separate concerns. Separate schemas, separate roles, separate access patterns.
> Schema changes are versioned migrations, never manual edits. Data contracts apply — downstream consumers must not break from upstream changes.
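
One way the "versioned migrations, never manual edits" rule could look in code. This is a sketch under assumptions: `Migration` and `pending` are hypothetical, and a real runner would also record checksums and run each step in a transaction.

```rust
use std::collections::BTreeSet;

/// A versioned, append-only schema change.
pub struct Migration {
    pub version: u32,
    pub name: &'static str,
    pub sql: &'static str,
}

/// Returns the migrations that still need to run, in declaration order.
/// Errors if versions are not strictly increasing, which usually means
/// two branches claimed the same version number.
pub fn pending<'a>(
    all: &'a [Migration],
    applied: &BTreeSet<u32>,
) -> Result<Vec<&'a Migration>, String> {
    for pair in all.windows(2) {
        if pair[1].version <= pair[0].version {
            return Err(format!(
                "migration {} ({}) is out of order",
                pair[1].version, pair[1].name
            ));
        }
    }
    Ok(all
        .iter()
        .filter(|m| !applied.contains(&m.version))
        .collect())
}
```

The set of applied versions would come from a bookkeeping table in the target database; anything not in that set runs, in order, and nothing is ever edited in place.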

DBT CONVENTIONS:

> Layered modeling: staging → intermediate/marts → analytics. Each layer has a contract with the next.
> Naming: stg_, int_, fct_, dim_, mart_ prefixes. Prefix tells you where it lives and what it is.
> Minimum test coverage per model: not_null + unique on primary keys, relationships on foreign keys, accepted_values on categoricals. Anything less means the model isn't finished.
> Custom macros for repeated logic. Examples from this project: hash_pii(), suppress_small_groups(). Write it once.
> dbt docs generated after every schema change. Lineage graph is institutional memory.

> see also: [13] SQL Style · [14] Data Quality

05 // SECURITY & CRYPTOGRAPHY

> Security is phase 0, not phase 2.
> Post-Quantum Cryptography readiness is a first-class concern. Target ML-KEM (FIPS 203, formerly CRYSTALS-Kyber), ML-DSA (FIPS 204, formerly CRYSTALS-Dilithium), and SHA-3/SHAKE.
> Crypto agility is mandatory. No algorithm hardcoded. Cipher suites, key sizes, and identifiers must be configurable and swappable.
> Security scanning is infrastructure: SAST, SCA, secret scanning, container scanning — on every commit.
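
A sketch of what "no algorithm hardcoded" can mean in practice: callers ask a registry for whatever identifier the configuration names. `KemProvider`, `KemRegistry`, and both stub types are hypothetical, and the stubs perform no cryptography at all; real implementations would wrap audited libraries.

```rust
use std::collections::HashMap;

/// Hypothetical provider interface for a key-encapsulation mechanism.
pub trait KemProvider {
    /// Stable identifier used in configuration and stored next to ciphertexts.
    fn algorithm_id(&self) -> &'static str;
    /// Public (encapsulation) key size in bytes.
    fn public_key_len(&self) -> usize;
}

// PQC DEBT: classical algorithm, listed only to show the swap.
pub struct X25519Stub;
pub struct MlKem768Stub;

impl KemProvider for X25519Stub {
    fn algorithm_id(&self) -> &'static str { "x25519" }
    fn public_key_len(&self) -> usize { 32 }
}

impl KemProvider for MlKem768Stub {
    fn algorithm_id(&self) -> &'static str { "ml-kem-768" }
    fn public_key_len(&self) -> usize { 1184 }
}

pub struct KemRegistry {
    providers: HashMap<&'static str, Box<dyn KemProvider>>,
}

impl KemRegistry {
    pub fn new() -> Self {
        Self { providers: HashMap::new() }
    }

    pub fn register(&mut self, provider: Box<dyn KemProvider>) {
        self.providers.insert(provider.algorithm_id(), provider);
    }

    /// Looks up the algorithm named in configuration. An unknown name is an
    /// explicit error, never a silent fallback to some default.
    pub fn select(&self, configured: &str) -> Result<&dyn KemProvider, String> {
        match self.providers.get(configured) {
            Some(provider) => Ok(provider.as_ref()),
            None => Err(format!("unknown KEM algorithm: {}", configured)),
        }
    }
}
```

Adding a new primitive is then one new `impl KemProvider` plus one `register` call; call sites that go through `select` do not change.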

CVE RESPONSE SLA:

CRITICAL: before next merge

HIGH: this sprint

MEDIUM: tracked

LOW: triaged & documented
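
The SLA above can be kept as data so a CI gate enforces it rather than a human remembering it. A small sketch; `Severity`, `response_sla`, and `merge_blocked` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Severity {
    Critical,
    High,
    Medium,
    Low,
}

/// The response policy, one arm per severity.
pub fn response_sla(severity: Severity) -> &'static str {
    match severity {
        Severity::Critical => "before next merge",
        Severity::High => "this sprint",
        Severity::Medium => "tracked",
        Severity::Low => "triaged & documented",
    }
}

/// Merge gate: any open critical finding blocks the merge.
pub fn merge_blocked(open_findings: &[Severity]) -> bool {
    open_findings.iter().any(|s| *s == Severity::Critical)
}
```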

06 // DEPENDENCY MANAGEMENT

> Pin versions exactly. No floating specifiers. Lock files are committed and treated as source artifacts.
> Every new dependency is a liability. Justify it. Document it. Audit regularly for staleness, abandonment, and CVEs.
> Circular dependencies are bugs. Detect in CI, fail the build.
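
Detecting a circular dependency is a depth-first search for a back edge. The sketch below assumes the dependency graph has already been extracted into a map from package name to its direct dependencies; `find_cycle` is a hypothetical helper, not the API of any build tool.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Mark {
    InProgress,
    Done,
}

/// Returns one dependency cycle if the graph has any, as a path that starts
/// and ends on the same package.
pub fn find_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Option<Vec<String>> {
    fn visit<'a>(
        node: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        marks: &mut HashMap<&'a str, Mark>,
        stack: &mut Vec<&'a str>,
    ) -> Option<Vec<String>> {
        match marks.get(node) {
            Some(Mark::Done) => return None,
            Some(Mark::InProgress) => {
                // Back edge: the cycle is the current path from the first
                // occurrence of `node`, closed by `node` itself.
                let start = stack.iter().position(|n| *n == node).unwrap_or(0);
                let mut cycle: Vec<String> =
                    stack[start..].iter().map(|n| n.to_string()).collect();
                cycle.push(node.to_string());
                return Some(cycle);
            }
            None => {}
        }
        marks.insert(node, Mark::InProgress);
        stack.push(node);
        if let Some(children) = deps.get(node) {
            for &child in children {
                if let Some(cycle) = visit(child, deps, marks, stack) {
                    return Some(cycle);
                }
            }
        }
        stack.pop();
        marks.insert(node, Mark::Done);
        None
    }

    let mut marks = HashMap::new();
    let mut stack = Vec::new();
    // Sort the roots so the reported cycle is the same on every run.
    let mut roots: Vec<&str> = deps.keys().copied().collect();
    roots.sort();
    for root in roots {
        if let Some(cycle) = visit(root, deps, &mut marks, &mut stack) {
            return Some(cycle);
        }
    }
    None
}
```

In CI, a `Some(cycle)` result fails the build and prints the path, so the fix starts from the exact edge that closed the loop.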

07 // TECHNOLOGY SOVEREIGNTY

> Open-source, self-hostable solutions are the default.
> No runtime dependency on big-tech hosted services. Their OSS projects evaluated on technical merit only.
> Self-deployment capability is mandatory. Docker Compose for local dev, container-orchestration-ready for production.

CURRENT DEFAULTS (snapshot, not a law — these evolve):

> Infra: Docker Compose · Traefik · Cloudflare DNS/DDoS · Hetzner VPS
> Data: PostgreSQL · dbt Core · Metabase OSS · Streamlit
> Languages: Rust by default; Python when the ecosystem demands it; TypeScript for frontend
> Auth: self-hosted Keycloak (not Auth0/Clerk/etc.)
> Rendering: Astro (static) · self-hosted fonts (no Google CDN)

08 // PROGRAMMATIC EFFICIENCY

> Algorithmic complexity is a design concern, not an optimization concern. Always know the complexity.
> Prefer O(1), then O(log n), O(n), O(n log n), O(n²), in that order. Verify no better alternative exists before accepting worse complexity.
> Premature optimization is still a sin. Get the design right first, then profile before tuning. But flag O(n²) or worse on sight.
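
A small illustration of "verify no better alternative exists": the same duplicate check written at two complexities. Function names are hypothetical.

```rust
use std::collections::HashSet;

/// O(n²) time, O(1) extra space: compares every pair.
// PERF: quadratic; shown only as the baseline for comparison.
pub fn has_duplicate_quadratic(ids: &[u64]) -> bool {
    for i in 0..ids.len() {
        for j in (i + 1)..ids.len() {
            if ids[i] == ids[j] {
                return true;
            }
        }
    }
    false
}

/// O(n) expected time, O(n) extra space: one pass with a hash set.
pub fn has_duplicate_linear(ids: &[u64]) -> bool {
    let mut seen = HashSet::with_capacity(ids.len());
    // `insert` returns false when the value was already present.
    ids.iter().any(|id| !seen.insert(*id))
}
```

Both return the same answer. The choice is a design decision made when the function is written: the linear version trades memory for time, and the quadratic one is only defensible when the input is known to stay tiny.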

09 // CODE QUALITY & ANNOTATION

> Readable by a stranger in 6 months. Functions do one thing.
> No magic numbers or strings — named constants with clear origin.
> Error handling is explicit. Silent failures are bugs.
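
A sketch of named constants and explicit errors together. The limit, the error type, and `validate_upload` are all hypothetical; the point is that every failure has a name and the number has an origin.

```rust
use std::fmt;

/// Upper bound on a single upload, in bytes.
/// Origin: hypothetical product requirement of 5 MiB per file.
pub const MAX_UPLOAD_BYTES: usize = 5 * 1024 * 1024;

#[derive(Debug, PartialEq)]
pub enum UploadError {
    Empty,
    TooLarge { size: usize, limit: usize },
}

impl fmt::Display for UploadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UploadError::Empty => write!(f, "upload is empty"),
            UploadError::TooLarge { size, limit } => {
                write!(f, "upload is {} bytes, limit is {}", size, limit)
            }
        }
    }
}

impl std::error::Error for UploadError {}

/// Returns the accepted size, or says exactly why the upload was rejected.
pub fn validate_upload(bytes: &[u8]) -> Result<usize, UploadError> {
    if bytes.is_empty() {
        return Err(UploadError::Empty);
    }
    if bytes.len() > MAX_UPLOAD_BYTES {
        return Err(UploadError::TooLarge {
            size: bytes.len(),
            limit: MAX_UPLOAD_BYTES,
        });
    }
    Ok(bytes.len())
}
```

The caller has to handle the `Result`; there is no path where a rejected upload is silently dropped.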

INLINE ANNOTATION TAGS:

// TECH DEBT: Shortcut, needs resolution

// PERF: Performance concern

// SECURITY: Security-sensitive, review carefully

// PQC DEBT: Classical crypto, needs migration

// FIXME: Broken, must fix before release

// SCHEMA: Fragile if schema changes

// VENDOR LOCK-IN: Vendor dependency, needs escape path

// DEPENDENCY RISK: Stale or under-maintained dep
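
Because the tags have a fixed shape, they can be scanned mechanically. A rough sketch of a release check, assuming `//` line comments and a tag followed directly by a colon; `find_annotations` and `blocks_release` are hypothetical, and the scan does not parse string literals or block comments.

```rust
const TAGS: [&str; 8] = [
    "TECH DEBT",
    "PERF",
    "SECURITY",
    "PQC DEBT",
    "FIXME",
    "SCHEMA",
    "VENDOR LOCK-IN",
    "DEPENDENCY RISK",
];

/// Returns (1-based line number, tag) for every annotation found.
pub fn find_annotations(source: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        // Only the text after the first `//` on the line is considered.
        if let Some(pos) = line.find("//") {
            let comment = line[pos + 2..].trim_start();
            for &tag in TAGS.iter() {
                if comment.starts_with(tag) && comment[tag.len()..].starts_with(':') {
                    hits.push((idx + 1, tag));
                }
            }
        }
    }
    hits
}

/// FIXME means "broken, must fix before release", so any FIXME blocks it.
pub fn blocks_release(source: &str) -> bool {
    find_annotations(source).iter().any(|(_, tag)| *tag == "FIXME")
}
```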

10 // CONVENTIONAL COMMITS

Write the subject line in the imperative mood. Imagine the phrase completes: "If applied, this commit will..."

<type>(optional scope): short summary

feat: New feature or capability

fix: Bug fix

docs: Documentation-only changes

refactor: Code change that neither fixes a bug nor adds a feature

perf: Performance improvement

test: Adding or correcting tests

chore: Routine tasks, build processes

style: Formatting changes

EXAMPLES:

Backend: feat(api): add patient redaction endpoint

Data: fix(etl): resolve timestamp anomaly in incremental load

Frontend: refactor(ui): extract vaporwave button into reusable component
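
The subject-line format is simple enough to validate in a commit hook. A minimal sketch that accepts only the `<type>(optional scope): short summary` shape and the eight types listed above; `parse_subject` and `CommitSubject` are hypothetical, and extras such as breaking-change markers are deliberately not handled.

```rust
const TYPES: [&str; 8] = [
    "feat", "fix", "docs", "refactor", "perf", "test", "chore", "style",
];

#[derive(Debug, PartialEq)]
pub struct CommitSubject<'a> {
    pub kind: &'a str,
    pub scope: Option<&'a str>,
    pub summary: &'a str,
}

pub fn parse_subject(line: &str) -> Result<CommitSubject<'_>, String> {
    // Everything before the first ": " is the type and optional scope.
    let (head, summary) = line
        .split_once(": ")
        .ok_or_else(|| "expected '<type>(scope): summary'".to_string())?;

    let (kind, scope) = match head.split_once('(') {
        Some((kind, rest)) => {
            let scope = rest
                .strip_suffix(')')
                .ok_or_else(|| "unclosed scope".to_string())?;
            if scope.is_empty() {
                return Err("empty scope".to_string());
            }
            (kind, Some(scope))
        }
        None => (head, None),
    };

    if !TYPES.contains(&kind) {
        return Err(format!("unknown type: {}", kind));
    }
    if summary.trim().is_empty() {
        return Err("empty summary".to_string());
    }
    Ok(CommitSubject { kind, scope, summary })
}
```

For `feat(api): add patient redaction endpoint` this yields kind `feat`, scope `Some("api")`, and the summary text; `feature: add thing` is rejected as an unknown type.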

11 // DOCUMENTATION & ISSUE TRACKING

> READMEs evolve with the code; they are not written afterward. Required: quickstart, config reference, architecture overview, testing, deployment, troubleshooting.
> The Minimal Working Example is non-negotiable. Runnable in under 10 minutes using only the README.
> Bugs that consume 1+ hour of debugging get documented. Format: Date, Environment, Symptoms, Root Cause, Fix, Prevention.
> Three or more similar issues = architectural smell. Escalate to an Architecture Decision Record.

12 // PREFERRED PATTERNS & IDIOMS

FUNDAMENTALS:

> Hexagonal Architecture (Ports & Adapters)

> Repository Pattern — data access abstracted

> Strategy Pattern — behaviors swappable

> Factory / Builder — readable object creation

> CQRS — separate read/write when justified

> Event-driven — decoupling across contexts

> Kimball Dimensional Modeling (star default)

> Dependency Injection over hard dependencies

PATTERNS I USE REPEATEDLY (from my own projects):

> Dual storage backends — identical port trait, swap at startup. Example: Postgres ↔ flat JSON in ResumeForge. Lets the same code ship in a database-present and database-absent form.
> Crypto-agile provider interfaces — a single CryptoProvider trait, multiple implementations. Example: Signal-Lens, so adding post-quantum primitives is a new implementation, not a rewrite.
> Independent crate/module boundaries with clean API contracts. Example: Nexus's 9-crate architecture — each crate does one thing, contract is its public types.
> Monorepo for development, independent publication for consumption. Example: CivicLens/Sentinel — develop together for velocity, ship separately so each crate can be consumed without the rest.
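
The dual-backend pattern in miniature: one port, two adapters, one decision at startup. This is an illustrative sketch, not ResumeForge's code. Every name is hypothetical, and both stores keep data in memory so the example stays self-contained.

```rust
use std::collections::BTreeMap;

/// Port shared by both backends.
pub trait DocumentStore {
    fn backend_name(&self) -> &'static str;
    fn put(&mut self, key: &str, value: &str);
    fn get(&self, key: &str) -> Option<String>;
}

/// Stand-in for a flat-file backend.
#[derive(Default)]
pub struct FlatFileStore {
    docs: BTreeMap<String, String>,
}

/// Stand-in for a database backend; a real one would hold a connection pool.
#[derive(Default)]
pub struct DatabaseStore {
    rows: BTreeMap<String, String>,
}

impl DocumentStore for FlatFileStore {
    fn backend_name(&self) -> &'static str { "flat-file" }
    fn put(&mut self, key: &str, value: &str) {
        self.docs.insert(key.to_string(), value.to_string());
    }
    fn get(&self, key: &str) -> Option<String> {
        self.docs.get(key).cloned()
    }
}

impl DocumentStore for DatabaseStore {
    fn backend_name(&self) -> &'static str { "database" }
    fn put(&mut self, key: &str, value: &str) {
        self.rows.insert(key.to_string(), value.to_string());
    }
    fn get(&self, key: &str) -> Option<String> {
        self.rows.get(key).cloned()
    }
}

/// Chosen once at startup, typically from configuration.
pub enum Backend {
    FlatFile,
    Database,
}

pub fn build_store(backend: Backend) -> Box<dyn DocumentStore> {
    match backend {
        Backend::FlatFile => Box::new(FlatFileStore::default()),
        Backend::Database => Box::new(DatabaseStore::default()),
    }
}

/// Application code sees only the port.
pub fn save_and_reload(store: &mut dyn DocumentStore, key: &str, value: &str) -> Option<String> {
    store.put(key, value);
    store.get(key)
}
```

Only `build_store` knows which backend exists. `save_and_reload` and everything above it behave the same whether or not a database is present.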

13 // SQL STYLE

> All lowercase keywords. SQL reads as prose, not shouting.
> Comma-first field lists. Enables running queries from the end without selecting all — cursor-friendly debugging.
> No blank lines between CTEs. Lets cursor-based execution target exactly one CTE at a time.
> CTEs over nested subqueries. Every named CTE is a unit of thought. Easier to read, easier to refactor, easier to debug.
> Explicit column lists. Never SELECT * in production — schema drift silently breaks downstream.
> snake_case for all identifiers. No ambiguity, no case-sensitivity surprises.
> Trailing commas on the last field only where the dialect accepts them (PostgreSQL does not). Comma-first lists already give clean one-line diffs when adding or removing columns.
> Table aliases always explicit, 1-3 chars. Every column reference names its table.
> Explicit JOIN type (INNER JOIN, LEFT JOIN), never bare JOIN — forces semantic intent.
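
A query written to these conventions, shown as it might be embedded in a Rust service. The tables (`fct_orders`, `dim_customers`) and columns are invented for illustration, and `uses_select_star` is a deliberately naive hypothetical lint, not a SQL parser.

```rust
/// Lowercase keywords, comma-first columns, CTEs with no blank lines between
/// them, short explicit aliases on every column, explicit join type.
pub const TOP_CUSTOMERS_SQL: &str = "\
with orders as (
    select
        o.customer_id
        , sum(o.amount_cents) as total_cents
    from fct_orders as o
    group by o.customer_id
),
ranked as (
    select
        c.customer_id
        , c.customer_name
        , t.total_cents
    from dim_customers as c
    inner join orders as t
        on t.customer_id = c.customer_id
)
select
    r.customer_id
    , r.customer_name
    , r.total_cents
from ranked as r
order by r.total_cents desc
limit 10
";

/// Flags `select *` by looking at adjacent whitespace-separated tokens.
pub fn uses_select_star(sql: &str) -> bool {
    let tokens: Vec<&str> = sql.split_whitespace().collect();
    tokens
        .windows(2)
        .any(|pair| pair[0].eq_ignore_ascii_case("select") && pair[1].starts_with('*'))
}
```

With comma-first lists, commenting out the last column or running the query up to any CTE needs no other edits, which is the debugging workflow the rules above are built around.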

> see also: [04] Data Modeling · [14] Data Quality

14 // DATA QUALITY & PIPELINE TESTING

> Pipelines get validation tests, not just code tests. A pipeline that runs green but produces wrong data is still broken.
> Edge cases and failure paths tested explicitly. What happens on empty input? Malformed rows? Duplicate keys? Late-arriving data?
> Data contracts between upstream sources and downstream consumers. Breaks are detected at the boundary before they propagate.
> Monitoring and alerting for pipeline health. Run duration, row counts, freshness, error rates — all tracked.
> Test counts are minimum coverage, not a target. "900+ tests" isn't the brag — "every boundary and edge case covered" is.
> Data tests and code tests counted separately. They catch different bugs and belong to different disciplines. Don't blend the numbers.
> Test at the layer boundary: staging gets source-freshness and row-count, marts get business-logic, analytics gets k-anonymity and small-group suppression.
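
A sketch of a data validation test as distinct from a code test: it inspects a batch of rows, not a function's behavior. `Row`, `DataIssue`, and `validate_batch` are hypothetical, and the rules (non-empty batch, key present, key unique, amount not negative) are examples rather than a complete contract.

```rust
use std::collections::HashSet;

/// One staged row.
pub struct Row {
    pub id: Option<u64>,
    pub amount_cents: i64,
}

#[derive(Debug, PartialEq)]
pub enum DataIssue {
    EmptyBatch,
    MissingKey { row: usize },
    DuplicateKey { id: u64 },
    NegativeAmount { row: usize },
}

/// Collects every issue instead of stopping at the first, so one run
/// reports the full state of the batch. Row indexes are zero-based.
pub fn validate_batch(rows: &[Row]) -> Vec<DataIssue> {
    if rows.is_empty() {
        return vec![DataIssue::EmptyBatch];
    }
    let mut issues = Vec::new();
    let mut seen = HashSet::new();
    for (i, row) in rows.iter().enumerate() {
        match row.id {
            None => issues.push(DataIssue::MissingKey { row: i }),
            Some(id) => {
                if !seen.insert(id) {
                    issues.push(DataIssue::DuplicateKey { id });
                }
            }
        }
        if row.amount_cents < 0 {
            issues.push(DataIssue::NegativeAmount { row: i });
        }
    }
    issues
}
```

A pipeline step would fail, or quarantine the batch, when the returned list is non-empty. The pipeline code can be fully green under unit tests and still trip this check on bad input, which is why the two kinds of test are counted separately.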

> see also: [02] TDD · [04] Data Modeling · [17] Observability

15 // AI & LLM INTEGRATION

> PII stripped before any outbound API call. If it's personal, it doesn't leave the machine unless the user has consented and the data has been transformed.
> User sees the exact payload in a consent dialog before send. Nothing goes over the wire that the user didn't see first.
> Every call logged: model, prompt, response, tokens, cost, timestamp. No silent spend, full audit trail.
> Multi-model evaluation. Don't trust a single vendor. Test the same prompt across models before baking the choice into a system.
> Determinism boundaries are marked. Schema-validated outputs are one thing; probabilistic outputs are another. Readers of the code always know which is which.
> Cost awareness is first-class. Per-conversation cost tracking, split input/output pricing per model, alerts on anomalous spend.
> Fallback behavior defined for refusal, parse failure, and timeout. Real systems break at these boundaries; an LLM without a fallback is a production outage waiting to happen.
> AI-generated content tracked with provenance. Which model, which prompt, which version, which human approved it before it shipped.
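
Two of the points above, split pricing and defined fallbacks, sketched as plain types. Nothing here is a vendor API: `Pricing`, `LlmOutcome`, `NextStep`, and the specific fallback choices are assumptions made for illustration.

```rust
use std::time::Duration;

/// Per-model pricing in micro-USD per million tokens, split by direction.
/// Integer units keep cost totals free of floating-point drift.
pub struct Pricing {
    pub input_per_mtok: u64,
    pub output_per_mtok: u64,
}

/// Cost of one call in micro-USD, rounded down.
pub fn call_cost_micro_usd(pricing: &Pricing, input_tokens: u64, output_tokens: u64) -> u64 {
    (input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok) / 1_000_000
}

/// Every way a call can end, as seen by the caller.
pub enum LlmOutcome {
    /// Output passed schema validation.
    Parsed(String),
    Refusal,
    ParseFailure { raw: String },
    Timeout(Duration),
}

/// What happens next. Each failure mode has a defined path.
#[derive(Debug, PartialEq)]
pub enum NextStep {
    Use(String),
    ShowManualForm,
    RetryWithStricterPrompt,
    RetryLater,
}

pub fn next_step(outcome: LlmOutcome) -> NextStep {
    match outcome {
        LlmOutcome::Parsed(value) => NextStep::Use(value),
        LlmOutcome::Refusal => NextStep::ShowManualForm,
        LlmOutcome::ParseFailure { .. } => NextStep::RetryWithStricterPrompt,
        LlmOutcome::Timeout(_) => NextStep::RetryLater,
    }
}
```

Because `next_step` matches exhaustively, adding a new outcome variant will not compile until its fallback is decided. The cost function would feed the per-call audit log alongside model, prompt, response, and timestamp.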

> see also: [05] Security · [16] Privacy · [17] Observability

16 // PRIVACY ENGINEERING

> Privacy is a design constraint, not a compliance checkbox. If you bolt it on at the end, you've already lost.
> Data minimization by default. Collect the least you need to do the job. Every field is a future liability.
> PII classified at the column level. Column-level tags drive tooling: masking, access control, auditing, retention.
> Jurisdiction-aware consent tracking where applicable. Consent granted in one jurisdiction doesn't grant access in another.
> Retention policies defined at schema level, not buried in application code. The schema knows when a record expires.
> Pseudonymization over anonymization where re-linkage may be needed. hash_pii() with a stable, secret salt preserves joinability without exposing the original.
> Right-to-be-forgotten workflows: cascading deletes, tombstones where needed, and an audit trail of the deletion itself.
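
Column-level classification driving tooling, reduced to its smallest form. The three classes and the masking rules are assumptions for the sketch; a real catalog would carry retention and access policy on the same tag.

```rust
/// Column-level PII classification.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PiiClass {
    /// Not personal data.
    None,
    /// Identifies a person on its own (email, full name).
    Direct,
    /// Identifies a person only in combination (postcode, birth year).
    Quasi,
}

pub struct Column {
    pub name: &'static str,
    pub class: PiiClass,
}

/// Masking is decided by the tag alone, so a new column needs a tag,
/// not a change in every consumer.
pub fn mask(value: &str, class: PiiClass) -> String {
    match class {
        PiiClass::None => value.to_string(),
        // Direct identifiers are never shown.
        PiiClass::Direct => "***".to_string(),
        // Quasi-identifiers keep only a coarse three-character prefix.
        PiiClass::Quasi => {
            let prefix: String = value.chars().take(3).collect();
            format!("{}***", prefix)
        }
    }
}

/// Applies the per-column policy to one row of values.
pub fn mask_row(columns: &[Column], values: &[&str]) -> Vec<String> {
    columns
        .iter()
        .zip(values)
        .map(|(column, value)| mask(value, column.class))
        .collect()
}
```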

> see also: [05] Security · [15] AI/LLM

17 // OBSERVABILITY & MONITORING

> Structured logs with consistent fields (timestamp, level, request_id, user_id where safe). Logs are a queryable stream, not a stack of strings.
> Log levels used intentionally. DEBUG / INFO / WARN / ERROR / FATAL each mean a specific thing. Don't INFO everything.
> Health endpoints on every long-running service. Orchestrator needs to know if the thing is alive AND ready.
> Metrics over logs for anything counted: request counts, latencies, error rates. Log lines for discrete events; metrics for aggregates.
> Dashboards answer questions; they don't just collect metrics. Show the operator what they need to decide; hide everything else.
> Alerting thresholds tuned to avoid fatigue. Every page must be actionable. A page that can't be acted on is noise training people to ignore the real ones.
> PII never in logs. Cross-reference [16] — logs are long-lived and widely readable.
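
A structured log line with fixed fields, written with the standard library only so the sketch stays self-contained. In a real service I would assume a logging or serialization crate instead of the hand-rolled `escape_json`; `Level` and `log_line` are hypothetical names. `user_id` is left out here because it is only logged where safe.

```rust
/// Minimal JSON string escaping for the sketch.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

#[derive(Clone, Copy)]
pub enum Level {
    Debug,
    Info,
    Warn,
    Error,
    Fatal,
}

impl Level {
    fn as_str(self) -> &'static str {
        match self {
            Level::Debug => "DEBUG",
            Level::Info => "INFO",
            Level::Warn => "WARN",
            Level::Error => "ERROR",
            Level::Fatal => "FATAL",
        }
    }
}

/// One event as a single JSON line with a fixed field order.
/// The timestamp is passed in so the function stays deterministic under test.
pub fn log_line(timestamp: &str, level: Level, request_id: &str, message: &str) -> String {
    format!(
        "{{\"timestamp\":\"{}\",\"level\":\"{}\",\"request_id\":\"{}\",\"message\":\"{}\"}}",
        escape_json(timestamp),
        level.as_str(),
        escape_json(request_id),
        escape_json(message)
    )
}
```

Every line has the same keys in the same order, so the log stream can be filtered by `level` or joined on `request_id` without regexes.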

> see also: [05] Security · [14] Data Quality · [16] Privacy

18 // CODE REVIEW

> Every non-trivial change gets review. Solo projects included: leave a self-review checklist in the PR so future-you can audit the change.
> Reviewers check: does it do what the commit message says? Are tests present and correct? Is error handling explicit? Does it regress anything obvious?
> Reviewers don't check style (the linter does that) or personal preference (call it out as preference, not a block).
> Small PRs get faster review. Batch by concern, not by day. A huge multi-concern PR is a bad PR before anyone reads it.
> Authors respond to every comment, even if just to acknowledge. Silent dismissal is disrespectful and builds debt.
> Approvals require all conversations resolved. An open thread is an open question; ship the answer, not the question.
> No self-approval on shared codebases. Self-merge only after a cooling-off period and only with documented justification.

> see also: [02] TDD · [10] Conventional Commits