PRIVACY PIPELINE
> GENERATE → STAGE → TRANSFORM → ANALYZE // PRIVACY-FIRST DATA FLOWS
GENERATE — Synthetic Privacy Data
Python script generates 50K users with PII, 2M viewing events, 200K consent events, 10K DSARs, and 50K retention actions across 20 jurisdictions. Pareto viewing distribution, jurisdiction-weighted consent rates, Poisson DSAR arrivals. All data fully synthetic — no real PII used, even in demos.
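The three distributions named above can be sketched with the standard library alone. This is a minimal illustration, not the portfolio's actual generator: the function names, the Pareto shape parameter, and the per-jurisdiction consent rates are all hypothetical.

```python
import random

# Hypothetical opt-in probabilities per jurisdiction (not from the source).
CONSENT_RATES = {"DE": 0.45, "US-CA": 0.70}


def generate_viewing_counts(n_users, alpha=1.16, seed=42):
    """Draw a heavy-tailed (Pareto) number of viewing events per user."""
    rng = random.Random(seed)
    # paretovariate returns values >= 1, so every user has at least one event.
    return [int(rng.paretovariate(alpha)) for _ in range(n_users)]


def generate_dsar_arrivals(rate_per_day, n_days, seed=42):
    """Simulate Poisson arrivals via exponential inter-arrival times (in days)."""
    rng = random.Random(seed)
    arrivals = []
    t = rng.expovariate(rate_per_day)
    while t < n_days:
        arrivals.append(t)
        t += rng.expovariate(rate_per_day)
    return arrivals


def generate_consent(jurisdiction, rng):
    """Return True (consent granted) with the jurisdiction's assumed probability."""
    return rng.random() < CONSENT_RATES[jurisdiction]
```

Seeding each generator keeps the synthetic dataset reproducible between runs, which helps when downstream dbt tests assert on row counts.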
STAGE — Privacy Staging Schema
COPY-based bulk loading into privacy_staging.
Raw PII retained here for transformation.
Separate privacy_reader role cannot access this schema; least privilege enforced.
TRANSFORM — dbt Privacy Models
20 dbt models: 7 staging views, 9 mart tables (5 dims + 4 facts), 4 analytics tables.
PII hashed via hash_pii() macro.
Age generalized to bands.
Every column tagged with data_classification.
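The hashing and age-banding steps can be illustrated outside dbt. The source does not show the body of the hash_pii() macro, so this sketch assumes a keyed HMAC-SHA-256 over a normalized value; the band boundaries and the Python function names are likewise assumptions.

```python
import hashlib
import hmac

# Assumed band boundaries; the source only says ages are generalized to bands.
AGE_BANDS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]


def hash_pii(value, salt):
    """Return a keyed SHA-256 digest of a normalized PII value.

    `salt` is a secret key (bytes); without it, common values such as
    email addresses could be recovered by hashing guesses.
    """
    normalized = value.strip().lower()
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def age_band(age):
    """Generalize an exact age to a coarse band label."""
    if age < 18:
        return "under_18"
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"
```

Normalizing before hashing means the same email with different casing or stray whitespace maps to one digest, so joins on the hashed column still work.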
ANALYZE — Privacy-Safe Analytics
k-anonymity enforcement via suppress_small_groups() macro.
Consent compliance rates by jurisdiction. DSAR SLA performance tracking.
Retention conflict detection (regulatory minimum vs privacy maximum).
No user-level data in analytics layer.
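The idea behind small-group suppression can be sketched in a few lines. The real suppress_small_groups() macro is not shown in the source, so the threshold k=5, the column names, and this Python signature are assumptions for illustration.

```python
from collections import Counter


def suppress_small_groups(rows, quasi_identifiers, k=5):
    """Aggregate rows by quasi-identifiers and drop groups with fewer than k members.

    Returns one dict per surviving group with a `user_count` column, so no
    user-level rows leave the function.
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [
        {**dict(zip(quasi_identifiers, key)), "user_count": n}
        for key, n in sorted(counts.items())
        if n >= k
    ]
```

For example, five users in ("DE", "25-34") survive with k=5, while two users in ("FR", "65+") are dropped rather than reported as a re-identifiable small cell.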
Each Data Subject Access Request passes through a 5-stage pipeline with jurisdiction-specific SLAs. Stage transitions are tracked as a separate fact table for audit trail compliance. 80% of ACCESS requests are automated; DELETION and PORTABILITY require manual review.
RECEIVED: SLA timer starts
IDENTITY VERIFIED: Requestor confirmed
DATA COLLECTED: Cross-system gather
REVIEW: Legal/privacy check
FULFILLED: Response delivered
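The five stages and the per-jurisdiction SLA check can be sketched as a small state machine. The stage identifiers use underscores in place of spaces, and the SLA day counts, the regime keys, and the function names are illustrative assumptions rather than values taken from the pipeline.

```python
from datetime import datetime, timedelta

STAGES = ("RECEIVED", "IDENTITY_VERIFIED", "DATA_COLLECTED", "REVIEW", "FULFILLED")

# Illustrative response windows in days; the real per-jurisdiction values are assumed.
SLA_DAYS = {"GDPR": 30, "CCPA": 45}


def next_stage(current):
    """Return the stage that must follow `current`."""
    idx = STAGES.index(current)
    if idx == len(STAGES) - 1:
        raise ValueError("FULFILLED is the terminal stage")
    return STAGES[idx + 1]


def record_transition(log, request_id, to_stage, at):
    """Append one stage-transition row, similar to a row in a stage fact table."""
    expected = next_stage(log[-1]["stage"]) if log else STAGES[0]
    if to_stage != expected:
        raise ValueError(f"expected {expected}, got {to_stage}")
    log.append({"request_id": request_id, "stage": to_stage, "entered_at": at})


def met_sla(log, regime):
    """True if the request reached FULFILLED within the regime's window."""
    if not log or log[-1]["stage"] != "FULFILLED":
        return False
    deadline = log[0]["entered_at"] + timedelta(days=SLA_DAYS[regime])
    return log[-1]["entered_at"] <= deadline
```

Keeping every transition as its own row, rather than overwriting a status column, is what makes the audit trail reconstructable after the fact.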
Retention policies create inherent legal conflicts.
Payment records require minimum 7-year retention under tax law, but GDPR's storage limitation
principle demands deletion once data is no longer necessary. The model captures these conflicts explicitly
via a has_retention_conflict flag.
MUST RETAIN (Tax Law)
Payment records: 2,555 days (~7 years)
Deletion method: ANONYMIZE (not hard delete)
MUST DELETE (GDPR)
Session logs: 90 days
Deletion method: HARD_DELETE
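One way to derive the has_retention_conflict flag is to compare the regulatory minimum against the privacy maximum for each data category. The 2,555-day and 90-day figures and the deletion methods come from the cards above; the field names and the 730-day privacy maximum for payment records are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    data_category: str
    regulatory_min_days: Optional[int]  # None when no law mandates retention
    privacy_max_days: Optional[int]     # None when no privacy ceiling applies
    deletion_method: str

    @property
    def has_retention_conflict(self):
        """A conflict exists when the data must be kept longer than privacy rules allow."""
        if self.regulatory_min_days is None or self.privacy_max_days is None:
            return False
        return self.regulatory_min_days > self.privacy_max_days


# 7 * 365 = 2,555 days; the 730-day ceiling is an assumed value for illustration.
payment_records = RetentionPolicy("payment_records", 7 * 365, 730, "ANONYMIZE")
session_logs = RetentionPolicy("session_logs", None, 90, "HARD_DELETE")
```

Under these assumptions payment_records is flagged and session_logs is not, which matches the ANONYMIZE choice: the record survives for tax purposes while the link to the individual is removed.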
flowchart LR
subgraph GEN["Data Generation"]
users["users_privacy"]
views["viewing_history"]
consent["consent_events"]
dsar["dsar_requests"]
stages["dsar_stages"]
retention["retention_policies"]
actions["retention_actions"]
end
subgraph STG["privacy_staging (views)"]
stg_users["stg_privacy_users"]
stg_views["stg_viewing_history"]
stg_consent["stg_consent_events"]
stg_dsar["stg_dsar_requests"]
stg_stages["stg_dsar_stages"]
stg_retention["stg_retention_policies"]
stg_actions["stg_retention_actions"]
end
subgraph MART["privacy_mart (tables)"]
dim_users["dim_users_privacy"]
dim_jur["dim_jurisdictions"]
dim_purpose["dim_consent_purposes"]
dim_cat["dim_data_categories"]
fct_consent["fct_consent_events"]
fct_dsar["fct_dsar_requests"]
fct_stages["fct_dsar_stages"]
fct_ret["fct_retention_actions"]
end
subgraph ANALYTICS["privacy_analytics (tables)"]
safe_view["mart_privacy_safe_viewing"]
compliance["mart_consent_compliance"]
sla["mart_dsar_sla_performance"]
coverage["mart_retention_coverage"]
end
users --> stg_users
views --> stg_views
consent --> stg_consent
dsar --> stg_dsar
stages --> stg_stages
retention --> stg_retention
actions --> stg_actions
stg_users --> dim_users
stg_consent --> fct_consent
stg_dsar --> fct_dsar
stg_stages --> fct_stages
stg_actions --> fct_ret
stg_retention --> fct_ret
dim_users --> fct_consent
dim_jur --> fct_consent
dim_purpose --> fct_consent
dim_users --> fct_dsar
dim_jur --> fct_dsar
fct_dsar --> fct_stages
fct_consent --> compliance
fct_dsar --> sla
fct_ret --> coverage
stg_views --> safe_view
dim_users --> safe_view

This portfolio uses dbt Core on PostgreSQL, which is appropriate for demonstration scale. In production at streaming-service scale, these patterns would map to: