AD PIPELINE
> GENERATE → LOAD → TRANSFORM → ANALYZE // AD PLATFORM DATA FLOWS
GENERATE — Synthetic Ads Data
Python script generates 200K+ ad impressions across 10 verticals, 50 campaigns, and 30 advertisers. Impression timing follows viewing-hour distributions with prime-time peaks. Auction prices use log-normal distributions. Completion rates vary by ad slot position (pre-roll, mid-roll, post-roll).
LOAD — PostgreSQL Bulk Ingest
COPY-based bulk loading into ads_staging. Read-only portfolio_reader role granted SELECT on mart and analytics schemas only. Staging schema remains inaccessible to the presentation layer.
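A minimal sketch of the load and grant steps, assuming the schema and table names shown in the diagram below; the file path is illustrative:

```sql
-- Bulk load into the staging schema (server-side COPY; \copy is the psql client variant)
COPY ads_staging.ad_impressions
FROM '/data/ad_impressions.csv'
WITH (FORMAT csv, HEADER true);

-- Presentation layer connects through a read-only role
CREATE ROLE portfolio_reader NOLOGIN;
GRANT USAGE ON SCHEMA ads_mart, ads_analytics TO portfolio_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA ads_mart, ads_analytics TO portfolio_reader;
-- No grants on ads_staging, so staged data stays invisible to readers.
```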
TRANSFORM — dbt Ad Models
17 dbt models: 6 staging views, 7 mart tables (5 dims + 2 facts), 4 analytics tables.
Surrogate keys via dbt_utils.generate_surrogate_key().
Campaign pacing calculated as a dedicated fact table with daily snapshots.
HHI concentration computed as an analytics-layer aggregate.
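As an illustration of the surrogate-key pattern, a staging-to-dim model might look like the following; the column names are assumed, not the project's actual schema:

```sql
-- models/mart/dim_campaigns.sql (illustrative)
select
    {{ dbt_utils.generate_surrogate_key(['campaign_id']) }} as campaign_key,
    campaign_id,
    campaign_name,
    advertiser_id,
    start_date,
    end_date,
    daily_budget
from {{ ref('stg_campaigns') }}
```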
ANALYZE — Platform Health & Campaign Performance
Analytics layer computes platform-level KPIs (fill rate, eCPM, RPSH), per-campaign pacing status with over/under detection, frequency cap compliance rates, and advertiser revenue mix with HHI concentration scoring. All metrics have defined thresholds for operational alerting.
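Two of these checks can be sketched in plain Python. The thresholds and field names here are illustrative assumptions, not the analytics models' actual logic: HHI is the sum of squared revenue shares, and pacing compares spend to date against a linear spend target.

```python
def hhi(revenue_by_advertiser):
    """Herfindahl-Hirschman Index over a {advertiser: revenue} mapping.
    Shares are fractions, so HHI ranges from near 0 (fragmented) to 1 (monopoly)."""
    total = sum(revenue_by_advertiser.values())
    return sum((rev / total) ** 2 for rev in revenue_by_advertiser.values())

def pacing_status(spend_to_date, daily_budget, days_elapsed, tolerance=0.10):
    """Flag a campaign as over-/under-pacing vs. its linear spend target.
    The 10% tolerance band is an assumed threshold, not the project's."""
    expected = daily_budget * days_elapsed
    ratio = spend_to_date / expected
    if ratio > 1 + tolerance:
        return "over_pacing"
    if ratio < 1 - tolerance:
        return "under_pacing"
    return "on_track"

# One dominant advertiser pushes concentration up:
print(hhi({"a": 50, "b": 25, "c": 25}))  # 0.375
# Spending $1,300 against a $100/day budget 10 days in is 30% ahead of plan:
print(pacing_status(spend_to_date=1300, daily_budget=100, days_elapsed=10))  # over_pacing
```

In the pipeline itself these live as SQL aggregates in the analytics layer; the functions above just make the arithmetic behind the alert thresholds explicit.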
```mermaid
flowchart LR
  subgraph GEN["Data Generation"]
    impressions["ad_impressions"]
    campaigns["campaigns"]
    creatives["creatives"]
    advertisers["advertisers"]
    subscribers["subscribers"]
    content["content_catalog"]
  end
  subgraph STG["ads_staging (views)"]
    stg_imp["stg_ad_impressions"]
    stg_camp["stg_campaigns"]
    stg_cre["stg_creatives"]
    stg_adv["stg_advertisers"]
    stg_sub["stg_subscribers"]
    stg_con["stg_content"]
  end
  subgraph MART["ads_mart (tables)"]
    dim_camp["dim_campaigns"]
    dim_adv["dim_advertisers"]
    dim_con["dim_content"]
    dim_sub["dim_subscribers"]
    dim_date["dim_date_ads"]
    fct_imp["fct_ad_impressions"]
    fct_pacing["fct_campaign_pacing"]
  end
  subgraph ANALYTICS["ads_analytics (tables)"]
    platform["mart_platform_health"]
    pacing["mart_campaign_performance"]
    frequency["mart_frequency_analysis"]
    revenue["mart_revenue_mix"]
  end
  impressions --> stg_imp
  campaigns --> stg_camp
  creatives --> stg_cre
  advertisers --> stg_adv
  subscribers --> stg_sub
  content --> stg_con
  stg_imp --> fct_imp
  stg_camp --> dim_camp
  stg_camp --> fct_pacing
  stg_cre --> dim_camp
  stg_adv --> dim_adv
  stg_sub --> dim_sub
  stg_con --> dim_con
  dim_camp --> fct_imp
  dim_adv --> fct_imp
  dim_con --> fct_imp
  dim_sub --> fct_imp
  dim_date --> fct_imp
  dim_camp --> fct_pacing
  fct_imp --> platform
  fct_imp --> frequency
  fct_imp --> revenue
  fct_pacing --> pacing
```
```python
# Ad impressions follow viewing-hour patterns:
# prime time (7-11pm) gets 3x the base impression volume,
# and weekend viewing hours shift earlier.
import numpy as np
from datetime import datetime, timedelta

# Illustrative hourly weights (assumed values): peak at 8-10pm, trough at 3-5am.
HOUR_WEIGHTS = np.array([
    2.0, 1.5, 1.0, 0.5, 0.5, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0,
    4.0, 4.0, 4.0, 5.0, 6.0, 7.0, 9.0, 12.0, 12.0, 12.0, 8.0, 4.0,
])
HOUR_WEIGHTS /= HOUR_WEIGHTS.sum()

def generate_impression_timestamps(n_impressions, start_date, end_date):
    """Generate timestamps with a realistic viewing-hour distribution."""
    hours = np.random.choice(
        range(24),
        size=n_impressions,
        p=HOUR_WEIGHTS,  # peak at 8-10pm, trough at 3-5am
    )
    # Spread impressions uniformly over the date range, then apply the sampled hour.
    n_days = (end_date - start_date).days
    day_offsets = np.random.randint(0, n_days, size=n_impressions)
    minutes = np.random.randint(0, 60, size=n_impressions)
    return [
        start_date + timedelta(days=int(d), hours=int(h), minutes=int(m))
        for d, h, m in zip(day_offsets, hours, minutes)
    ]

n = 200_000  # matches the 200K+ impression target above

# Auction clearing prices: log-normal (median ~$12 eCPM, since e^2.5 ≈ 12.18)
clearing_prices = np.random.lognormal(mean=2.5, sigma=0.6, size=n)

# Completion rates by ad slot position
COMPLETION_RATES = {
    'pre_roll': np.random.beta(46, 4, size=n),   # ~92% mean
    'mid_roll': np.random.beta(34, 6, size=n),   # ~85% mean
    'post_roll': np.random.beta(11, 9, size=n),  # ~55% mean
}
```

This portfolio uses dbt Core on PostgreSQL — appropriate for demonstration scale. In production at streaming-ad-platform scale, these patterns would map to: