DATA PIPELINE
> END-TO-END ARCHITECTURE // GENERATE → LOAD → TRANSFORM → VISUALIZE
GENERATE — Synthetic Telemetry
Python script generates 500K realistic LLM API requests using statistical distributions: log-normal latency, Pareto token counts, sinusoidal diurnal + weekly traffic patterns, weighted HTTP status codes, Bernoulli safety flags.
LOAD — PostgreSQL Bulk Ingest
COPY-based bulk loading into raw_staging schema. Creates read-only portfolio_reader role. Idempotent: DROP IF EXISTS + CREATE pattern for safe re-runs.
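A minimal sketch of the COPY ingest path, assuming a CSV file produced by the generator. The table name matches the raw_staging schema shown in the lineage diagram, but the column names here are illustrative; the driver call is shown in the docstring because it needs a live PostgreSQL connection.

```python
def copy_sql(table: str, columns: list[str]) -> str:
    """Build a COPY ... FROM STDIN statement for streaming a CSV file.

    With psycopg2 this would run as:
        with conn.cursor() as cur, open("api_requests.csv") as f:
            cur.copy_expert(copy_sql("raw_staging.api_requests", cols), f)
    """
    return (
        f"COPY {table} ({', '.join(columns)}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)"
    )

print(copy_sql("raw_staging.api_requests", ["request_id", "user_id", "latency_ms"]))
# → COPY raw_staging.api_requests (request_id, user_id, latency_ms) FROM STDIN WITH (FORMAT csv, HEADER true)
```

Streaming through COPY avoids per-row INSERT overhead, which is what makes a 500K-row load fast.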
TRANSFORM — dbt Core
14 dbt models across 3 layers: staging (views), star schema (tables), snowflake schema (tables). 143 tests covering not_null, unique, relationships, and accepted_values constraints. Surrogate keys via dbt_utils.generate_surrogate_key.
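For illustration, this is roughly what dbt_utils.generate_surrogate_key compiles down to, restated in Python: an md5 over the hyphen-joined, null-coalesced column values (the sentinel string is the macro's internal default).

```python
import hashlib

NULL_SENTINEL = "_dbt_utils_surrogate_key_null_"  # the macro's null placeholder

def surrogate_key(*values) -> str:
    """md5 of the '-'-joined, null-coalesced inputs, mirroring the key
    shape dbt_utils.generate_surrogate_key produces in SQL."""
    parts = [NULL_SENTINEL if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode()).hexdigest()

key = surrogate_key("user_42", None, "gpt-4")
```

Hashing rather than concatenating keeps dimension keys fixed-width and stable across re-runs, which the relationship tests depend on.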
VISUALIZE — Metabase + Streamlit
Self-hosted Metabase for executive dashboards with embedded iframes. Self-hosted Streamlit for the interactive Token Forecaster application. Both run as Docker containers with health checks.
traefik
Reverse proxy | TLS termination
Only entry point — ports 80/443
portfolio_db
PostgreSQL | internal network
Primary analytical database
metabase
Metabase OSS | public dashboard only
Executive dashboards & BI
streamlit
Streamlit | internal network
Token Forecaster application
frontend
Nginx (static) | Astro build
Portfolio site — static HTML/CSS/JS
sonarqube
SonarQube CE | dev-only
Static analysis & quality gates
services:
  traefik:
    image: traefik:v3.4
    ports: ["80:80", "443:443"]  # Only exposed ports
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
  portfolio_db:
    image: postgres:16
    networks: [internal]  # No host port — internal only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $PORTFOLIO_DB_USER"]
  metabase:
    labels:
      # Only /public/ paths routed — admin blocked
      - "traefik.http.routers.metabase-public.rule=PathPrefix(`/metabase/public`)"
    networks: [internal, web]
  streamlit:
    labels:
      - "traefik.http.routers.streamlit.rule=PathPrefix(`/forecaster`)"
    networks: [internal, web]

flowchart LR
subgraph SRC["raw_staging"]
api_requests["api_requests"]
users["users"]
models["models"]
endpoints["endpoints"]
dates["dates"]
end
subgraph STG["staging (views)"]
stg_api["stg_api_requests"]
stg_users["stg_users"]
stg_models["stg_models"]
stg_endpoints["stg_endpoints"]
stg_dates["stg_dates"]
end
subgraph STAR["public_star"]
dim_users["dim_users"]
dim_models_s["dim_models"]
dim_date_s["dim_date"]
dim_endpoints_s["dim_endpoints"]
fct_star["fct_api_requests"]
end
subgraph SNOW["public_snowflake"]
dim_companies["dim_companies"]
dim_billing["dim_billing_tiers"]
dim_users_sf["dim_users_sf"]
dim_models_sf["dim_models"]
dim_date_sf["dim_date"]
dim_endpoints_sf["dim_endpoints"]
fct_snow["fct_api_requests_sf"]
end
api_requests --> stg_api
users --> stg_users
models --> stg_models
endpoints --> stg_endpoints
dates --> stg_dates
stg_users --> dim_users
stg_models --> dim_models_s
stg_dates --> dim_date_s
stg_endpoints --> dim_endpoints_s
stg_api --> fct_star
dim_users --> fct_star
dim_models_s --> fct_star
dim_date_s --> fct_star
dim_endpoints_s --> fct_star
stg_users --> dim_companies
stg_users --> dim_users_sf
dim_companies --> dim_users_sf
dim_billing --> dim_users_sf
stg_models --> dim_models_sf
stg_dates --> dim_date_sf
stg_endpoints --> dim_endpoints_sf
stg_api --> fct_snow
dim_users_sf --> fct_snow
dim_models_sf --> fct_snow
dim_date_sf --> fct_snow
dim_endpoints_sf --> fct_snow

The synthetic dataset is not random noise — each column uses a domain-appropriate statistical distribution that mirrors real-world LLM API telemetry patterns. This matters because realistic data exposes real query planner behaviors and index selectivity.
LATENCY — Log-Normal
μ=6.5, σ=0.8 → median ~665ms, p99 ~4,300ms
Why: API latencies cluster around a median but exhibit heavy right tails from cold starts, queue contention, and large context windows. Log-normal captures this naturally — multiplicative noise on a base latency.
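Those numbers can be spot-checked in a few lines of NumPy (sample size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
latency_ms = rng.lognormal(mean=6.5, sigma=0.8, size=100_000)

median = float(np.median(latency_ms))       # theoretical median: e^6.5 ≈ 665 ms
p99 = float(np.percentile(latency_ms, 99))  # heavy right tail, several seconds
```

The log-normal's multiplicative structure means the tail stretches far past the median without any extra tuning.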
TOKEN COUNTS — Pareto
α=1.5, scale=50 → most requests small, heavy tail to 100K
Why: Token usage follows a power law — most calls use short prompts, but a small fraction of enterprise users send massive context windows. Pareto distribution models this 80/20 pattern.
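A sketch of the token draw; note that NumPy's pareto() samples the Lomax form, so shifting by 1 and scaling gives a classic Pareto with a floor equal to the scale parameter.

```python
import numpy as np

rng = np.random.default_rng(42)
# Shift-and-scale the Lomax draw into a Pareto I with floor = 50 tokens,
# then clip the tail at 100K as described above.
tokens = np.minimum((rng.pareto(1.5, size=100_000) + 1) * 50, 100_000)
```

With α=1.5 the theoretical median is 50·2^(1/1.5) ≈ 79 tokens, yet the clipped maximum still reaches the 100K cap — exactly the 80/20 skew the fact table needs.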
TRAFFIC PATTERN — Sinusoidal
Diurnal peak ~12:30, overnight -85%, weekend -40%
Why: API traffic follows business hours. Rejection sampling with sin(2π(h-6.5)/24) modulation creates realistic day/night cycles. Weekend dampening and overnight suppression (2am-5am) added separately.
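A minimal sketch of the rejection-sampling step. The acceptance weight here (0.5 + 0.5·sin) is an assumption — the text specifies only the sin modulation — and the weekend/overnight adjustments are omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def keep(hour: float) -> bool:
    """Accept a candidate timestamp with probability that peaks at
    h = 12.5 (12:30) and bottoms out overnight."""
    weight = 0.5 + 0.5 * np.sin(2 * np.pi * (hour - 6.5) / 24)
    return rng.random() < weight

hours = [h for h in rng.uniform(0, 24, size=50_000) if keep(h)]
```

Uniform candidates go in, and the accepted timestamps come out concentrated around midday — no inverse-CDF math required.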
HTTP STATUS — Weighted Categorical
200: 94%, 429: 3%, 500: 2%, 401: 1%
Why: Production APIs are dominated by successes. Rate limiting (429) is the most common error, followed by server errors and auth failures. Weighted random choice with fixed probabilities.
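The weighted draw is a single NumPy call:

```python
import numpy as np

rng = np.random.default_rng(42)
# Fixed outcome probabilities summing to 1.0, per the table above.
status = rng.choice([200, 429, 500, 401], size=100_000, p=[0.94, 0.03, 0.02, 0.01])
ok_rate = float((status == 200).mean())
```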
SAFETY FLAGS — Bernoulli
p=0.008 → ~0.8% of requests flagged
Why: Content safety triggers are rare binary events with a fixed probability per request — textbook Bernoulli trial.
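Which reduces to a comparison against a uniform draw:

```python
import numpy as np

rng = np.random.default_rng(42)
# One Bernoulli(p) trial per request: the flag fires when the uniform
# sample falls below p = 0.008.
flagged = rng.random(100_000) < 0.008
flag_rate = float(flagged.mean())
```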
USER REQUEST VOLUME — Power Law
Enterprise: 20× weight, Free: 1× weight
Why: A small fraction of enterprise users generate the majority of traffic. Tier-based weighting (Free:1, Starter:3, Pro:8, Enterprise:20) creates the expected skew in per-user request counts.
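A sketch of the tier-weighted sampling, assuming users are assigned tiers first and each request then picks a sender with probability proportional to tier weight (the user-pool size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
TIER_WEIGHTS = {"Free": 1, "Starter": 3, "Pro": 8, "Enterprise": 20}

users = rng.choice(list(TIER_WEIGHTS), size=1_000)        # tier per user
weights = np.array([TIER_WEIGHTS[t] for t in users], float)
senders = rng.choice(users.size, size=100_000, p=weights / weights.sum())

requests_per_user = np.bincount(senders, minlength=users.size)
ent_mean = float(requests_per_user[users == "Enterprise"].mean())
free_mean = float(requests_per_user[users == "Free"].mean())
```

The per-user request counts come out roughly 20:1 between Enterprise and Free, matching the tier weights.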
REPRODUCIBILITY
All distributions seeded with numpy.random.default_rng(42). Faker seeded separately.
Identical output on any platform: python generate_data.py --seed 42 --rows 500000
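The seeding guarantee can be spot-checked directly — NumPy's default_rng (PCG64) yields identical streams for the same seed:

```python
import numpy as np

# Two independent generators with the same seed produce identical draws.
a = np.random.default_rng(42).lognormal(6.5, 0.8, size=5)
b = np.random.default_rng(42).lognormal(6.5, 0.8, size=5)
same = bool((a == b).all())
```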
Every component is self-hosted and open-source. No runtime dependency on big-tech hosted services. All services are deployable on infrastructure you control via Docker Compose.
Fonts are self-hosted (no Google CDN). Analytics via Metabase OSS (no Google Analytics). Interactive apps via self-hosted Streamlit (not Streamlit Community Cloud).