DATA PIPELINE
> END-TO-END ARCHITECTURE // GENERATE → LOAD → TRANSFORM → VISUALIZE
GENERATE — Synthetic Telemetry
Python script generates 500K realistic LLM API requests using statistical distributions: log-normal latency, Pareto token counts, sinusoidal diurnal + weekly traffic patterns, weighted HTTP status codes, Bernoulli safety flags.
LOAD — PostgreSQL Bulk Ingest
COPY-based bulk loading into raw_staging schema. Creates read-only portfolio_reader role. Idempotent: DROP IF EXISTS + CREATE pattern for safe re-runs.
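A minimal sketch of the COPY ingest path, assuming a CSV file produced by the generator. The table name matches the raw_staging schema shown in the lineage diagram, but the column names here are illustrative; the driver call is shown in the docstring because it needs a live PostgreSQL connection.

```python
def copy_sql(table: str, columns: list[str]) -> str:
    """Build a COPY ... FROM STDIN statement for streaming a CSV file.

    With psycopg2 this would run as:
        with conn.cursor() as cur, open("api_requests.csv") as f:
            cur.copy_expert(copy_sql("raw_staging.api_requests", cols), f)
    """
    return (
        f"COPY {table} ({', '.join(columns)}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)"
    )

print(copy_sql("raw_staging.api_requests", ["request_id", "user_id", "latency_ms"]))
# → COPY raw_staging.api_requests (request_id, user_id, latency_ms) FROM STDIN WITH (FORMAT csv, HEADER true)
```

Streaming through COPY avoids per-row INSERT overhead, which is what makes a 500K-row load fast.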
TRANSFORM — dbt Core
14 dbt models across 3 layers: staging (views), star schema (tables), snowflake schema (tables). 143 tests covering not_null, unique, relationships, and accepted_values constraints. Surrogate keys via dbt_utils.generate_surrogate_key.
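For illustration, this is roughly what dbt_utils.generate_surrogate_key compiles down to, restated in Python: an md5 over the hyphen-joined, null-coalesced column values (the sentinel string is the macro's internal default).

```python
import hashlib

NULL_SENTINEL = "_dbt_utils_surrogate_key_null_"  # the macro's null placeholder

def surrogate_key(*values) -> str:
    """md5 of the '-'-joined, null-coalesced inputs, mirroring the key
    shape dbt_utils.generate_surrogate_key produces in SQL."""
    parts = [NULL_SENTINEL if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode()).hexdigest()

key = surrogate_key("user_42", None, "gpt-4")
```

Hashing rather than concatenating keeps dimension keys fixed-width and stable across re-runs, which the relationship tests depend on.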
VISUALIZE — Metabase + Streamlit
Self-hosted Metabase for executive dashboards with embedded iframes. Self-hosted Streamlit for the interactive Token Forecaster application. Both run as Docker containers with health checks.
traefik
Reverse proxy | TLS termination
Only entry point — ports 80/443
portfolio_db
PostgreSQL | internal network
Primary analytical database
metabase
Metabase OSS | public dashboard only
Executive dashboards & BI
streamlit
Streamlit | internal network
Token Forecaster application
frontend
Nginx (static) | Astro build
Portfolio site — static HTML/CSS/JS
sonarqube
SonarQube CE | dev-only
Static analysis & quality gates
services:
  traefik:
    image: traefik:v3.4
    ports: ["80:80", "443:443"]  # Only exposed ports
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
  portfolio_db:
    image: postgres:16
    networks: [internal]  # No host port — internal only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $PORTFOLIO_DB_USER"]
  metabase:
    labels:
      # Only /public/ paths routed — admin blocked
      - "traefik.http.routers.metabase-public.rule=PathPrefix(`/metabase/public`)"
    networks: [internal, web]
  streamlit:
    labels:
      - "traefik.http.routers.streamlit.rule=PathPrefix(`/forecaster`)"
    networks: [internal, web]

flowchart LR
subgraph SRC["raw_staging"]
api_requests["api_requests"]
users["users"]
models["models"]
endpoints["endpoints"]
dates["dates"]
end
subgraph STG["staging (views)"]
stg_api["stg_api_requests"]
stg_users["stg_users"]
stg_models["stg_models"]
stg_endpoints["stg_endpoints"]
stg_dates["stg_dates"]
end
subgraph STAR["public_star"]
dim_users["dim_users"]
dim_models_s["dim_models"]
dim_date_s["dim_date"]
dim_endpoints_s["dim_endpoints"]
fct_star["fct_api_requests"]
end
subgraph SNOW["public_snowflake"]
dim_companies["dim_companies"]
dim_billing["dim_billing_tiers"]
dim_users_sf["dim_users_sf"]
dim_models_sf["dim_models"]
dim_date_sf["dim_date"]
dim_endpoints_sf["dim_endpoints"]
fct_snow["fct_api_requests_sf"]
end
api_requests --> stg_api
users --> stg_users
models --> stg_models
endpoints --> stg_endpoints
dates --> stg_dates
stg_users --> dim_users
stg_models --> dim_models_s
stg_dates --> dim_date_s
stg_endpoints --> dim_endpoints_s
stg_api --> fct_star
dim_users --> fct_star
dim_models_s --> fct_star
dim_date_s --> fct_star
dim_endpoints_s --> fct_star
stg_users --> dim_companies
stg_users --> dim_users_sf
dim_companies --> dim_users_sf
dim_billing --> dim_users_sf
stg_models --> dim_models_sf
stg_dates --> dim_date_sf
stg_endpoints --> dim_endpoints_sf
stg_api --> fct_snow
dim_users_sf --> fct_snow
dim_models_sf --> fct_snow
dim_date_sf --> fct_snow
dim_endpoints_sf --> fct_snow

The synthetic dataset is not random noise — each column uses a domain-appropriate statistical distribution that mirrors real-world LLM API telemetry patterns. This matters because realistic data exposes real query planner behaviors and index selectivity.
LATENCY — Log-Normal
μ=6.5, σ=0.8 → median ~665ms, p99 ~4,300ms
Why: API latencies cluster around a median but exhibit heavy right tails from cold starts, queue contention, and large context windows. Log-normal captures this naturally — multiplicative noise on a base latency.
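Those numbers can be spot-checked in a few lines of NumPy (sample size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
latency_ms = rng.lognormal(mean=6.5, sigma=0.8, size=100_000)

median = float(np.median(latency_ms))       # theoretical median: e^6.5 ≈ 665 ms
p99 = float(np.percentile(latency_ms, 99))  # heavy right tail, several seconds
```

The log-normal's multiplicative structure means the tail stretches far past the median without any extra tuning.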
TOKEN COUNTS — Pareto
α=1.5, scale=50 → most requests small, heavy tail to 100K
Why: Token usage follows a power law — most calls use short prompts, but a small fraction of enterprise users send massive context windows. Pareto distribution models this 80/20 pattern.
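A sketch of the token draw; note that NumPy's pareto() samples the Lomax form, so shifting by 1 and scaling gives a classic Pareto with a floor equal to the scale parameter.

```python
import numpy as np

rng = np.random.default_rng(42)
# Shift-and-scale the Lomax draw into a Pareto I with floor = 50 tokens,
# then clip the tail at 100K as described above.
tokens = np.minimum((rng.pareto(1.5, size=100_000) + 1) * 50, 100_000)
```

With α=1.5 the theoretical median is 50·2^(1/1.5) ≈ 79 tokens, yet the clipped maximum still reaches the 100K cap — exactly the 80/20 skew the fact table needs.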
TRAFFIC PATTERN — Sinusoidal
Diurnal peak ~12:30, overnight -85%, weekend -40%
Why: API traffic follows business hours. Rejection sampling with sin(2π(h-6.5)/24) modulation creates realistic day/night cycles. Weekend dampening and overnight suppression (2am-5am) added separately.
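A minimal sketch of the rejection-sampling step. The acceptance weight here (0.5 + 0.5·sin) is an assumption — the text specifies only the sin modulation — and the weekend/overnight adjustments are omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def keep(hour: float) -> bool:
    """Accept a candidate timestamp with probability that peaks at
    h = 12.5 (12:30) and bottoms out overnight."""
    weight = 0.5 + 0.5 * np.sin(2 * np.pi * (hour - 6.5) / 24)
    return rng.random() < weight

hours = [h for h in rng.uniform(0, 24, size=50_000) if keep(h)]
```

Uniform candidates go in, and the accepted timestamps come out concentrated around midday — no inverse-CDF math required.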
HTTP STATUS — Weighted Categorical
200: 94%, 429: 3%, 500: 2%, 401: 1%
Why: Production APIs are dominated by successes. Rate limiting (429) is the most common error, followed by server errors and auth failures. Weighted random choice with fixed probabilities.
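The weighted draw is a single NumPy call:

```python
import numpy as np

rng = np.random.default_rng(42)
# Fixed outcome probabilities summing to 1.0, per the table above.
status = rng.choice([200, 429, 500, 401], size=100_000, p=[0.94, 0.03, 0.02, 0.01])
ok_rate = float((status == 200).mean())
```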
SAFETY FLAGS — Bernoulli
p=0.008 → ~0.8% of requests flagged
Why: Content safety triggers are rare binary events with a fixed probability per request — textbook Bernoulli trial.
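Which reduces to a comparison against a uniform draw:

```python
import numpy as np

rng = np.random.default_rng(42)
# One Bernoulli(p) trial per request: the flag fires when the uniform
# sample falls below p = 0.008.
flagged = rng.random(100_000) < 0.008
flag_rate = float(flagged.mean())
```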
USER REQUEST VOLUME — Power Law
Enterprise: 20× weight, Free: 1× weight
Why: A small fraction of enterprise users generate the majority of traffic. Tier-based weighting (Free:1, Starter:3, Pro:8, Enterprise:20) creates the expected skew in per-user request counts.
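A sketch of the tier-weighted sampling, assuming users are assigned tiers first and each request then picks a sender with probability proportional to tier weight (the user-pool size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
TIER_WEIGHTS = {"Free": 1, "Starter": 3, "Pro": 8, "Enterprise": 20}

users = rng.choice(list(TIER_WEIGHTS), size=1_000)        # tier per user
weights = np.array([TIER_WEIGHTS[t] for t in users], float)
senders = rng.choice(users.size, size=100_000, p=weights / weights.sum())

requests_per_user = np.bincount(senders, minlength=users.size)
ent_mean = float(requests_per_user[users == "Enterprise"].mean())
free_mean = float(requests_per_user[users == "Free"].mean())
```

The per-user request counts come out roughly 20:1 between Enterprise and Free, matching the tier weights.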
REPRODUCIBILITY
All distributions seeded with numpy.random.default_rng(42). Faker seeded separately.
Identical output on any platform: python generate_data.py --seed 42 --rows 500000
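The seeding guarantee can be spot-checked directly — NumPy's default_rng (PCG64) yields identical streams for the same seed:

```python
import numpy as np

# Two independent generators with the same seed produce identical draws.
a = np.random.default_rng(42).lognormal(6.5, 0.8, size=5)
b = np.random.default_rng(42).lognormal(6.5, 0.8, size=5)
same = bool((a == b).all())
```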
Every component is self-hosted and open-source. No runtime dependency on big-tech hosted services. All services are deployable on infrastructure you control via Docker Compose.
Fonts are self-hosted (no Google CDN). Analytics via Metabase OSS (no Google Analytics). Interactive apps via self-hosted Streamlit (not Streamlit Community Cloud).