DATA MODELS

> KIMBALL DIMENSIONAL MODELING // STAR & SNOWFLAKE SCHEMAS

MODELING APPROACH

This project implements two parallel schema designs over the same source data — demonstrating both the standard Kimball star schema and its normalized snowflake variant side-by-side.

The source data simulates 500K LLM API requests across 5,000 users, 5 models, and 5 endpoints over a 91-day window. Statistical distributions mirror real-world telemetry: log-normal latency, Pareto token counts, sinusoidal diurnal traffic patterns.
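The generator itself is not part of this section, but the distributions named above can be sketched in a few lines of stdlib Python (a hypothetical, simplified generator; the function name and parameter values are illustrative, not the project's actual code):

```python
import math
import random

random.seed(42)

def simulate_request(hour_of_day: int) -> dict:
    """Draw one synthetic API request from the distributions described above."""
    # Log-normal latency: most requests are fast, with a long tail of slow ones.
    latency_ms = random.lognormvariate(mu=5.5, sigma=0.6)  # median ~245 ms
    # Pareto token counts: a few requests consume most of the tokens.
    tokens_total = int(20 * random.paretovariate(1.5))
    # Sinusoidal diurnal weight: traffic peaks mid-day, troughs overnight.
    diurnal_weight = 1 + 0.5 * math.sin(2 * math.pi * (hour_of_day - 6) / 24)
    return {
        "latency_ms": round(latency_ms, 1),
        "tokens_total": tokens_total,
        "diurnal_weight": round(diurnal_weight, 3),
    }

requests = [simulate_request(h % 24) for h in range(1000)]
```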

STAR SCHEMA

SCHEMA: public_star | MODELS: 5 (1 fact + 4 dimensions)

The star schema is the default choice for analytical workloads. One central fact table surrounded by denormalized dimension tables — fewer joins, faster queries.

fct_api_requests

Grain: one row per API request

500K rows | Surrogate FK to all dims

Measures: latency_ms, tokens_*, cost_usd

dim_users — 5K rows, denormalized
dim_models — 5 rows, LLM model metadata
dim_date — 91 rows, calendar dimension
dim_endpoints — 5 rows, API endpoint metadata
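The one-join access path this layout buys can be illustrated with a toy in-memory sqlite3 sketch (the rows are invented; the real project runs on a warehouse, not SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_users (user_key TEXT PRIMARY KEY, billing_tier TEXT, industry TEXT);
    CREATE TABLE fct_api_requests (request_id TEXT, user_key TEXT, cost_usd REAL);
    INSERT INTO dim_users VALUES ('u1', 'Pro', 'Fintech'), ('u2', 'Free', 'Gaming');
    INSERT INTO fct_api_requests VALUES
        ('r1', 'u1', 0.04), ('r2', 'u1', 0.02), ('r3', 'u2', 0.01);
""")

# "Cost by industry" needs exactly one join in the star schema:
# industry lives directly on the denormalized user dimension.
rows = con.execute("""
    SELECT u.industry, ROUND(SUM(f.cost_usd), 2) AS total_cost
    FROM fct_api_requests f
    JOIN dim_users u USING (user_key)
    GROUP BY u.industry
    ORDER BY total_cost DESC
""").fetchall()
print(rows)  # [('Fintech', 0.06), ('Gaming', 0.01)]
```
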

SNOWFLAKE SCHEMA

SCHEMA: public_snowflake | MODELS: 7 (1 fact + 6 dimensions)

The snowflake schema normalizes the user dimension into separate company and billing tier tables. This reduces redundancy at the cost of additional joins — justified when dimension tables are large or frequently updated.

fct_api_requests_sf

Same grain, refs normalized dims

dim_users_sf

FK → dim_companies, dim_billing_tiers

dim_companies

Extracted from users via mode()

dim_billing_tiers

Hardcoded reference: Free/Starter/Pro/Enterprise

dim_models

Shared with star schema

dim_date / dim_endpoints

Shared with star schema
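The same "cost by industry" question in the snowflake layout has to walk the normalized chain (again a toy sqlite3 sketch with invented rows, paralleling the star example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_companies (company_key TEXT PRIMARY KEY, industry TEXT);
    CREATE TABLE dim_users_sf (user_key TEXT PRIMARY KEY, company_key TEXT);
    CREATE TABLE fct_api_requests_sf (request_id TEXT, user_key TEXT, cost_usd REAL);
    INSERT INTO dim_companies VALUES ('c1', 'Fintech'), ('c2', 'Gaming');
    INSERT INTO dim_users_sf VALUES ('u1', 'c1'), ('u2', 'c2');
    INSERT INTO fct_api_requests_sf VALUES
        ('r1', 'u1', 0.04), ('r2', 'u1', 0.02), ('r3', 'u2', 0.01);
""")

# Industry now lives two hops away: fact -> users -> companies.
rows = con.execute("""
    SELECT c.industry, ROUND(SUM(f.cost_usd), 2) AS total_cost
    FROM fct_api_requests_sf f
    JOIN dim_users_sf u USING (user_key)
    JOIN dim_companies c USING (company_key)
    GROUP BY c.industry
    ORDER BY total_cost DESC
""").fetchall()
print(rows)  # [('Fintech', 0.06), ('Gaming', 0.01)]
```

Same answer, one more join: that extra hop is exactly what the EXPLAIN ANALYZE comparison below measures.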

ERD — STAR SCHEMA
erDiagram
    dim_users ||--o{ fct_api_requests : "user_key"
    dim_models ||--o{ fct_api_requests : "model_key"
    dim_date ||--o{ fct_api_requests : "date_key"
    dim_endpoints ||--o{ fct_api_requests : "endpoint_key"

    fct_api_requests {
        string request_id PK
        string user_key FK
        string model_key FK
        int date_key FK
        string endpoint_key FK
        numeric latency_ms
        int tokens_input
        int tokens_output
        int tokens_total
        numeric cost_usd
        int http_status
        boolean safety_flag
        timestamp request_timestamp
    }

    dim_users {
        string user_key PK
        string user_id UK
        string billing_tier
        string company_name
        string industry
        date signup_date
        string region
    }

    dim_models {
        string model_key PK
        string model_id UK
        string model_family
        numeric cost_per_input_token
        numeric cost_per_output_token
        int context_window_k
        boolean is_active
    }

    dim_date {
        int date_key PK
        date full_date
        string day_of_week
        boolean is_weekend
        int week_of_year
        string month_name
        string quarter
        int year
    }

    dim_endpoints {
        string endpoint_key PK
        string endpoint_path UK
        string api_version
        boolean is_deprecated
    }

ERD — SNOWFLAKE SCHEMA
erDiagram
    dim_users_sf ||--o{ fct_api_requests_sf : "user_key"
    dim_models ||--o{ fct_api_requests_sf : "model_key"
    dim_date ||--o{ fct_api_requests_sf : "date_key"
    dim_endpoints ||--o{ fct_api_requests_sf : "endpoint_key"
    dim_companies ||--o{ dim_users_sf : "company_key"
    dim_billing_tiers ||--o{ dim_users_sf : "billing_tier_key"

    fct_api_requests_sf {
        string request_id PK
        string user_key FK
        string model_key FK
        int date_key FK
        string endpoint_key FK
        numeric latency_ms
        int tokens_input
        int tokens_output
        int tokens_total
        numeric cost_usd
        int http_status
        boolean safety_flag
        timestamp request_timestamp
    }

    dim_users_sf {
        string user_key PK
        string user_id UK
        string company_key FK
        string billing_tier_key FK
        date signup_date
        string region
    }

    dim_companies {
        string company_key PK
        string company_name UK
        string industry
        string company_size_band
        string hq_country
    }

    dim_billing_tiers {
        string billing_tier_key PK
        string tier_name UK
        numeric monthly_price_usd
        int rate_limit_rpm
        boolean has_priority_access
    }

    dim_models {
        string model_key PK
        string model_id UK
        string model_family
        numeric cost_per_input_token
        numeric cost_per_output_token
        int context_window_k
        boolean is_active
    }

    dim_date {
        int date_key PK
        date full_date
        string day_of_week
        boolean is_weekend
        int week_of_year
        string month_name
        string quarter
        int year
    }

    dim_endpoints {
        string endpoint_key PK
        string endpoint_path UK
        string api_version
        boolean is_deprecated
    }

dbt — dim_companies.sql
with users as (
    select * from {{ ref('stg_users') }}
),

companies as (
    select
        company_name,
        -- Pick the most common industry per company to avoid fan-out
        mode() within group (order by industry) as industry
    from users
    group by company_name
)

select
    {{ dbt_utils.generate_surrogate_key(['company_name']) }} as company_key,
    company_name,
    industry,
    case
        when length(company_name) < 15 then 'Small'
        when length(company_name) < 25 then 'Medium'
        else 'Large'
    end as company_size_band,
    'US' as hq_country
from companies
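
For readers unfamiliar with `mode() within group` or `dbt_utils.generate_surrogate_key`, the same logic can be approximated in plain Python. This is an approximation: `generate_surrogate_key` md5-hashes the coalesced, '-'-joined column values (with null sentinels this sketch omits), and `Counter.most_common` breaks ties by insertion order rather than Postgres's ordering. The sample rows are invented.

```python
import hashlib
from collections import Counter

users = [
    {"company_name": "Acme", "industry": "Fintech"},
    {"company_name": "Acme", "industry": "Fintech"},
    {"company_name": "Acme", "industry": "Gaming"},
    {"company_name": "Globex", "industry": "Retail"},
]

def surrogate_key(*values: str) -> str:
    """Roughly what dbt_utils.generate_surrogate_key emits: md5 over joined values."""
    return hashlib.md5("-".join(values).encode()).hexdigest()

# mode() within group: keep the single most common industry per company,
# so the join back from users never fans out.
tallies = {}
for u in users:
    tallies.setdefault(u["company_name"], Counter())[u["industry"]] += 1

dim_companies = [
    {
        "company_key": surrogate_key(name),
        "company_name": name,
        "industry": counts.most_common(1)[0][0],
    }
    for name, counts in tallies.items()
]
print([(c["company_name"], c["industry"]) for c in dim_companies])
# [('Acme', 'Fintech'), ('Globex', 'Retail')]
```
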

TEST COVERAGE

Total dbt tests: 143, all passing
Test types: not_null, unique, relationships, accepted_values
Schemas tested: staging, star, snowflake

QUERY PERFORMANCE — EXPLAIN ANALYZE

The schema tradeoff is measurable. Below are real EXPLAIN ANALYZE results: first a baseline star-schema aggregate (cost by model), then the same business question, "total cost by industry," answered against both the normalized snowflake schema and the denormalized star schema.

STAR: Cost by Model (1 join) 197ms
HashAggregate  (rows=5)
  └─ Hash Join  (cost=0.16..19544)     ← O(1) hash probe per row
       └─ Gather Merge
            └─ Parallel Seq Scan       ← O(n) scan, n=500K
                 on fct_api_requests     Workers: 2
Complexity: O(n) sequential scan + O(1) hash probe × n rows. Parallelized across 2 workers → effective O(n/p).

SNOWFLAKE: Cost by Industry (3 joins) 206ms
HashAggregate  (rows=10)
  └─ Hash Join  (cost=427..25411)      ← join #3: users → companies
       └─ Hash Join                    ← join #2: fact → users
            └─ Hash Join               ← join #1: fact scan
                 └─ Seq Scan             on fct_api_requests_sf
                      (rows=500K)        O(n) full scan, no parallelism
Complexity: O(n) scan + O(k) hash build (users) + O(m) hash build (companies). 3 hash tables in memory. No parallelism available.

STAR: Cost by Industry (1 join, denormalized) 51ms ← 4× faster
HashAggregate  (rows=10)
  └─ Hash Join  (cost=233..19760)      ← single join: fact → dim_users
       └─ Gather Merge
            └─ Parallel Seq Scan       ← O(n) scan, n=500K
                 on fct_api_requests     Workers: 2, parallelized
Complexity: O(n) scan + single O(1) hash probe. Industry denormalized into dim_users — no chained joins.
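
The hash-join mechanics behind these plans are easy to sketch: build a dict over the small dimension once, then probe it once per fact row; each extra snowflake hop adds another dict build and another probe. This is a didactic stand-in with toy rows, not what Postgres actually executes:

```python
# Toy rows standing in for the 500K-row fact table and its dimensions.
fact = [("r1", "u1", 0.04), ("r2", "u1", 0.02), ("r3", "u2", 0.01)]
users_star = {"u1": "Fintech", "u2": "Gaming"}  # star: industry on dim_users
users_sf = {"u1": "c1", "u2": "c2"}             # snowflake: FK to companies
companies = {"c1": "Fintech", "c2": "Gaming"}

def cost_by_industry_star(fact):
    # One O(k) hash build (the dict above) + one O(1) probe per fact row.
    totals = {}
    for _, user_key, cost in fact:
        industry = users_star[user_key]           # single probe
        totals[industry] = totals.get(industry, 0) + cost
    return totals

def cost_by_industry_snowflake(fact):
    # Two hash tables resident in memory; two chained probes per fact row.
    totals = {}
    for _, user_key, cost in fact:
        industry = companies[users_sf[user_key]]  # probe users, then companies
        totals[industry] = totals.get(industry, 0) + cost
    return totals

assert cost_by_industry_star(fact) == cost_by_industry_snowflake(fact)
```

Both paths return the same totals; the snowflake path just touches one more hash table per row, which is where the extra ~150ms goes at 500K rows.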

TAKEAWAY

Same business question — "total cost by industry" — answered in 51ms (star, 1 join) vs 206ms (snowflake, 3 joins). The denormalized star schema is 4× faster because:

  • Fewer hash table builds (1 vs 3) — reduces memory pressure
  • Parallel seq scan enabled — O(n/p) vs O(n)
  • No join cascading — each additional join adds O(k) hash build overhead

This is the Kimball tradeoff: storage redundancy buys query speed. The snowflake schema earns its cost when dimension tables are large, frequently updated, or when write-path consistency matters more than read performance.

GENERATED DOCUMENTATION

Full dbt-generated documentation including lineage graph, column-level descriptions, and test results.

[OPEN DBT DOCS]