DATA MODELS
> KIMBALL DIMENSIONAL MODELING // STAR & SNOWFLAKE SCHEMAS
This project implements two parallel schema designs over the same source data, demonstrating the standard Kimball star schema and its normalized snowflake variant side by side.
The source data simulates 500K LLM API requests across 5,000 users, 5 models, and 5 endpoints over a 91-day window. Statistical distributions mirror real-world telemetry: log-normal latency, Pareto token counts, sinusoidal diurnal traffic patterns.
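The generator itself isn't reproduced here, but the three distributions above can be sketched in a few lines of numpy (a hypothetical snippet, not the project's actual generator; parameter values and names like `n_requests` are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_requests = 500_000

# Log-normal latency: most requests are fast, with a long slow tail
latency_ms = rng.lognormal(mean=5.0, sigma=0.6, size=n_requests)

# Pareto token counts: most prompts are small, a few are huge
tokens_input = (rng.pareto(a=2.5, size=n_requests) + 1) * 200

# Sinusoidal diurnal pattern: traffic weight oscillates over the day
hour = rng.uniform(0, 24, size=n_requests)
diurnal_weight = 1 + 0.5 * np.sin(2 * np.pi * (hour - 9) / 24)
```

Sampling a weight per request (rather than bucketing by hour) keeps the sketch vectorized and dependency-free beyond numpy.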
SCHEMA: public_star | MODELS: 5 (1 fact + 4 dimensions)
The star schema is the default choice for analytical workloads: one central fact table surrounded by denormalized dimension tables, which means fewer joins and faster queries.
fct_api_requests
Grain: one row per API request
500K rows | Surrogate FK to all dims
Measures: latency_ms, tokens_*, cost_usd
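Because `industry` lives directly on `dim_users`, a question like "total cost by industry" is a single join in the star schema. A minimal sketch against an in-memory SQLite database (toy data, illustrative only, not the project's warehouse):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table dim_users (user_key text primary key, industry text);
    create table fct_api_requests (user_key text, cost_usd real);
    insert into dim_users values ('u1', 'Fintech'), ('u2', 'Gaming');
    insert into fct_api_requests values ('u1', 0.10), ('u1', 0.05), ('u2', 0.20);
""")

# One join: the fact table reaches industry directly through dim_users
rows = con.execute("""
    select u.industry, round(sum(f.cost_usd), 2) as total_cost
    from fct_api_requests f
    join dim_users u using (user_key)
    group by u.industry
    order by total_cost desc
""").fetchall()
# rows == [('Gaming', 0.2), ('Fintech', 0.15)]
```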
SCHEMA: public_snowflake | MODELS: 7 (1 fact + 6 dimensions)
The snowflake schema normalizes the user dimension into separate company and billing tier tables. This reduces redundancy at the cost of additional joins, a tradeoff justified when dimension tables are large or frequently updated.
fct_api_requests_sf
Same grain, refs normalized dims
dim_users_sf
FK → dim_companies, dim_billing_tiers
dim_companies
Extracted from users via mode()
dim_billing_tiers
Hardcoded reference: Free/Starter/Pro/Enterprise
dim_models
Shared with star schema
dim_date / dim_endpoints
Shared with star schema
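On the snowflake side the same question needs an extra hop, because `industry` now lives on `dim_companies`. The same toy SQLite sketch, extended with the normalized chain (again illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table dim_companies (company_key text primary key, industry text);
    create table dim_users_sf (user_key text primary key, company_key text);
    create table fct_api_requests_sf (user_key text, cost_usd real);
    insert into dim_companies values ('c1', 'Fintech');
    insert into dim_users_sf values ('u1', 'c1');
    insert into fct_api_requests_sf values ('u1', 0.10), ('u1', 0.05);
""")

# Three tables in the join path: fact -> users -> companies
rows = con.execute("""
    select c.industry, round(sum(f.cost_usd), 2) as total_cost
    from fct_api_requests_sf f
    join dim_users_sf u using (user_key)
    join dim_companies c using (company_key)
    group by c.industry
""").fetchall()
# rows == [('Fintech', 0.15)]
```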
erDiagram
fct_api_requests ||--o{ dim_users : "user_key"
fct_api_requests ||--o{ dim_models : "model_key"
fct_api_requests ||--o{ dim_date : "date_key"
fct_api_requests ||--o{ dim_endpoints : "endpoint_key"
fct_api_requests {
string request_id PK
string user_key FK
string model_key FK
int date_key FK
string endpoint_key FK
numeric latency_ms
int tokens_input
int tokens_output
int tokens_total
numeric cost_usd
int http_status
boolean safety_flag
timestamp request_timestamp
}
dim_users {
string user_key PK
string user_id UK
string billing_tier
string company_name
string industry
date signup_date
string region
}
dim_models {
string model_key PK
string model_id UK
string model_family
numeric cost_per_input_token
numeric cost_per_output_token
int context_window_k
boolean is_active
}
dim_date {
int date_key PK
date full_date
string day_of_week
boolean is_weekend
int week_of_year
string month_name
string quarter
int year
}
dim_endpoints {
string endpoint_key PK
string endpoint_path UK
string api_version
boolean is_deprecated
}

erDiagram
fct_api_requests_sf ||--o{ dim_users_sf : "user_key"
fct_api_requests_sf ||--o{ dim_models : "model_key"
fct_api_requests_sf ||--o{ dim_date : "date_key"
fct_api_requests_sf ||--o{ dim_endpoints : "endpoint_key"
dim_users_sf ||--o{ dim_companies : "company_key"
dim_users_sf ||--o{ dim_billing_tiers : "billing_tier_key"
fct_api_requests_sf {
string request_id PK
string user_key FK
string model_key FK
int date_key FK
string endpoint_key FK
numeric latency_ms
int tokens_input
int tokens_output
int tokens_total
numeric cost_usd
int http_status
boolean safety_flag
timestamp request_timestamp
}
dim_users_sf {
string user_key PK
string user_id UK
string company_key FK
string billing_tier_key FK
date signup_date
string region
}
dim_companies {
string company_key PK
string company_name UK
string industry
string company_size_band
string hq_country
}
dim_billing_tiers {
string billing_tier_key PK
string tier_name UK
numeric monthly_price_usd
int rate_limit_rpm
boolean has_priority_access
}
dim_models {
string model_key PK
string model_id UK
string model_family
numeric cost_per_input_token
numeric cost_per_output_token
int context_window_k
boolean is_active
}
dim_date {
int date_key PK
date full_date
string day_of_week
boolean is_weekend
int week_of_year
string month_name
string quarter
int year
}
dim_endpoints {
string endpoint_key PK
string endpoint_path UK
string api_version
boolean is_deprecated
}

The dim_companies model derives company attributes from the user staging table:

with users as (
    select * from {{ ref('stg_users') }}
),

companies as (
    select
        company_name,
        -- Pick the most common industry per company to avoid fan-out
        mode() within group (order by industry) as industry
    from users
    group by company_name
)

select
    {{ dbt_utils.generate_surrogate_key(['company_name']) }} as company_key,
    company_name,
    industry,
    case
        when length(company_name) < 15 then 'Small'
        when length(company_name) < 25 then 'Medium'
        else 'Large'
    end as company_size_band,
    'US' as hq_country
from companies
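Under the hood, `dbt_utils.generate_surrogate_key` compiles to an md5 hash over the coalesced, concatenated input columns. The same idea in plain Python (a simplified sketch; the real macro also handles null sentinels and a configurable separator):

```python
import hashlib

def surrogate_key(*fields, sep="-"):
    """Deterministic md5 key over stringified fields (mirrors the macro's idea)."""
    raw = sep.join("" if f is None else str(f) for f in fields)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Same input always yields the same key, so reruns are idempotent
company_key = surrogate_key("Acme Robotics")
```

The point of hashing rather than using a sequence is idempotency: rebuilding the dimension from scratch produces identical keys, so the fact table never needs re-keying.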
The schema tradeoff is measurable. Below are real EXPLAIN ANALYZE
results comparing the star schema (denormalized) against the
snowflake schema (normalized) for the same business question:
"Total cost by industry."
Star schema, parallel plan:

HashAggregate (rows=5)
└─ Hash Join (cost=0.16..19544)        ← O(1) hash probe per row
   └─ Gather Merge
      └─ Parallel Seq Scan             ← O(n) scan, n=500K
         on fct_api_requests              Workers: 2

Snowflake schema, three-join cascade:

HashAggregate (rows=10)
└─ Hash Join (cost=427..25411)         ← join #3: users → companies
   └─ Hash Join                        ← join #2: fact → users
      └─ Hash Join                     ← join #1: fact scan
         └─ Seq Scan on fct_api_requests_sf
            (rows=500K)                   O(n) full scan, no parallelism

Star schema, single join to dim_users:

HashAggregate (rows=10)
└─ Hash Join (cost=233..19760)         ← single join: fact → dim_users
   └─ Gather Merge
      └─ Parallel Seq Scan             ← O(n) scan, n=500K
         on fct_api_requests              Workers: 2, parallelized

TAKEAWAY
Same business question, "total cost by industry", answered in 51ms (star, 1 join) vs 206ms (snowflake, 3 joins). The denormalized star schema is roughly 4× faster because:
- Fewer hash table builds (1 vs 3) reduce memory pressure
- Parallel seq scan is enabled: O(n/p) with p workers vs O(n)
- No join cascading: each additional join adds O(k) hash-build overhead
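You can reproduce the shape of this comparison (though not the Postgres timings or parallelism) with any engine's plan output. A toy sketch using SQLite's EXPLAIN QUERY PLAN, with assumed table names, just to show the extra plan node the snowflake join path introduces:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table fct (user_key text, cost_usd real);
    create table users (user_key text primary key, company_key text, industry text);
    create table companies (company_key text primary key, industry text);
""")

star = """select u.industry, sum(f.cost_usd) from fct f
          join users u using (user_key) group by u.industry"""
snow = """select c.industry, sum(f.cost_usd) from fct f
          join users u using (user_key)
          join companies c using (company_key) group by c.industry"""

star_steps = con.execute("explain query plan " + star).fetchall()
snow_steps = con.execute("explain query plan " + snow).fetchall()
# The snowflake query touches one more table, so its plan has more nodes
```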
This is the Kimball tradeoff: storage redundancy buys query speed. The snowflake schema earns its cost when dimension tables are large, frequently updated, or when write-path consistency matters more than read performance.
Full dbt-generated documentation is available, including the lineage graph, column-level descriptions, and test results.
[OPEN DBT DOCS]