Data Warehousing
Updated June 8, 2026Your production Postgres database handles thousands of transactions per second. It's optimized for fast inserts, point lookups, and small updates. Now the CEO wants a report: total revenue by product category over the last five years, broken down by customer region. That query will scan hundreds of millions of rows, do multiple table joins, and run for minutes. It will starve the connection pool and slow down the application for everyone.
You need a separate system built for analytical queries. That system is a data warehouse.
OLTP vs. OLAP
The distinction comes down to two fundamentally different query patterns.
OLTP (Online Transaction Processing): many small, fast operations on individual records. An e-commerce checkout, a bank transfer, a user profile update. Row-oriented storage makes this fast. You insert a row and read a row.
OLAP (Online Analytical Processing): a few large queries that scan millions of rows and compute aggregates. "What was our revenue per region last quarter?" You don't need the whole row. You need one or two columns across the entire table.
This is why warehouses use columnar storage. Data is stored column by column, not row by row. A query for SUM(revenue) reads only the revenue column off disk, skipping every other field. For wide tables with 50+ columns, this can reduce I/O by 95% compared to row-based storage.
Columnar formats (Parquet is the most common) also compress well. Values in the same column tend to be similar. A column of country codes compresses to almost nothing with dictionary encoding.
The Star Schema
Data warehouses organize data into a star schema: a central fact table surrounded by dimension tables.
A fact table contains measurements and foreign keys. Each row is an event or transaction. For an e-commerce warehouse:
- Fact table:
orders— one row per order, withorder_id,customer_id,product_id,date_id,revenue,quantity
Dimension tables contain descriptive attributes you join to:
dim_customers— customer name, region, segment, acquisition channeldim_products— product name, category, brand, costdim_dates— full date, day of week, month, quarter, year, is_holiday
With this structure, the query "revenue by product category last quarter" is:
SELECT p.category, SUM(o.revenue)
FROM orders o
JOIN dim_products p ON o.product_id = p.product_id
JOIN dim_dates d ON o.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category;The star schema keeps facts lean (small rows, fast scans) and puts descriptive context in dimensions (large rows, small table, easy to join).
Slowly Changing Dimensions
A dimension value can change over time. A customer moves from New York to Chicago. In a star schema, you have choices:
Type 1 (Overwrite): update the current value. Simple, but you lose history. All historical orders now show Chicago.
Type 2 (Add a row): add a new row to dim_customers with the new address and effective dates. Historical orders still point to the old row (New York). New orders point to the new row (Chicago). History is preserved. Most warehouses use Type 2 for dimensions where history matters.
Modern Cloud Warehouses
Physical warehouses (Teradata, Netezza) required expensive hardware you owned. Modern cloud warehouses eliminated that:
Snowflake: separates storage (S3) from compute (virtual warehouses). You pay for storage at S3 rates and only pay for compute when queries run. Multiple compute clusters can share the same data. Auto-suspend idle clusters automatically. Became the dominant cloud warehouse by 2022.
Google BigQuery: serverless. No clusters to manage. You pay per byte scanned (or flat-rate). Google handles all scaling. Integrates natively with the Google Cloud ecosystem.
Amazon Redshift: the original cloud data warehouse (2012). Node-based clusters with reserved capacity. Redshift Serverless now provides an auto-scaling option. Strong integration with AWS services (S3, Glue, Kinesis).
All three use columnar storage, support standard SQL, and integrate with BI tools like Tableau, Looker, and Metabase.
Cloud warehouse architecture: BI tools query the compute layer, which scans columnar storage and caches results. ETL/dbt pipelines load transformed data from source databases.
Result Caching
Warehouse queries are expensive to run but often identical. Snowflake and BigQuery both cache query results. If you run the same query twice within 24 hours and the underlying data hasn't changed, the second run returns the cached result instantly at no compute cost.
This matters for dashboards. A Tableau dashboard that 20 people open in the morning runs the same 5 queries. Without caching, that's 100 expensive scans. With caching, 20 users each see instant results after the first person loads the dashboard.
When Not to Use a Warehouse
A warehouse is the wrong tool when:
- You need sub-millisecond latency on individual record lookups (use a cache or OLTP database)
- Your data is unstructured (images, audio, raw JSON blobs) — use a data lake
- You need real-time aggregations with sub-second freshness (use a stream processor like Flink)
Warehouses are optimized for periodic analytical queries over historical, structured data. For everything else, there's a more appropriate tool.
Summary
A data warehouse stores structured, historical data in columnar format and is optimized for large analytical queries. OLAP workloads scan millions of rows to compute aggregates; OLTP workloads read and write individual records. Star schema organizes data into fact tables (measurements, foreign keys) and dimension tables (descriptive attributes). Modern cloud warehouses like Snowflake, BigQuery, and Redshift separate storage from compute, support result caching, and scale independently of each other. Data flows from operational systems through ETL or ELT pipelines into the warehouse, where BI tools and data scientists run SQL queries for reporting and analysis.
How helpful was this content?
Comments
Sign in to join the discussion
Saved on this device only
Sign in to sync progress across devices