Data Lakehouse

Updated June 8, 2026
Magic Magnets Team
8 min read

For years, companies ran two separate data systems in parallel. A data lake on cheap object storage held raw, unstructured data for data scientists and ML pipelines. A data warehouse held clean, structured data for analysts running SQL dashboards. These were separate systems with different tooling, different teams, and different copies of the data.

The operational cost was real. ETL pipelines moved data from the lake to the warehouse. When the lake's schema changed, the warehouse pipeline broke. Storage was duplicated across both systems. A data scientist training a model and an analyst building a dashboard could not query the same copy of the data.

The data lakehouse is an architecture that collapses these two systems into one by adding transactional capabilities directly to cheap object storage.

What the Lakehouse Adds

A data lake on S3 is just files. You can't run an atomic transaction on files. If a Spark job fails halfway through writing, you end up with partial data. There's no mechanism to enforce a schema, roll back a bad write, or see a consistent snapshot of the data.

A table format layer solves this. Technologies like Delta Lake, Apache Iceberg, and Apache Hudi sit between your compute engine and the raw Parquet files in S3. They maintain a transaction log: a record of every write operation, which files were added, which were deleted, and when.

This gives you:

  • ACID transactions: writes are atomic. A failed job leaves the table unchanged. Concurrent reads and writes don't corrupt each other.
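  • Time travel: query the table as it was at an earlier version or timestamp, for as long as the old files and log entries are retained. In Delta Lake SQL: SELECT * FROM orders TIMESTAMP AS OF '2026-01-01' (or VERSION AS OF followed by a version number). Useful for debugging bad writes, auditing, and ML training on historical snapshots.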
  • Schema enforcement: the table format rejects writes that don't match the declared schema, catching pipeline bugs before they corrupt downstream data.
  • Schema evolution: when you need to add a column, the format handles the migration without rewriting existing files.
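Schema enforcement and schema evolution can be illustrated with a toy check in plain Python. This is a minimal sketch, not how any table format is implemented: the Schema alias, orders_schema, validate, add_column, and read_row are all hypothetical names made up for this example.

```python
from typing import Any

# Hypothetical declared schema: column name -> (Python type, nullable).
Schema = dict[str, tuple[type, bool]]

orders_schema: Schema = {
    "order_id": (int, False),
    "country": (str, False),
    "amount": (float, False),
}


def validate(row: dict[str, Any], schema: Schema) -> None:
    """Schema enforcement: raise ValueError if the row does not match."""
    unknown = set(row) - set(schema)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    for name, (col_type, nullable) in schema.items():
        value = row.get(name)
        if value is None:
            if not nullable:
                raise ValueError(f"column {name!r} must not be null")
        elif not isinstance(value, col_type):
            raise ValueError(f"column {name!r} expects {col_type.__name__}")


def add_column(schema: Schema, name: str, col_type: type) -> Schema:
    """Schema evolution: return a new schema with one extra nullable column."""
    return {**schema, name: (col_type, True)}


def read_row(row: dict[str, Any], schema: Schema) -> dict[str, Any]:
    """Rows written before the change read back with None in new columns."""
    return {name: row.get(name) for name in schema}


old_row = {"order_id": 1, "country": "US", "amount": 9.5}
validate(old_row, orders_schema)  # matches the schema, so nothing is raised

v2 = add_column(orders_schema, "coupon", str)
print(read_row(old_row, v2))  # the old row gains coupon=None, no rewrite needed
```

The point of the sketch is that evolution only changes metadata: the old row is never rewritten, it is just read through the newer schema.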

How Delta Lake Works

Delta Lake is one of the most widely deployed table formats. It adds a _delta_log directory alongside your Parquet files in S3. Each committed transaction adds a numbered JSON file to this log listing the Parquet files it added or removed.

When you read a table, Delta Lake replays the log to determine which Parquet files are current. When you write, it adds the next log entry atomically. Two concurrent writers can't both commit the same version: one wins, and the other has to re-read the log and retry against the new version.
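The replay and commit rules above can be sketched with an in-memory stand-in for the log. This is a simplified illustration, not Delta Lake's code: ToyLog and its methods are hypothetical, and real log entries carry much more than a file path (statistics, partition values, timestamps).

```python
import json
from typing import Optional


class ToyLog:
    """In-memory stand-in for a _delta_log directory (hypothetical)."""

    def __init__(self) -> None:
        self.commits: dict[int, str] = {}  # version -> JSON commit body

    def commit(self, version: int, actions: list[dict]) -> bool:
        """Put-if-absent: only one writer can create a given version."""
        if version in self.commits:
            return False  # another writer won; caller re-reads and retries
        self.commits[version] = json.dumps(actions)
        return True

    def latest_version(self) -> int:
        return max(self.commits, default=-1)

    def live_files(self, as_of: Optional[int] = None) -> set[str]:
        """Replay commits in order to find the files that make up the table."""
        last = self.latest_version() if as_of is None else as_of
        files: set[str] = set()
        for version in range(last + 1):
            for action in json.loads(self.commits[version]):
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
        return files


log = ToyLog()
log.commit(0, [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}])
log.commit(1, [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])

print(sorted(log.live_files()))         # current snapshot
print(sorted(log.live_files(as_of=0)))  # time travel to version 0
print(log.commit(1, [{"add": "late.parquet"}]))  # losing writer gets False
```

Replaying only up to an older version is all that time travel needs here, which is why the removed Parquet files have to stay in storage until they are cleaned up.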

The log also stores column statistics for each file: min value, max value, null count. Query engines use these to skip files that can't possibly match a filter. A query for country = "US" can skip every file where max_country < "US" or min_country > "US", because "US" falls outside that file's value range. This is called data skipping, and it can sharply reduce how many files get scanned.
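As a rough sketch of data skipping, the snippet below keeps only the files whose min/max range could contain the filter value. FileStats and files_to_scan are hypothetical names for this example, and the statistics are made up.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_country: str
    max_country: str


def files_to_scan(files: list[FileStats], country: str) -> list[str]:
    """Keep only files whose [min, max] range could contain the value."""
    return [f.path for f in files if f.min_country <= country <= f.max_country]


files = [
    FileStats("part-0.parquet", "AR", "DE"),  # max < "US": skipped
    FileStats("part-1.parquet", "FR", "US"),  # range includes "US": scanned
    FileStats("part-2.parquet", "UY", "ZA"),  # min > "US": skipped
]

print(files_to_scan(files, "US"))  # ['part-1.parquet']
```

A file that passes the check is not guaranteed to contain matching rows. The statistics only rule files out, so the engine still applies the filter to whatever it reads.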

Z-ordering takes this further by physically co-locating records with similar values for a set of columns in the same files. Each file then covers a narrow min/max range on every Z-ordered column, so if you frequently filter by (country, city), Z-ordering those columns lets your filters skip more files.
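One way to picture Z-ordering is a Morton key: interleave the bits of two column values and sort by the result. This is a sketch under assumptions, since z_value is a hypothetical helper and the columns are assumed to be already mapped to small non-negative integers. Production implementations handle arbitrary types and skewed data differently.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


# Hypothetical rows as (country_id, city_id) pairs.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))

# Split the sorted rows into "files" of four rows each.
files = [rows[i:i + 4] for i in range(0, len(rows), 4)]

for n, chunk in enumerate(files):
    xs = [r[0] for r in chunk]
    ys = [r[1] for r in chunk]
    print(f"file {n}: x in [{min(xs)}, {max(xs)}], y in [{min(ys)}, {max(ys)}]")
```

Each toy file ends up covering a 2x2 block of values, so a filter on either column alone can skip half of the files. Sorting by x only would give tight ranges on x but leave every file spanning the full range of y.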

Apache Iceberg

Iceberg was developed at Netflix for very large tables (petabytes with billions of files). It was designed to solve problems that Hive-style directory-based tables struggled with at that scale, such as slow file listings and unsafe concurrent writes.

Key differences:

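  • Hidden partitioning: Iceberg hides partition details from queries. You declare a partition transform on a column, such as day(event_time), and filter on the column itself. You don't need to write WHERE year = 2026 AND month = 06. Iceberg's planner converts the predicate on the source column into a partition filter and prunes partitions automatically.
  • Partition evolution: you can change how data is partitioned without rewriting files, because old files keep their original partition layout and new files use the new one. Changing the partition columns of a Delta Lake table generally means rewriting the data.
  • Multi-table transactions: an Iceberg commit is atomic per table, but some catalogs (Project Nessie, for example) can commit changes to several tables atomically. Delta Lake's transaction log is scoped to one table at a time.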
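Hidden partitioning can be sketched as a planner that maps a filter on a timestamp column onto day partitions. This is an illustration only: day_transform, partitions, and plan_scan are hypothetical names, and the file index is made up rather than read from real Iceberg metadata.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from the timestamp."""
    return ts.date()


# Hypothetical file index: partition value -> data files written for that day.
partitions = {
    date(2026, 6, 7): ["a.parquet"],
    date(2026, 6, 8): ["b.parquet", "c.parquet"],
    date(2026, 6, 9): ["d.parquet"],
}


def plan_scan(ts_from: datetime, ts_to: datetime) -> list[str]:
    """Turn ts_from <= ts <= ts_to into a filter on day partitions."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    picked: list[str] = []
    for day, files in sorted(partitions.items()):
        if lo <= day <= hi:
            picked.extend(files)
    return picked


# The query only mentions the timestamp column, never a partition column.
print(plan_scan(datetime(2026, 6, 8, 0, 0), datetime(2026, 6, 8, 23, 59)))
```

Because the query never names the partition layout, the layout can change later without breaking existing queries, which is what makes partition evolution practical.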

Both Delta Lake and Iceberg are open formats: any engine that implements the spec can read the tables. Spark, Trino, Athena, Flink, and DuckDB all support both, though the level of read and write support varies by engine and version.

The Lakehouse vs. Separate Lake + Warehouse

Why not keep separate systems? For many teams, that still works fine. But the lakehouse wins on:

  • No data duplication: one copy of the data in S3, queried directly by both ML pipelines and BI tools.
  • Consistency: analysts and data scientists see the same data at the same point in time. With separate systems, the warehouse lags the lake by however long the ETL cycle takes, often hours.
  • Simpler architecture: fewer systems to operate, monitor, and debug. No ETL pipeline to keep in sync between lake and warehouse.

The trade-off: a dedicated warehouse (Snowflake, BigQuery) is still faster for concurrent BI queries. Warehouses use proprietary internal formats heavily optimized for analytical SQL. Lakehouse query latency on the same workload is usually higher, though the gap is closing.

Figure: Lakehouse architecture. Data lands in object storage as Parquet files. The table format layer (Delta Lake or Iceberg) maintains a transaction log that tracks which files are current, enabling ACID commits and time travel. Multiple query engines (Spark SQL, Athena) read the same data using the table format metadata. BI tools connect to any engine without data duplication.

When to Use a Lakehouse

The lakehouse model is a sensible default for most new data platforms. You get cheap storage, transactional writes, SQL access, and ML-friendly raw data access all from one system.

Stick with separate lake + warehouse if:

  • You have an existing warehouse heavily optimized for BI with invested schema models
  • Your BI query volume is very high and concurrency matters more than simplicity
  • Your team is too small to justify the effort of migrating and consolidating systems that already work

Summary

A data lakehouse adds a transaction log layer (Delta Lake, Apache Iceberg) on top of cheap object storage, bringing ACID transactions, time travel, schema enforcement, and schema evolution to raw Parquet files. For many teams this removes the need to maintain a separate data warehouse, because the same data is accessible to SQL query engines and ML pipelines without duplication. The cost is that a dedicated warehouse is usually still faster for highly concurrent BI queries. Delta Lake is one of the most widely deployed formats; Iceberg was designed for very large tables and adds hidden partitioning and partition evolution. Many new data platforms start with a lakehouse architecture today.
