Data Lakes

Updated June 8, 2026

Magic Magnets Team

Companies generate far more data than fits naturally in a relational database. Server logs, raw sensor readings, video files, JSON event streams, user-uploaded images: none of this maps cleanly onto structured rows and columns. And you often don't know in advance which of it will be useful.

A data lake is a centralized storage repository for all of this data, stored in raw format. You ingest first and structure later.

What a Data Lake Is

A data lake stores data exactly as it was produced. No schema is required upfront. JSON stays JSON. A video file stays a video file. A CSV stays a CSV.

The underlying storage is almost always object storage: Amazon S3, Google Cloud Storage, or Azure Data Lake Storage Gen2. Object storage is cheap (on the order of a few cents per GB per month for standard tiers, and fractions of a cent for archival tiers), scales to petabytes, and requires no schema. You write an object under a key you choose and read it back by that key later.

Three properties define a data lake:

Schema-on-read: you impose a schema when you query the data, not when you store it. This is the opposite of a relational database, which requires a schema before any data is written.
Any data type: structured tables, semi-structured JSON and Avro, unstructured binary files (images, audio, video) all coexist.
Massive scale at low cost: object storage costs far less than block storage or managed databases per gigabyte.
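To make schema-on-read concrete, here is a minimal sketch using only the Python standard library. The event fields and the SCHEMA mapping are hypothetical and chosen for illustration; a real query engine would take the schema from a catalog rather than from a dictionary in code.

```python
import json
from datetime import datetime, timezone

# Raw events exactly as they were produced; nothing was validated at write time.
RAW_LINES = [
    '{"user_id": "42", "event": "play", "ts": "2026-06-08T10:15:00Z"}',
    '{"user_id": 7, "event": "pause", "ts": "2026-06-08T10:16:30Z", "device": "tv"}',
]

# Hypothetical read-time schema: field name -> function that coerces the raw value.
SCHEMA = {
    "user_id": int,
    "event": str,
    "ts": lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc),
}

def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema at read time."""
    for line in lines:
        raw = json.loads(line)
        # Fields outside the schema (like "device") are ignored, not rejected.
        yield {name: cast(raw[name]) for name, cast in schema.items()}

events = list(read_with_schema(RAW_LINES, SCHEMA))
```

The stored lines never change. A different reader could apply a different schema to the same bytes, for example one that keeps the device field.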

File Formats

The format you choose affects how efficiently you can query the data later.

JSON and CSV: human-readable, easy to produce, slow to query. Finding the records where country = "US" in a 500 GB CSV means reading every byte of the file.

Parquet: columnar binary format. Stores data column-by-column rather than row-by-row. If your query only touches 3 columns out of 50, Parquet reads only those columns off disk. It also supports column-level compression and predicate pushdown: each row group carries min/max statistics per column, so the reader can skip entire row groups that can't match your filter. For analytics workloads, Parquet is often several times faster to scan than CSV and several times smaller on disk (figures of 5-10x and 3-5x are commonly quoted), though the actual gain depends on the data and the query.
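The two ideas behind that speedup, column pruning and row-group skipping, can be sketched in plain Python. This is a toy model and not the real Parquet file format: the make_row_group and scan helpers and the sample rows are hypothetical.

```python
# Toy row groups: each holds its columns separately plus min/max statistics,
# loosely mirroring how a columnar file is laid out. Not the real Parquet format.
def make_row_group(rows):
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}

def scan(row_groups, select, column, value):
    """Return the `select` column for rows where `column == value`."""
    results, groups_read = [], 0
    for group in row_groups:
        low, high = group["stats"][column]
        if not (low <= value <= high):
            continue  # pushdown: the statistics prove no row here can match
        groups_read += 1
        # Column pruning: only the two columns involved are touched.
        for key, out in zip(group["columns"][column], group["columns"][select]):
            if key == value:
                results.append(out)
    return results, groups_read

rows = [
    {"country": "CA", "amount": 10, "note": "a"},
    {"country": "DE", "amount": 20, "note": "b"},
    {"country": "UK", "amount": 30, "note": "c"},
    {"country": "US", "amount": 40, "note": "d"},
]
# Data sorted by country, two rows per group.
row_groups = [make_row_group(rows[0:2]), make_row_group(rows[2:4])]
amounts, groups_read = scan(row_groups, "amount", "country", "US")
```

Skipping works well here because the rows are sorted by country, so each group covers a narrow range. On unsorted data the min/max ranges overlap and fewer groups can be skipped.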

Avro: row-based binary format with an embedded schema. Good for streaming pipelines where you produce records one at a time. Kafka connectors often serialize events as Avro.

Most modern data lakes store raw data as JSON or CSV (easy to ingest), then convert it to Parquet during the transform step (fast to query).

Zone Architecture

A well-organized data lake uses distinct zones to separate raw data from processed data.

Raw zone (also called the bronze layer): data lands here exactly as received. Never modify or delete it. This is the source of truth. If a downstream job has a bug, you can always re-process from the raw zone.

Cleaned zone (silver layer): Spark or Glue jobs read the raw zone, fix encoding issues, deduplicate records, normalize timestamps, and write the result here in Parquet. Still relatively close to the original data.

Curated zone (gold layer): further aggregated and joined datasets optimized for specific use cases. A marketing dataset might join user events with demographic data and pre-compute 30-day activity metrics. An ML features table might live here.

Query engines like Amazon Athena or Spark SQL read from the curated zone for most analytical queries.
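To make the raw-to-cleaned step concrete, here is a minimal sketch of deduplication and timestamp normalization in plain Python. In practice this logic would run inside a Spark or Glue job. The event_id and ts field names, and the convention that numeric timestamps are epoch seconds, are assumptions for illustration.

```python
from datetime import datetime, timezone

def clean(raw_records):
    """Deduplicate by event_id and normalize timestamps to UTC ISO 8601."""
    seen = set()
    cleaned = []
    for record in raw_records:
        if record["event_id"] in seen:
            continue  # drop repeated deliveries of the same event
        seen.add(record["event_id"])
        ts = record["ts"]
        if isinstance(ts, (int, float)):
            # Assumed convention: numeric timestamps are epoch seconds.
            parsed = datetime.fromtimestamp(ts, tz=timezone.utc)
        else:
            parsed = datetime.fromisoformat(ts).astimezone(timezone.utc)
        # Build a new dict so the raw record is never mutated.
        cleaned.append({**record, "ts": parsed.isoformat()})
    return cleaned

raw = [
    {"event_id": "e1", "ts": 0},
    {"event_id": "e1", "ts": 0},
    {"event_id": "e2", "ts": "2026-06-08T12:00:00+02:00"},
]
silver = clean(raw)
```

The function returns new records and leaves its input untouched, which mirrors the rule that the raw zone is read but never modified.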

Figure: Data lake zone architecture. Raw data lands in S3 unmodified. Spark jobs read the raw zone, clean and partition the data, and write Parquet files to the curated zone. The data catalog (AWS Glue) stores the schema for each dataset. Query engines like Athena read from the curated zone using catalog metadata.

The Data Catalog

Storing files in S3 is easy. Knowing what those files contain six months later is hard. A data catalog solves this.

The catalog is a metadata store that maps file paths to schemas. For each dataset it tracks: what columns exist, their types, which S3 prefix the files live under, and when the data was last updated. Query engines consult the catalog to understand how to read a file without embedding that knowledge in the query.
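As a rough illustration of what a catalog entry holds, here is a toy in-memory version. The Catalog and DatasetEntry classes are hypothetical and do not reflect the API of Glue or Hive Metastore; real catalogs persist this metadata and share it across engines.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    prefix: str      # where the files live
    columns: dict    # column name -> type name
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class Catalog:
    """Toy in-memory catalog mapping dataset names to location and schema."""

    def __init__(self):
        self._entries = {}

    def register(self, name, prefix, columns):
        self._entries[name] = DatasetEntry(prefix, dict(columns))

    def lookup(self, name):
        if name not in self._entries:
            raise KeyError(f"dataset {name!r} is not registered")
        return self._entries[name]

catalog = Catalog()
catalog.register("events", "s3://bucket/events/", {"user_id": "bigint", "event": "string"})
entry = catalog.lookup("events")
```

A query engine would call something like lookup before reading, so the query itself only needs the dataset name and not the file paths or column types.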

AWS Glue Data Catalog is the most common choice on AWS. Glue Crawlers scan S3 prefixes, infer schemas, and register them automatically. Athena, Spark, and EMR all read from the Glue catalog.

Apache Hive Metastore is the open-source equivalent. Most self-hosted Spark setups use it, and Databricks has historically offered a Hive-compatible metastore alongside its newer Unity Catalog.

Without a catalog, your data lake becomes a directory full of files with no record of what's in them. That is a data swamp.

Avoiding Data Swamps

A data swamp is what a data lake becomes when it has no governance. Files accumulate with no metadata. Nobody knows which datasets are current and which are stale. Queries return wrong results because the schema changed two months ago and nobody updated the catalog. Data scientists spend half their time figuring out what the data means before they can use it.

Prevention:

Enforce catalog registration: every dataset must be registered in the catalog before it can be queried. No ad-hoc files dropped in S3 without metadata.
Partition by date: store data under prefixes like s3://bucket/events/year=2026/month=06/day=08/. Query engines can prune entire partitions without scanning them.
Data quality checks: run validation after every load. Count rows, check for nulls in required fields, verify value ranges. Fail loudly when something looks wrong.
Lineage tracking: know which upstream sources produced each dataset and which downstream jobs consume it. When a source schema changes, you know exactly what breaks.
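Two of these practices, date partitioning and post-load quality checks, are easy to sketch in plain Python. The helper names, the sample rows, and the age range are hypothetical; the prefix layout follows the year=/month=/day= pattern shown above.

```python
from datetime import date, timedelta

def partition_prefix(base, day):
    """Build a Hive-style partition prefix for one day."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

def prefixes_for_range(base, start, end):
    """List only the partitions a date-range query has to read (inclusive)."""
    days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days)]

def validate(rows, required, ranges):
    """Return a list of problems; an empty list means the load looks sane."""
    problems = []
    if not rows:
        problems.append("no rows loaded")
    for i, row in enumerate(rows):
        for name in required:
            if row.get(name) is None:
                problems.append(f"row {i}: {name} is null")
        for name, (low, high) in ranges.items():
            value = row.get(name)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {name}={value} outside [{low}, {high}]")
    return problems

prefixes = prefixes_for_range("s3://bucket/events", date(2026, 6, 7), date(2026, 6, 8))
problems = validate(
    [{"user_id": 1, "age": 30}, {"user_id": None, "age": 200}],
    required=["user_id"],
    ranges={"age": (0, 130)},
)
```

In a pipeline, a non-empty problems list would typically raise an exception so the load fails loudly instead of publishing bad data.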

Real-World Examples

Netflix

Netflix stores every pause, seek, rating, and search event as raw JSON in S3. Data scientists run Spark jobs over this lake to train the recommendation engine. Having the raw events (rather than pre-aggregated summaries) lets researchers try new feature ideas against full historical data without re-ingesting from source systems.

Autonomous Vehicles

Waymo's vehicles generate terabytes of camera frames, LIDAR point clouds, and radar sweeps per day. This data can't go into a relational database. It goes into a data lake, where ML training pipelines read it directly to train perception models. Storing raw sensor data means engineers can reprocess it with improved models months later.

Data Lake vs. Data Warehouse

These are often confused but serve different purposes.

	Data Lake	Data Warehouse
Storage format	Raw files (JSON, Parquet, images)	Structured tables
Schema	Schema-on-read	Schema-on-write
Users	Data scientists, ML engineers	Analysts, BI tools
Query latency	Slower (scans files)	Faster (optimized storage layout and statistics)
Cost per GB	Very low	Higher
Best for	ML training, raw exploration	Dashboards, reporting

Many organizations use both: the data lake as the raw archive and ML feature store, the data warehouse for business intelligence queries.

Summary

A data lake stores all data types in raw format in cheap object storage, applies schema at query time, and scales to petabytes. Zone architecture separates raw, cleaned, and curated datasets. Parquet is the preferred format for analytical queries due to columnar storage and predicate pushdown. A data catalog maps file paths to schemas and is essential to avoid a data swamp. Without governance (catalog registration, quality checks, lineage tracking), a data lake degrades into an unusable pile of undocumented files.
