ETL Pipelines

Updated June 8, 2026

Magic Magnets Team

8 min read

Raw data is messy. Customer records live in a Postgres database. Payment data lives in Stripe. Clickstream data lands in S3 log files. If a business analyst wants to see total revenue broken down by customer segment, they can't query three disconnected systems at once.

An ETL pipeline solves this by automatically pulling data from every source, cleaning and reshaping it, and depositing it in one place where analysts can run queries.

ETL stands for Extract, Transform, Load. It's the plumbing that makes data warehouses and business intelligence tools work.

Extract

The extract phase connects to each source system and pulls the raw data. Sources are anything: relational databases, third-party APIs, flat files in object storage, event logs, SaaS products.

Two patterns dominate:

Full extraction: pull everything from the source on every run. Simple, but expensive for large tables. Works fine for small datasets.

Incremental extraction: pull only records that changed since the last run, typically using a updated_at timestamp column or a change data capture (CDC) mechanism. CDC taps into the database's write-ahead log and streams row-level changes in near real-time.

Common tools: Fivetran and Airbyte are popular managed connectors. They handle pagination, retries, API rate limits, and schema changes so you don't have to.

Transform

The transform phase is where the real work happens. Raw data from different sources uses different conventions, formats, and schemas. The transform step makes it consistent.

Common operations:

Deduplication: remove records that appear in multiple sources with different IDs
Normalization: convert all timestamps to UTC, all currencies to USD
Joining: attach a Stripe customer ID to the Postgres user record using email as the join key
Aggregation: compute "total spent this month" from individual transaction rows
Filtering: drop internal test accounts before loading to the warehouse

This phase runs on a processing engine like Spark for large volumes, or inside the warehouse itself using SQL (see ELT below).

Load

The load phase writes the cleaned data to the destination, usually a data warehouse like Snowflake, BigQuery, or Redshift. Warehouse loads are typically bulk inserts or upserts (insert new rows, update existing ones).

After loading, analysts can query the unified dataset with standard SQL. Dashboards pull from here. Machine learning features are computed from here. Reporting runs from here.

algobase.dev

ETL pipeline: an ETL processor (Spark job or dbt run) pulls data from multiple sources on a schedule, applies transformations (cleaning, joining, formatting), and loads the unified result into a data warehouse. Analysts query the warehouse directly for dashboards and reports.

1 / 1

ETL pipeline pulling from three sources, transforming, and loading into a data warehouse

ETL vs. ELT

The classic order is Extract, Transform, Load. But modern cloud data warehouses changed the calculus.

Historically, transformations ran on a separate ETL server before loading, because warehouses were expensive and slow. Running heavy SQL inside a warehouse cost real money.

Today, Snowflake and BigQuery can run massive SQL queries cheaply at scale. So many teams flip the order: Extract, Load, Transform (ELT).

In ELT:

Extract raw data from sources
Load it directly into the warehouse as-is
Transform it inside the warehouse using SQL

dbt (Data Build Tool) is the dominant tool here. It manages SQL transformations as versioned files with dependencies, tests, and documentation. A dbt run executes your SQL models in the right order, builds intermediate tables, and validates results.

ELT with dbt has become the default for modern data teams because:

Raw data is always preserved in the warehouse (no data loss during transform bugs)
SQL is simpler to write and debug than Spark jobs for most analysts
Warehouse compute costs have dropped to the point that running transforms inside is affordable

ETL still makes sense when transformations are computationally intensive (ML feature engineering, large joins across petabyte-scale tables) or when loading raw data to the warehouse first would violate compliance rules.

Scheduling and Orchestration

ETL pipelines run on a schedule: hourly, nightly, or triggered by events. Orchestration tools manage this.

Apache Airflow is the most widely used. A pipeline is a DAG (directed acyclic graph) of tasks, each running a Python function or a bash command. Airflow handles scheduling, retries, dependency ordering, and provides a UI to monitor runs.

Prefect and Dagster are newer alternatives with better developer ergonomics. They're type-aware, support dynamic pipelines, and handle failures more gracefully than Airflow.

Failure Handling

ETL pipelines fail. APIs return errors. Databases go down mid-extract. Partial loads corrupt downstream tables.

Patterns to handle this:

Idempotency: running the pipeline twice should produce the same result as running it once. Use upserts (not inserts) in the load phase.
Checkpointing: track how far extraction got so a retry can resume rather than restart.
Dead-letter queues: records that fail transformation are written to a separate table for inspection rather than dropped silently.
Alerting: Airflow and dbt both integrate with PagerDuty, Slack, and email to notify the team when a run fails.

Summary

ETL pipelines extract raw data from scattered sources, apply transformations to clean and unify it, and load the result into a centralized warehouse. The extract phase uses scheduled pulls or CDC. The transform phase deduplicates, normalizes, joins, and aggregates. The load phase bulk-inserts into a warehouse. Modern teams often use ELT instead, loading raw data first and transforming it inside the warehouse with dbt. Airflow and Prefect handle scheduling and orchestration. Idempotent loads and checkpointing keep the pipeline reliable when individual steps fail.

Data Lakes

How helpful was this content?

Comments

0/2000

Saved on this device only