6 min read · Architecture · Lakehouse · Data Engineering

From Data Warehouse to Data Lake to Lakehouse: What Actually Changed and Why It Matters

By Alex

A practical look at how data architecture evolved from rigid warehouses to sprawling lakes and finally to the lakehouse pattern that combines the best of both, and why the distinction matters when you're choosing where to build.

There’s a framing that gets repeated a lot in data engineering: the data warehouse came first, the data lake was the response, and the lakehouse is the synthesis. That’s roughly true, but the real story is more useful than the headline.

The Data Warehouse Era

My introduction to this world was Bill Inmon’s Building the Data Warehouse, the book that defined the discipline and shaped how a generation of engineers thought about organising data. It is dense, opinionated, and worth every page.

Data warehouses (Teradata, then Redshift, Snowflake, BigQuery) solved a real problem: how do you run analytical queries against operational data without killing your transactional database?

The answer was to extract data from source systems, transform it into a structured schema, and load it into a separate, read-optimised store. The ETL pattern. The Kimball star schema. Fact tables and dimensions. These weren’t arbitrary choices; they were engineering solutions to specific constraints: limited storage, expensive compute, and the need for consistent, fast query performance at scale.
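The fact-and-dimension shape is easy to see in miniature. Here's a toy star schema in SQLite (table and column names are illustrative, not from any real warehouse):

```python
import sqlite3

# Toy star schema: one fact table joined to one dimension table.
# Names (fact_sales, dim_product) are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 20.0);
""")

# The classic warehouse query: aggregate facts, group by a dimension attribute.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('books', 15.0), ('games', 20.0)]
```

The point of the shape: facts are narrow and huge, dimensions are wide and small, and the read-optimised store is built so this join-and-aggregate pattern is fast and predictable.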

Warehouses worked. They still work. But they came with hard trade-offs:

  • Schema on write: you had to know your schema before loading data. Changing it later was painful and often required rebuilding.
  • Proprietary formats: your data lived inside the warehouse’s internal storage, owned by the vendor.
  • Limited data types: structured, tabular data only. Images, audio, and unstructured text didn’t belong.
  • Cost at scale: always-on compute meant you were paying whether queries were running or not.

These trade-offs were acceptable when data volumes were manageable and use cases were predictable: dashboards, scheduled reports, known business metrics.

The Data Lake Experiment

By the mid-2010s, data volumes had grown beyond what warehouses could economically store, and the use cases had broadened. Machine learning needed raw, unprocessed data. Data science teams wanted access to the full event history, not a pre-aggregated summary. Streaming data was arriving in formats no schema-on-write system could handle gracefully.

The data lake was the response. The idea was simple: throw everything into object storage (S3, GCS, ADLS), worry about structure later. Schema on read. Store it cheap, figure out what to do with it when you need it.

In practice, this created a different set of problems:

  • No ACID guarantees: two jobs writing to the same table at the same time would corrupt it. There was no row-level locking, no atomic commits.
  • No data versioning: once you overwrote a partition, the previous state was gone unless you’d built your own snapshotting mechanism.
  • Consistency issues: a reader mid-query could see partial results if a writer was simultaneously updating the table.
  • Governance and quality: data lakes became “data swamps” surprisingly fast. Without tooling to enforce quality and ownership, you ended up with a large, cheap, unreliable mess.
  • Performance: query engines running against raw Parquet files in S3 didn’t have the metadata they needed to prune files efficiently. Full table scans were common.

The lake solved the storage-cost problem and the schema-flexibility problem, but it reintroduced reliability problems the warehouse had already solved.

The Lakehouse Pattern

The lakehouse, a term popularised by Databricks and formalised in the 2021 CIDR paper of the same name, is the attempt to resolve that tension. The core idea: keep data in open-format files on object storage (cheap, flexible, vendor-neutral), but layer a transactional metadata protocol on top that brings warehouse-grade reliability.

The two dominant implementations today are Apache Iceberg and Delta Lake, with Apache Hudi as a third option used in specific streaming scenarios.

What these table formats actually add:

ACID transactions. Writes are committed atomically. A reader never sees a partial write. Two writers don’t corrupt each other’s output. This sounds like table stakes (and in a warehouse it was), but it’s what was missing from the raw lake.
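The trick behind those atomic commits is that a write never mutates existing files: it produces a new immutable snapshot, and committing is one atomic swap of a current-version pointer. Here's a minimal sketch of that idea in pure Python; the file layout and names are illustrative, not Iceberg's actual spec:

```python
import json, os, tempfile

# Sketch of atomic commits without row-level locks: every write creates
# a new immutable snapshot file, and "committing" is a single atomic
# swap of a pointer to the current snapshot.
table_dir = tempfile.mkdtemp()

def commit(snapshot_id: int, data_files: list[str]) -> None:
    snap_path = os.path.join(table_dir, f"snap-{snapshot_id}.json")
    with open(snap_path, "w") as f:
        json.dump({"id": snapshot_id, "data_files": data_files}, f)
    # os.replace is atomic on POSIX: a reader sees the old pointer or
    # the new one, never a half-written file.
    tmp = os.path.join(table_dir, "CURRENT.tmp")
    with open(tmp, "w") as f:
        f.write(snap_path)
    os.replace(tmp, os.path.join(table_dir, "CURRENT"))

def read_current() -> dict:
    with open(os.path.join(table_dir, "CURRENT")) as f:
        snap_path = f.read()
    with open(snap_path) as f:
        return json.load(f)

commit(1, ["part-000.parquet"])
commit(2, ["part-000.parquet", "part-001.parquet"])
print(read_current()["id"])  # 2
```

Real table formats add conflict detection on top (a commit fails if the pointer moved underneath you), but the pointer swap is the heart of it.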

Time travel. Every write creates a new snapshot. You can query any previous version of a table. This is genuinely useful: if a pipeline bug corrupts your data, you don’t restore from a backup; you query the previous snapshot and rebuild from there.
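Because every commit appends a snapshot rather than replacing state, "query the previous version" is just reading an older snapshot's file list. A toy in-memory model of the recovery scenario above (real formats persist snapshots as metadata files on object storage):

```python
# Sketch of time travel: history is append-only, so recovering from a
# bad pipeline run means reading the last good snapshot, not a backup.
history: list[list[str]] = []  # history[n] = data files at snapshot n

def commit(data_files: list[str]) -> int:
    history.append(list(data_files))
    return len(history) - 1  # snapshot id

def read(snapshot: int = -1) -> list[str]:
    return history[snapshot]  # default: latest

good = commit(["part-000.parquet"])
commit(["part-000.parquet", "corrupted.parquet"])  # buggy pipeline run

latest = read()      # the bad state is what "now" looks like...
rollback = read(good)  # ...but the good snapshot is still one read away
```

Engines expose this as SQL, e.g. a `VERSION AS OF` / `TIMESTAMP AS OF` clause in Spark SQL or Trino.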

Schema evolution. Adding a column, renaming a field, changing a type: all handled cleanly with metadata tracking. No full table rebuilds.
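The reason this is cheap: these formats track columns by stable IDs in metadata, so a rename or an added column rewrites metadata, not data files. A sketch (field names are made up for illustration):

```python
# Columns are identified by stable IDs; data files store values by ID.
# Renames and additions touch only the schema metadata.
schema = {1: "user_id", 2: "signup_ts"}        # column_id -> current name
old_file_row = {1: 42, 2: "2024-01-01"}        # written under the old schema

schema[2] = "created_at"   # rename: metadata-only, no file rewrite
schema[3] = "country"      # add column: new ID, old files untouched

def project(row: dict) -> dict:
    # Columns added after a file was written read as NULL from that file.
    return {name: row.get(col_id) for col_id, name in schema.items()}

print(project(old_file_row))
# {'user_id': 42, 'created_at': '2024-01-01', 'country': None}
```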

Hidden partitioning and file pruning. The format tracks which files contain which data ranges. Query engines can skip files entirely based on partition filters. This turns full table scans into targeted reads, often a 10x improvement in query cost and latency.
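The pruning logic is simple once the metadata exists: the format records min/max values per data file, and the engine keeps only files whose range can overlap the filter. A sketch with made-up stats:

```python
# Sketch of metadata-based file pruning: per-file min/max stats let the
# engine skip files that cannot match the filter. Stats are illustrative.
file_stats = {
    "part-000.parquet": {"min": "2024-01-01", "max": "2024-01-31"},
    "part-001.parquet": {"min": "2024-02-01", "max": "2024-02-29"},
    "part-002.parquet": {"min": "2024-03-01", "max": "2024-03-31"},
}

def files_for_range(lo: str, hi: str) -> list[str]:
    # Keep a file only if its [min, max] range overlaps the query's [lo, hi].
    return [f for f, s in file_stats.items()
            if s["min"] <= hi and s["max"] >= lo]

pruned = files_for_range("2024-02-10", "2024-02-20")
print(pruned)  # ['part-001.parquet']
```

One file read instead of three, before a single byte of data is scanned; at warehouse scale that's the difference between a targeted read and a full table scan.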

Separation of storage and compute. Your data sits in object storage like S3. Your compute is whatever engine you point at it: Spark, Trino, Athena, Flink, DuckDB. You’re not locked into one vendor’s proprietary format.

What This Means in Practice

I’ve built lakehouses on top of Apache Iceberg at two companies now. The pattern is consistent: raw data lands in object storage (S3) as Iceberg tables, Spark jobs scheduled through Glue handle ingestion and transformation, Athena handles ad-hoc queries, and dbt manages the semantic layer.

The concrete difference from a traditional warehouse is felt on two levels. The first is cost: pay-per-query compute replaces always-on proprietary clusters, and Iceberg’s metadata-based file pruning cuts the data scanned per query, which together bring storage and query costs down meaningfully. The second is performance: queries that used to take hours against a poorly partitioned warehouse run in minutes against a well-structured Iceberg table.

But the more important change is architectural. A team can add a new data source without deciding upfront exactly how it will be used. Raw data lands as-is in the raw layer. Transformations clean and shape it in staging. Curated, business-ready models sit on top. If a pipeline breaks, time travel means you can reprocess from any historical snapshot without restoring from backup. Schema changes happen in a migration, no weekend downtime, no manual partition rebuilds.

Which Should You Use?

The honest answer is: it depends on where you’re starting.

If you’re a small team and your data fits comfortably in a managed warehouse, Snowflake or BigQuery is probably still the right call. The operational simplicity is worth the cost and lock-in at that scale.

If you’re on AWS and growing fast, with high data volumes, mixed workloads, and machine learning alongside analytics, Iceberg on S3 with Athena and Glue gives you a platform you’ll never need to migrate off. Everything is open format, everything is yours.

If you’re already on Databricks, Delta Lake is the native choice and deeply integrated with the rest of the platform.

The lakehouse isn’t a magic solution. It adds complexity: you need to manage compaction, snapshot expiry, and schema migrations. But it removes the category of problems that made data lakes unreliable, while keeping the cost and flexibility advantages that made warehouses hard to scale.

The shift happened because the constraints changed. Storage got cheap. Compute got elastic. ML workloads arrived that warehouses weren’t designed for. The lakehouse is what good engineering looks like when you design for those constraints, not the constraints of 2005.

Building a data platform?

Free discovery call. Tell me where your stack is today and where you need it to go.