# Lakehouse Architecture

Combining the flexibility of data lakes with the performance of data warehouses.
## Overview

The Lakehouse is a modern data architecture that combines the cost-effective storage of data lakes with the performance and management capabilities of data warehouses. It provides a single source of truth for all data types while maintaining both flexibility and performance.
## What is a Lakehouse?

Lakehouse = Data lake storage + warehouse capabilities
```mermaid
graph LR
    subgraph "Traditional Warehouse"
        A1[Source] --> B1[ETL Pipeline]
        B1 --> C1[Warehouse]
        C1 --> D1[Analytics]
        style C1 fill:#ffccbc
    end
    subgraph "Modern Lakehouse"
        A2[Source] --> B2[Ingestion]
        B2 --> C2[Raw Layer<br/>Lake Storage]
        C2 --> D2[Curated Layer<br/>Warehouse Capabilities]
        D2 --> E2[Serving Layer]
        E2 --> F2[Analytics]
        E2 --> G2[ML]
        E2 --> H2[APIs]
        style C2 fill:#b2dfdb
        style D2 fill:#80deea
    end
    A1 -.Evolution.-> A2
```

*The shift from monolithic warehouses to modern lakehouse architectures.*
## Key Characteristics
- Open formats - Parquet, Delta, Iceberg (not proprietary)
- ACID transactions - Reliable updates, deletes
- Schema enforcement - Data quality at write time
- Time travel - Query historical versions
- Upserts - Update existing records efficiently (see the sketch after this list)
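
Time travel is the easiest of these to see in code. Below is a minimal sketch using Delta Lake's Spark reader; the table path, version number, and timestamp are illustrative, and `spark` is assumed to be a SparkSession with Delta Lake configured. (Upserts are shown in the Implementation section below.)

```python
# Time travel: read the table as it existed at an earlier version
events_v1 = spark.read.format("delta") \
    .option("versionAsOf", 1) \
    .load("s3://lakehouse/curated/events")

# ...or as it existed at a point in time
events_jan14 = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-14") \
    .load("s3://lakehouse/curated/events")
```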
## Benefits

### 1. Cost Efficiency

- Storage: Object storage (S3, GCS) is roughly 10x cheaper than warehouse-managed storage
- Compute: Pay only for queries, not idle time
- Lifecycle: Move old data to cheaper tiers automatically
### 2. Flexibility

- Multiple engines: Query with Spark, Presto, BigQuery, Snowflake
- Multiple formats: Support structured, semi-structured, and unstructured data
- Schema evolution: Handle changing schemas gracefully

### 3. Performance

- Columnar formats: Parquet and Delta for fast analytics
- Partitioning: Partition pruning skips irrelevant data at query time
- Indexing: Fast lookups when needed

### 4. Single Source of Truth

- No duplication: One copy of data, multiple access patterns
- Consistency: Same data for all consumers
- Lineage: Clear data flow
## Lakehouse Formats

### Delta Lake

Best for: ACID transactions, time travel, upserts

Pros:

- ✅ ACID transactions
- ✅ Time travel (query historical versions)
- ✅ Upserts and deletes
- ✅ Schema evolution
- ✅ Metadata optimization

Cons:

- ❌ Requires compatible engines
- ❌ More complex than plain Parquet

Use when:

- Updates/deletes are needed
- Time travel is required
- Writes happen concurrently
### Apache Iceberg

Best for: Open format, multi-engine support

Pros:

- ✅ Open format (not vendor-specific)
- ✅ Multi-engine support
- ✅ Good performance
- ✅ Partition evolution

Cons:

- ❌ Less mature than Delta
- ❌ Smaller ecosystem

Use when:

- An open format is a priority
- Multiple engines must share tables
- Avoiding vendor lock-in
### Apache Hudi

Best for: Real-time updates, incremental processing

Pros:

- ✅ Real-time updates
- ✅ Incremental processing
- ✅ Good for streaming

Cons:

- ❌ Less mature
- ❌ Smaller ecosystem

Use when:

- Real-time updates are needed
- Streaming use cases dominate
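
For a sense of what Hudi usage looks like, here is a minimal upsert sketch with Spark. The table name, key fields, and path are illustrative, and it assumes the Hudi Spark bundle is on the classpath; see the Hudi documentation for full configuration.

```python
# Minimal Hudi upsert sketch (illustrative names and paths)
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",           # unique record key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://lakehouse/curated/events_hudi")
```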
## Architecture Layers

### 1. Raw Layer (Bronze)

Purpose: Preserve source data exactly as received

Characteristics:

- Immutable (append-only)
- Schema-on-read
- Long retention (e.g., 7 years)
- Partitioned by ingestion time

Format: Parquet, JSON, Avro
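
A bronze write is typically a plain, append-only dump stamped with its ingestion time; a sketch (the paths and the `ingestion_date` column are illustrative):

```python
from pyspark.sql import functions as F

# Stamp each record with its ingestion date, then append; bronze is
# never overwritten, so the source history stays intact
raw = df.withColumn("ingestion_date", F.current_date())

raw.write.format("parquet") \
    .mode("append") \
    .partitionBy("ingestion_date") \
    .save("s3://lakehouse/raw/events")
```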
### 2. Curated Layer (Silver)

Purpose: Cleaned, validated, enriched data

Characteristics:

- Schema-on-write
- Quality checks applied
- Enriched with reference data
- Partitioned by business keys

Format: Delta Lake, Iceberg, Parquet
### 3. Serving Layer (Gold)

Purpose: Analysis-ready, aggregated data

Characteristics:

- Optimized for queries
- Pre-aggregated
- Denormalized
- Indexed

Format: Delta Lake, Iceberg, or warehouse tables
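
A silver-to-gold step is usually a straightforward aggregation job. A sketch (paths and column names are illustrative):

```python
from pyspark.sql import functions as F

# Aggregate curated events into a denormalized, query-optimized gold table
silver = spark.read.format("delta").load("s3://lakehouse/curated/events")

daily_summary = silver.groupBy("date", "event_type").agg(
    F.count("*").alias("event_count"),
    F.countDistinct("user_id").alias("unique_users"),
)

daily_summary.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("date") \
    .save("s3://lakehouse/serving/daily_event_summary")
```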
## Implementation

### Example: Delta Lake on S3

```python
# Write to Delta Lake
df.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .partitionBy("date") \
    .save("s3://lakehouse/curated/events")

# Query with Spark
spark.read.format("delta") \
    .load("s3://lakehouse/curated/events") \
    .filter("date = '2024-01-15'") \
    .show()

# Upsert: merge an incoming batch (updatesDF) into the table on id
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://lakehouse/curated/events")
deltaTable.alias("target").merge(
    updatesDF.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
```
### Example: Iceberg on GCS

```python
# Write to Iceberg (path-based "Hadoop" table shown for brevity;
# Iceberg tables are more commonly managed through a catalog)
df.write.format("iceberg") \
    .mode("overwrite") \
    .option("write-format", "parquet") \
    .partitionBy("date") \
    .save("gs://lakehouse/curated/events")

# Query with Spark
spark.read.format("iceberg") \
    .load("gs://lakehouse/curated/events") \
    .filter("date = '2024-01-15'") \
    .show()
```
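
In practice, Iceberg tables are usually addressed through a catalog rather than by raw path. A sketch using Spark's `DataFrameWriterV2` API; the catalog and table names are illustrative and assume an Iceberg catalog named `lake` has been configured in the Spark session:

```python
# Create or replace a catalog-managed Iceberg table from a DataFrame
df.writeTo("lake.curated.events").using("iceberg").createOrReplace()

# Query it like any other catalog table
spark.table("lake.curated.events") \
    .filter("date = '2024-01-15'") \
    .show()
```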
## Best Practices

### 1. Partitioning Strategy

Use time-based partitioning (see the layout sketch below). Benefits:

- Query pruning
- Lifecycle management
- Parallel processing
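
A typical date-partitioned layout on object storage looks like this (paths illustrative); a filter such as `date = '2024-01-15'` lets the engine skip every other directory:

```text
s3://lakehouse/curated/events/
├── date=2024-01-14/
│   ├── part-00000.parquet
│   └── part-00001.parquet
└── date=2024-01-15/
    └── part-00000.parquet
```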
### 2. Schema Evolution

Backward-compatible changes:

- Add optional fields
- Make required fields optional
- Never remove fields (deprecate instead)
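
With Delta Lake, adding a new optional column is a one-option change on the write path. A sketch, where `new_df` (an incoming batch carrying the extra column) and the path are illustrative:

```python
# Append a batch whose schema adds a new column; mergeSchema evolves
# the table schema instead of failing the write
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://lakehouse/curated/events")
```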
### 3. Lifecycle Management

Automatically move old data between tiers:

- Hot (0-30 days): Active queries
- Warm (30-90 days): Occasional queries
- Cold (90+ days): Archive, compliance
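
On S3, tiering is usually a bucket lifecycle rule rather than a data pipeline. A minimal sketch with boto3; the bucket name, prefix, and day thresholds are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the raw prefix to cheaper storage classes as they age
s3.put_bucket_lifecycle_configuration(
    Bucket="lakehouse",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    },
)
```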
### 4. Metadata Management

Track:

- Schema versions
- Partition information
- Statistics (min/max values)
- Lineage
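
Table formats expose much of this metadata directly. With Delta Lake, for example (path illustrative):

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://lakehouse/curated/events")

# Commit history: versions, timestamps, operations (also powers time travel)
deltaTable.history().select("version", "timestamp", "operation").show()

# Table detail: schema, partition columns, size statistics
spark.sql(
    "DESCRIBE DETAIL delta.`s3://lakehouse/curated/events`"
).show(truncate=False)
```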
## Comparison: Lakehouse vs Alternatives
| Aspect | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage Cost | Low | High | Low |
| Query Performance | Medium | High | High |
| Schema Flexibility | High | Low | Medium |
| ACID Transactions | No | Yes | Yes |
| Time Travel | No | Limited | Yes |
| Multi-Engine | Yes | No | Yes |
## When to Use Lakehouse

Use a Lakehouse when:

- ✅ You need cost-effective storage
- ✅ You need fast queries
- ✅ You need schema flexibility
- ✅ You need ACID transactions
- ✅ You want to avoid vendor lock-in

Don't use a Lakehouse when:

- ❌ The use case is simple (a warehouse is enough)
- ❌ Engineering resources are scarce (use a managed warehouse)
- ❌ The scale is small (a warehouse is simpler)
## Related Topics

- Storage - Storage fundamentals
- Data Ingestion - Getting data into the lakehouse
- Data Processing - Processing lakehouse data

Next: Ingestion Architecture →