# Lakehouse Architecture

Combining the flexibility of data lakes with the performance of data warehouses.
## Overview

The Lakehouse is a modern data architecture that combines the cost-effective storage of data lakes with the performance and management capabilities of data warehouses. It provides a single source of truth for all data types while maintaining both flexibility and performance.
## What is a Lakehouse?

Lakehouse = Data lake storage + warehouse capabilities
```mermaid
graph LR
    subgraph "Traditional Warehouse"
        A1[Source] --> B1[ETL Pipeline]
        B1 --> C1[Warehouse]
        C1 --> D1[Analytics]
        style C1 fill:#ffccbc
    end
    subgraph "Modern Lakehouse"
        A2[Source] --> B2[Ingestion]
        B2 --> C2[Raw Layer<br/>Lake Storage]
        C2 --> D2[Curated Layer<br/>Warehouse Capabilities]
        D2 --> E2[Serving Layer]
        E2 --> F2[Analytics]
        E2 --> G2[ML]
        E2 --> H2[APIs]
        style C2 fill:#b2dfdb
        style D2 fill:#80deea
    end
    A1 -.Evolution.-> A2
```

*The shift from monolithic warehouses to modern lakehouse architectures.*
## Key Characteristics
- Open formats - Parquet, Delta, Iceberg (not proprietary)
- ACID transactions - Reliable updates, deletes
- Schema enforcement - Data quality at write time
- Time travel - Query historical versions
- Upserts - Update existing records efficiently (see the sketch after this list)
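
Time travel is the easiest of these to see in code. Below is a minimal sketch using Delta Lake's Spark reader; the table path, version number, and timestamp are illustrative, and `spark` is assumed to be a SparkSession with Delta Lake configured. (Upserts are shown in the Implementation section below.)

```python
# Time travel: read the table as it existed at an earlier version
events_v1 = spark.read.format("delta") \
    .option("versionAsOf", 1) \
    .load("s3://lakehouse/curated/events")

# ...or as it existed at a point in time
events_jan14 = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-14") \
    .load("s3://lakehouse/curated/events")
```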
## Benefits

### 1. Cost Efficiency

- Storage: Object storage (S3, GCS) is roughly 10x cheaper than warehouse-managed storage
- Compute: Pay only for queries, not idle time
- Lifecycle: Move old data to cheaper tiers automatically
### 2. Flexibility

- Multiple engines: Query with Spark, Presto, BigQuery, Snowflake
- Multiple formats: Support structured, semi-structured, and unstructured data
- Schema evolution: Handle changing schemas gracefully

### 3. Performance

- Columnar formats: Parquet and Delta for fast analytics
- Partitioning: Partition pruning skips irrelevant data at query time
- Indexing: Fast lookups when needed

### 4. Single Source of Truth

- No duplication: One copy of data, multiple access patterns
- Consistency: Same data for all consumers
- Lineage: Clear data flow
## Lakehouse Formats

### Delta Lake

Best for: ACID transactions, time travel, upserts

Pros:

- ✅ ACID transactions
- ✅ Time travel (query historical versions)
- ✅ Upserts and deletes
- ✅ Schema evolution
- ✅ Metadata optimization

Cons:

- ❌ Requires compatible engines
- ❌ More complex than plain Parquet

Use when:

- Updates/deletes are needed
- Time travel is required
- Writes happen concurrently
### Apache Iceberg

Best for: Open format, multi-engine support

Pros:

- ✅ Open format (not vendor-specific)
- ✅ Multi-engine support
- ✅ Good performance
- ✅ Partition evolution

Cons:

- ❌ Less mature than Delta
- ❌ Smaller ecosystem

Use when:

- An open format is a priority
- Multiple engines must share tables
- Avoiding vendor lock-in
### Apache Hudi

Best for: Real-time updates, incremental processing

Pros:

- ✅ Real-time updates
- ✅ Incremental processing
- ✅ Good for streaming

Cons:

- ❌ Less mature
- ❌ Smaller ecosystem

Use when:

- Real-time updates are needed
- Streaming use cases dominate
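
For a sense of what Hudi usage looks like, here is a minimal upsert sketch with Spark. The table name, key fields, and path are illustrative, and it assumes the Hudi Spark bundle is on the classpath; see the Hudi documentation for full configuration.

```python
# Minimal Hudi upsert sketch (illustrative names and paths)
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",           # unique record key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://lakehouse/curated/events_hudi")
```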
## Architecture Layers

### 1. Raw Layer (Bronze)

Purpose: Preserve source data exactly as received

Characteristics:

- Immutable (append-only)
- Schema-on-read
- Long retention (e.g., 7 years)
- Partitioned by ingestion time

Format: Parquet, JSON, Avro
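
A bronze write is typically a plain, append-only dump stamped with its ingestion time; a sketch (the paths and the `ingestion_date` column are illustrative):

```python
from pyspark.sql import functions as F

# Stamp each record with its ingestion date, then append; bronze is
# never overwritten, so the source history stays intact
raw = df.withColumn("ingestion_date", F.current_date())

raw.write.format("parquet") \
    .mode("append") \
    .partitionBy("ingestion_date") \
    .save("s3://lakehouse/raw/events")
```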
### 2. Curated Layer (Silver)

Purpose: Cleaned, validated, enriched data

Characteristics:

- Schema-on-write
- Quality checks applied
- Enriched with reference data
- Partitioned by business keys

Format: Delta Lake, Iceberg, Parquet
### 3. Serving Layer (Gold)

Purpose: Analysis-ready, aggregated data

Characteristics:

- Optimized for queries
- Pre-aggregated
- Denormalized
- Indexed

Format: Delta Lake, Iceberg, or warehouse tables
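
A silver-to-gold step is usually a straightforward aggregation job. A sketch (paths and column names are illustrative):

```python
from pyspark.sql import functions as F

# Aggregate curated events into a denormalized, query-optimized gold table
silver = spark.read.format("delta").load("s3://lakehouse/curated/events")

daily_summary = silver.groupBy("date", "event_type").agg(
    F.count("*").alias("event_count"),
    F.countDistinct("user_id").alias("unique_users"),
)

daily_summary.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("date") \
    .save("s3://lakehouse/serving/daily_event_summary")
```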
## Implementation

### Example: Delta Lake on S3

```python
# Write to Delta Lake
df.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .partitionBy("date") \
    .save("s3://lakehouse/curated/events")

# Query with Spark
spark.read.format("delta") \
    .load("s3://lakehouse/curated/events") \
    .filter("date = '2024-01-15'") \
    .show()

# Upsert: merge an incoming batch (updatesDF) into the table on id
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://lakehouse/curated/events")
deltaTable.alias("target").merge(
    updatesDF.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
```
### Example: Iceberg on GCS

```python
# Write to Iceberg (path-based "Hadoop" table shown for brevity;
# Iceberg tables are more commonly managed through a catalog)
df.write.format("iceberg") \
    .mode("overwrite") \
    .option("write-format", "parquet") \
    .partitionBy("date") \
    .save("gs://lakehouse/curated/events")

# Query with Spark
spark.read.format("iceberg") \
    .load("gs://lakehouse/curated/events") \
    .filter("date = '2024-01-15'") \
    .show()
```
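
In practice, Iceberg tables are usually addressed through a catalog rather than by raw path. A sketch using Spark's `DataFrameWriterV2` API; the catalog and table names are illustrative and assume an Iceberg catalog named `lake` has been configured in the Spark session:

```python
# Create or replace a catalog-managed Iceberg table from a DataFrame
df.writeTo("lake.curated.events").using("iceberg").createOrReplace()

# Query it like any other catalog table
spark.table("lake.curated.events") \
    .filter("date = '2024-01-15'") \
    .show()
```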
## Best Practices

### 1. Partitioning Strategy

Use time-based partitioning (see the layout sketch below). Benefits:

- Query pruning
- Lifecycle management
- Parallel processing
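
A typical date-partitioned layout on object storage looks like this (paths illustrative); a filter such as `date = '2024-01-15'` lets the engine skip every other directory:

```text
s3://lakehouse/curated/events/
├── date=2024-01-14/
│   ├── part-00000.parquet
│   └── part-00001.parquet
└── date=2024-01-15/
    └── part-00000.parquet
```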
### 2. Schema Evolution

Backward-compatible changes:

- Add optional fields
- Make required fields optional
- Never remove fields (deprecate instead)
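
With Delta Lake, adding a new optional column is a one-option change on the write path. A sketch, where `new_df` (an incoming batch carrying the extra column) and the path are illustrative:

```python
# Append a batch whose schema adds a new column; mergeSchema evolves
# the table schema instead of failing the write
new_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("s3://lakehouse/curated/events")
```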
### 3. Lifecycle Management

Automatically move old data between tiers:

- Hot (0-30 days): Active queries
- Warm (30-90 days): Occasional queries
- Cold (90+ days): Archive, compliance
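
On S3, tiering is usually a bucket lifecycle rule rather than a data pipeline. A minimal sketch with boto3; the bucket name, prefix, and day thresholds are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the raw prefix to cheaper storage classes as they age
s3.put_bucket_lifecycle_configuration(
    Bucket="lakehouse",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    },
)
```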
### 4. Metadata Management

Track:

- Schema versions
- Partition information
- Statistics (min/max values)
- Lineage
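
Table formats expose much of this metadata directly. With Delta Lake, for example (path illustrative):

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "s3://lakehouse/curated/events")

# Commit history: versions, timestamps, operations (also powers time travel)
deltaTable.history().select("version", "timestamp", "operation").show()

# Table detail: schema, partition columns, size statistics
spark.sql(
    "DESCRIBE DETAIL delta.`s3://lakehouse/curated/events`"
).show(truncate=False)
```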
## Comparison: Lakehouse vs Alternatives
| Aspect | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage Cost | Low | High | Low |
| Query Performance | Medium | High | High |
| Schema Flexibility | High | Low | Medium |
| ACID Transactions | No | Yes | Yes |
| Time Travel | No | Limited | Yes |
| Multi-Engine | Yes | No | Yes |
## When to Use Lakehouse

Use a Lakehouse when:

- ✅ You need cost-effective storage
- ✅ You need fast queries
- ✅ You need schema flexibility
- ✅ You need ACID transactions
- ✅ You want to avoid vendor lock-in

Don't use a Lakehouse when:

- ❌ The use case is simple (a warehouse is enough)
- ❌ Engineering resources are scarce (use a managed warehouse)
- ❌ The scale is small (a warehouse is simpler)
## Related Topics

- Storage - Storage fundamentals
- Data Ingestion - Getting data into the lakehouse
- Data Processing - Processing lakehouse data

Next: Ingestion Architecture →