Overview
A data lakehouse combines the flexibility of a data lake with the reliability of a data warehouse by layering ACID transactions and open table formats on top of low-cost object storage.
Data Organization
- Bronze Layer: Raw ingests (append-only)
- Silver Layer: Cleansed & conformed tables
- Gold Layer: Aggregated, business-ready tables
- Table Formats: Delta Lake, Iceberg, Hudi
- Metadata Catalog: Hive Metastore or another catalog service that tracks tables and schemas
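The Bronze, Silver, and Gold layers above can be illustrated with a minimal sketch in plain Python. It is only a toy: the rows, the column names (`order_id`, `country`, `amount`), and the helper functions `to_silver` and `to_gold` are assumptions made for illustration, and a real lakehouse would run these steps on an engine such as Spark against tables in one of the formats listed.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in the Bronze layer (append-only).
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},  # duplicate delivery
    {"order_id": "3", "country": "US", "amount": "oops"},  # malformed amount
]


def to_silver(rows):
    """Cleanse and conform: cast types, normalize values, drop bad rows and duplicates."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row instead
        if row["order_id"] in seen:
            continue  # keep only the first copy of each order
        seen.add(row["order_id"])
        out.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return out


def to_gold(rows):
    """Aggregate to a business-ready shape: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 10.5, 'DE': 7.25}
```

Bronze is never modified here; Silver and Gold are derived from it, which is what lets a pipeline rebuild downstream tables after a logic fix.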
Key Components & Patterns
Components
- Table Formats: Delta Lake, Iceberg, Hudi (provide ACID transactions and time travel)
- Streaming Engine: Spark Structured Streaming
- Object Storage: S3, ADLS
- Batch/Query Engine: Spark, Presto, Athena
- Time Travel: Queries against historical table versions
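One way to picture how a table format can offer both atomic commits and time travel on plain object storage is an append-only commit log, where each commit lists the data files it adds and removes. The sketch below is a toy model under that assumption; `ToyTable`, its methods, and the file names are hypothetical and do not reproduce the actual on-disk layout of Delta Lake, Iceberg, or Hudi.

```python
class ToyTable:
    """Toy append-only commit log; not the real layout of any table format."""

    def __init__(self):
        self.log = []  # the commit at index i defines table version i

    def commit(self, add=(), remove=()):
        """Record one commit as a single log entry and return its version number."""
        self.log.append({"add": set(add), "remove": set(remove)})
        return len(self.log) - 1

    def files_as_of(self, version=None):
        """Replay the log up to `version` (latest if None) to find live data files."""
        if version is None:
            version = len(self.log) - 1
        live = set()
        for entry in self.log[: version + 1]:
            live -= entry["remove"]
            live |= entry["add"]
        return live


table = ToyTable()
v0 = table.commit(add=["part-000.parquet"])
v1 = table.commit(add=["part-001.parquet"])
# An update rewrites part-000 into a new file instead of editing it in place.
v2 = table.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

print(sorted(table.files_as_of()))    # ['part-001.parquet', 'part-002.parquet']
print(sorted(table.files_as_of(v1)))  # ['part-000.parquet', 'part-001.parquet']
```

Because old data files are only dropped from the live set and not deleted immediately, reading an earlier version is just a matter of replaying fewer log entries.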
Patterns
- ACID Transactions: reliable updates
- Medallion Layers: Bronze→Silver→Gold
- Unified Workloads: BI, ML, streaming
- Compaction: small-file optimization
- Schema Evolution: add/rename columns
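The compaction pattern above comes down to deciding which small files to merge into larger ones. The following is a rough sketch of that planning step using a first-fit-decreasing heuristic; the function `plan_compaction`, the 128 MB target, and the file sizes are all assumptions for illustration, and real engines use their own strategies and thresholds.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into bins of at most target_mb (first-fit decreasing).

    Files already at or above the target size are left alone.
    Returns a list of file-name groups, each to be rewritten as one file.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < target_mb),
        key=lambda item: (-item[1], item[0]),  # largest first, name breaks ties
    )
    bins = []  # each bin is [total_size_mb, [file names]]
    for name, size in small:
        for current in bins:
            if current[0] + size <= target_mb:
                current[0] += size
                current[1].append(name)
                break
        else:
            bins.append([size, [name]])
    # A bin holding a single file gains nothing from being rewritten.
    return [names for _, names in bins if len(names) > 1]


# Hypothetical file sizes in MB.
files = {"a": 100, "b": 20, "c": 60, "d": 50, "e": 200, "f": 8}
print(plan_compaction(files))  # [['a', 'b', 'f'], ['c', 'd']]
```

In a table format with a commit log, each planned group would then be committed as "add one merged file, remove the originals" so that readers never see a half-compacted table.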
Use Cases
- Feature Store: Real-time ML features at very large scale (for example, Uber-scale workloads)
- Unified Analytics: BI + data science on same data
- Time Travel Debugging: Historical data audits
Pros & Cons
Pros
- ✅ Unified platform for BI & ML
- ✅ ACID & time travel on open storage
Cons
- ⚠️ Query performance can lag behind a dedicated warehouse
- ⚠️ Requires expertise in table formats and their maintenance
Day-to-Day Operations
- ACID Operations: Updates/deletes via SQL
- Compaction Jobs: Background file merging
- Time Travel Queries: Debug historical data
- Schema Evolution: Online adds/renames
- Hybrid Workloads: Mix batch, streaming, BI
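Online schema evolution, as listed above, is easiest to reason about when data files key their values by a stable column ID instead of a column name (Iceberg, for example, tracks columns by ID). The sketch below is a toy model of that idea only; the dict-based schema and the helpers `rename_column`, `add_column`, and `read` are hypothetical and are not any format's real API.

```python
# Stored rows key each value by a stable column ID, not by column name.
data_file = [{1: 101, 2: "alice"}, {1: 102, 2: "bob"}]

schema_v1 = {1: "id", 2: "name"}  # column ID -> current column name


def rename_column(schema, old, new):
    """Return a new schema with one column renamed; IDs and data files are untouched."""
    if old not in schema.values():
        raise KeyError(old)
    return {cid: (new if name == old else name) for cid, name in schema.items()}


def add_column(schema, name):
    """Return a new schema with a fresh column ID; older files simply lack that ID.

    Simplification: a real format would remember the highest ID ever assigned
    so that IDs of dropped columns are never reused.
    """
    if name in schema.values():
        raise ValueError(f"column already exists: {name}")
    return {**schema, max(schema, default=0) + 1: name}


def read(rows, schema):
    """Project stored rows through a schema; columns missing from a file read as None."""
    return [{name: row.get(cid) for cid, name in schema.items()} for row in rows]


schema_v2 = add_column(rename_column(schema_v1, "name", "full_name"), "email")
print(read(data_file, schema_v2))
# [{'id': 101, 'full_name': 'alice', 'email': None},
#  {'id': 102, 'full_name': 'bob', 'email': None}]
```

Both changes are metadata-only in this model: the stored rows are never rewritten, and the old schema can still be used to read the same file.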