Data Lakehouse

Overview

A Data Lakehouse combines the flexibility of a data lake with the reliability of a data warehouse by providing ACID transactions on top of open table formats.

Data Organization

Bronze Layer: Raw ingests (append-only)
Silver Layer: Cleansed & conformed tables
Gold Layer: Aggregated, business-ready tables
Table Formats: Delta Lake, Iceberg, Hudi
Metadata Catalog: Hive Metastore or an equivalent catalog service
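
To make the layering concrete, here is a minimal pure-Python sketch of a Bronze → Silver → Gold flow. The order records, field names, and the helper functions `to_silver` and `to_gold` are illustrative assumptions, not part of any table format or engine; in practice these steps would typically run as Spark jobs against lakehouse tables.

```python
from collections import defaultdict

# Bronze: raw, append-only events exactly as ingested (duplicates and bad values included).
bronze = [
    {"order_id": "1", "region": "eu", "amount": "10.50"},
    {"order_id": "1", "region": "eu", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "region": "US", "amount": "4.00"},
    {"order_id": "3", "region": "us", "amount": "oops"},   # unparseable amount
    {"order_id": "4", "region": "eu", "amount": "7.25"},
]


def to_silver(raw_rows):
    """Cleanse and conform: cast types, normalize values, drop bad rows, deduplicate."""
    seen, silver_rows = set(), []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row instead
        if row["order_id"] in seen:
            continue  # keep only the first copy of each order
        seen.add(row["order_id"])
        silver_rows.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].upper(),
            "amount": amount,
        })
    return silver_rows


def to_gold(silver_rows):
    """Aggregate to a business-ready table: revenue per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EU': 17.75, 'US': 4.0}
```

Bronze is never modified here; Silver and Gold are derived from it, so either layer can be rebuilt if the cleansing or aggregation rules change.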

Key Components & Patterns

Components

Table Formats: Delta, Iceberg, Hudi for ACID & time travel
Streaming Engine: Spark Structured Streaming
Object Storage: S3, ADLS
Batch Engine: Spark, Presto, Athena
Time Travel: Historical version queries
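
One way to picture how a table format can offer both atomic commits and time travel is an append-only log of commits, where each commit records the data files it adds and removes. The sketch below is a deliberately simplified model under that assumption; the `Commit` and `TableLog` classes and the file names are hypothetical and do not reproduce the actual on-disk layout of Delta Lake, Iceberg, or Hudi.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One atomic log entry: data files added and removed by a transaction."""
    added: frozenset
    removed: frozenset = frozenset()


@dataclass
class TableLog:
    commits: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Append one commit and return its version number (0-based)."""
        self.commits.append(Commit(frozenset(added), frozenset(removed)))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the live file set at that point."""
        if version is None:
            version = len(self.commits) - 1  # latest version
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for c in self.commits[: version + 1]:
            live -= c.removed
            live |= c.added
        return live


log = TableLog()
v0 = log.commit(added={"part-000.parquet", "part-001.parquet"})
v1 = log.commit(added={"part-002.parquet"})
# An update rewrites a file: the old file is removed and its replacement added in one commit.
v2 = log.commit(added={"part-003.parquet"}, removed={"part-000.parquet"})

print(sorted(log.snapshot(v1)))  # time travel: the table as of version 1
print(sorted(log.snapshot()))    # current state
```

Because old data files stay in object storage until they are explicitly cleaned up, reading an earlier version only requires replaying a shorter prefix of the log.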

Patterns

ACID Transactions: Reliable updates
Medallion Layers: Bronze → Silver → Gold
Unified Workloads: BI, ML, streaming
Compaction: Small-file optimization
Schema Evolution: Add/rename columns
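
As an illustration of the compaction pattern, the sketch below plans which small files to merge using a first-fit-decreasing bin-packing heuristic. The function `plan_compaction`, its thresholds, and the file names are assumptions made for this example; real engines expose their own compaction commands and choose their own strategies.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_mb=32):
    """Group small files into batches whose combined size stays within target_mb.

    file_sizes_mb maps file name -> size in MB. Files of small_mb or more are left alone.
    Returns a list of batches (lists of file names); each batch would become one output file.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < small_mb),
        key=lambda item: item[1],
        reverse=True,
    )
    batches = []  # each entry is [total_size_mb, [file names]]
    for name, size in small:
        # First-fit decreasing: place the file in the first batch that still has room.
        for batch in batches:
            if batch[0] + size <= target_mb:
                batch[0] += size
                batch[1].append(name)
                break
        else:
            batches.append([size, [name]])
    # Rewriting a single file gains nothing, so keep only batches that merge 2+ files.
    return [names for _, names in batches if len(names) > 1]


files = {"a": 30, "b": 20, "c": 200, "d": 10, "e": 25, "f": 5}
plan = plan_compaction(files, target_mb=64, small_mb=32)
print(plan)  # [['a', 'e', 'f'], ['b', 'd']]
```

In a transactional table, each planned batch would be written as one new file and swapped in with a single commit, so readers never observe a half-compacted table.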

Use Cases

Feature Store: Real-time ML features at Uber scale
Unified Analytics: BI + data science on same data
Time Travel Debugging: Historical data audits

Pros & Cons

Pros

✅ Unified platform for BI & ML
✅ ACID & time travel on open storage

Cons

⚠️ Query performance can trail a dedicated warehouse
⚠️ Requires expertise in table formats

Day-to-Day Operations

ACID Operations: Updates/deletes via SQL
Compaction Jobs: Background file merging
Time Travel Queries: Debug historical data
Schema Evolution: Online adds/renames
Hybrid Workloads: Mix batch, streaming, BI
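
Online column adds and renames are usually cheap because they change only table metadata, not the data files. One common approach is to track columns by a stable ID rather than by name. The sketch below assumes that approach; the schemas, the sample row, and the helpers `rename_column`, `add_column`, and `read_rows` are hypothetical and not the API of any specific table format.

```python
# Data files store values keyed by a stable column ID, not by column name.
schema_v1 = {1: "user_id", 2: "email"}
old_file_rows = [{1: 101, 2: "a@example.com"}]  # written under schema_v1


def rename_column(schema, old_name, new_name):
    """Return a new schema with one column renamed; column IDs are unchanged."""
    if old_name not in schema.values():
        raise KeyError(old_name)
    return {cid: (new_name if name == old_name else name) for cid, name in schema.items()}


def add_column(schema, name):
    """Return a new schema with an extra column under a fresh ID."""
    if name in schema.values():
        raise ValueError(f"column {name!r} already exists")
    # Simplification: max + 1 is only safe because this sketch never drops columns.
    new_id = max(schema, default=0) + 1
    return {**schema, new_id: name}


def read_rows(stored_rows, schema):
    """Project stored rows through a schema; columns absent from old files read as None."""
    return [{name: row.get(cid) for cid, name in schema.items()} for row in stored_rows]


schema_v2 = rename_column(schema_v1, "email", "contact_email")
schema_v3 = add_column(schema_v2, "country")

# The old file is read under the new schema without being rewritten.
print(read_rows(old_file_rows, schema_v3))
```

The same file remains readable under `schema_v1`, which is what lets time travel queries and schema evolution coexist.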
