Overview

Modern table formats (Delta Lake, Apache Iceberg, Apache Hudi) bring ACID transactions, schema evolution, and time travel to data lakes, enabling lakehouse architectures.

Format Comparison

Delta Lake

  • Origin: Databricks (2019)
  • Storage: Parquet + transaction log
  • Strengths: Mature ecosystem, streaming
  • Best For: Spark-heavy workloads
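
A minimal PySpark sketch of the Parquet-plus-transaction-log model and version-based time travel, assuming Spark is launched with the delta-spark package; the /tmp/events path is illustrative.

```python
from pyspark.sql import SparkSession

# Enable Delta Lake's SQL extensions and catalog (assumes delta-spark is installed).
spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write becomes a commit in the _delta_log transaction log.
spark.range(100).write.format("delta").mode("overwrite").save("/tmp/events")  # version 0
spark.range(200).write.format("delta").mode("overwrite").save("/tmp/events")  # version 1

# Time travel: read the table as of the first commit.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
print(v0.count())  # 100
```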

Apache Iceberg

  • Origin: Netflix → Apache (2018)
  • Storage: Parquet/ORC/Avro data files + metadata tree
  • Strengths: Engine agnostic, hidden partitioning
  • Best For: Multi-engine environments
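
A minimal PySpark sketch of Iceberg's hidden partitioning, assuming Spark is launched with the iceberg-spark-runtime package; the local Hadoop catalog, warehouse path, and table/column names are illustrative.

```python
from pyspark.sql import SparkSession

# Configure a local Hadoop catalog for Iceberg tables.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: partition by days(event_ts). Writers and readers only
# see event_ts; Iceberg derives the partition values and prunes automatically.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Filtering on event_ts is enough for partition pruning -- no derived column needed.
spark.sql("""
    SELECT * FROM local.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()
```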

Apache Hudi

  • Origin: Uber → Apache (2016)
  • Storage: Base + log files
  • Strengths: Incremental processing, CDC
  • Best For: Real-time updates
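
A minimal PySpark sketch of Hudi's upsert path, assuming Spark is launched with the hudi-spark bundle; the table name, /tmp/rides path, and columns are illustrative.

```python
from pyspark.sql import SparkSession

# Hudi relies on Kryo serialization; assumes the hudi-spark bundle is on the classpath.
spark = (
    SparkSession.builder
    .appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Record key, partition path, and precombine field drive Hudi's upsert semantics.
hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

cols = ["ride_id", "city", "status", "updated_at"]

# Initial load creates the table.
inserts = spark.createDataFrame([(1, "sf", "started", "2024-01-01 11:00:00")], cols)
inserts.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/rides")

# Upsert: the existing ride_id is updated in place rather than duplicated.
updates = spark.createDataFrame([(1, "sf", "completed", "2024-01-01 12:00:00")], cols)
updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/rides")
```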

Key Capabilities

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| ACID Transactions | ✅ Full ACID | ✅ Full ACID | ✅ Full ACID |
| Time Travel | ✅ Version & timestamp | ✅ Snapshot-based | ✅ Point-in-time |
| Schema Evolution | ✅ Add/rename columns | ✅ Full evolution | ✅ Schema registry |
| Hidden Partitioning | ❌ Manual partitioning | ✅ Automatic | ❌ Explicit partition fields |
| Streaming Support | ✅ Native streaming | ✅ Via Flink/Spark | ✅ Real-time ingestion |
| Engine Support | Spark, Presto, Athena | Spark, Flink, Trino, Athena | Spark, Presto, Hive |
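
As a concrete instance of the schema-evolution row, a sketch of additive column evolution using Delta Lake's mergeSchema write option; it reuses the Spark-with-Delta session from the Delta Lake sketch above, and the /tmp/users path and column names are illustrative.

```python
from pyspark.sql import Row

# Initial table with two columns.
spark.createDataFrame([Row(id=1, name="ada")]) \
    .write.format("delta").mode("overwrite").save("/tmp/users")

# A later batch carries a new column; mergeSchema widens the table schema
# instead of failing the write, and earlier rows read the new column as null.
spark.createDataFrame([Row(id=2, name="bob", country="DE")]) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/users")

spark.read.format("delta").load("/tmp/users").printSchema()  # now includes country
```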

Architecture Patterns

Common Patterns

  • Medallion Architecture: Bronze→Silver→Gold
  • Compaction Jobs: Small file optimization (see the sketch after this list)
  • Vacuum Operations: Old version cleanup
  • Z-Ordering: Data clustering
  • Liquid Clustering: Auto-optimization
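
A sketch of the compaction, Z-ordering, and vacuum patterns on a Delta table, reusing the Spark-with-Delta session and /tmp/events table from the Delta Lake sketch above; the clustering column and retention window are illustrative (the SQL forms are available in open-source Delta Lake 2.0+).

```python
# Compaction + Z-ordering: rewrite small files and co-locate rows by a filter column.
spark.sql("OPTIMIZE delta.`/tmp/events` ZORDER BY (id)")

# Vacuum: delete data files no longer referenced by versions older than 7 days (168 hours).
spark.sql("VACUUM delta.`/tmp/events` RETAIN 168 HOURS")
```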

Use Case Fit

  • Delta Lake: Databricks ecosystem, streaming
  • Iceberg: Multi-cloud, engine flexibility
  • Hudi: CDC, incremental processing
  • All Formats: Replace Hive tables

Implementation Considerations

Key Decisions

  • File Size: Target 128 MB-1 GB per file
  • Partitioning: Balance partition pruning against small-file overhead
  • Compaction: Schedule during low-usage windows
  • Retention: Configure vacuum/expire policies
  • Clustering: Choose columns for Z-order/liquid clustering
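
A sketch of expressing the retention decision as Delta table properties, again reusing the Spark-with-Delta session and /tmp/events table from above; the 30-day and 7-day windows are illustrative starting points, not recommendations.

```python
# logRetentionDuration bounds how far back time travel can reach;
# deletedFileRetentionDuration bounds how soon VACUUM may remove unreferenced data files.
spark.sql("""
    ALTER TABLE delta.`/tmp/events` SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")
```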

Real-World Examples
