Overview
A data lake ingests raw data in any format into low-cost object storage and applies schema-on-read: structure is imposed at query time rather than at ingestion. It serves as the central repository for both batch and streaming data.
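Schema-on-read can be sketched in a few lines: the landing zone accepts whatever records arrive, and the reader supplies the schema later. This is a minimal illustration; the file contents, field names, and defaulting policy are all assumptions, not part of any lake product.

```python
import io
import json

# Hypothetical raw landing-zone file: heterogeneous JSON lines, nothing
# enforced at write time (records and fields here are illustrative).
raw_file = io.StringIO(
    '{"user": "a1", "event": "click", "ts": "2024-01-01T00:00:00"}\n'
    '{"user": "b2", "event": "view"}\n'  # missing ts -- still accepted
)

# Schema-on-read: the consumer decides which fields matter and fills in
# None for anything the raw record lacks.
schema = {"user": str, "event": str, "ts": str}

def read_with_schema(line, schema):
    record = json.loads(line)
    return {field: ftype(record[field]) if field in record else None
            for field, ftype in schema.items()}

rows = [read_with_schema(line, schema) for line in raw_file]
```

A warehouse would have rejected the second record at load time; here it lands intact and the gap surfaces only when a reader asks for `ts`.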
Data Organization
- Landing Zone: entry point for batch ingestion (ADF, AWS Glue) and streaming ingestion (Kafka, Event Hubs)
- Raw Zone (Bronze): Immutable files (Parquet, ORC, JSON, CSV, multimedia)
- Curated Zone (Silver): Cleaned, conformed views/tables
- Business Zone (Gold): Aggregated, business-ready datasets
- Data Catalog: Hive Metastore, Glue Catalog, Azure Purview
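The zones above typically map to object-store prefixes. A minimal sketch, assuming a simple `<zone>/<dataset>/dt=<date>/` key convention (the zone names and layout are illustrative, not a standard):

```python
from datetime import date

# Illustrative prefix mapping for the medallion zones; real layouts vary
# by team (these folder names are assumptions).
ZONES = {"bronze": "raw", "silver": "curated", "gold": "business"}

def zone_path(zone: str, dataset: str, dt: date) -> str:
    """Build an object-store key prefix like raw/clickstream/dt=2024-06-01/."""
    return f"{ZONES[zone]}/{dataset}/dt={dt.isoformat()}/"

print(zone_path("bronze", "clickstream", date(2024, 6, 1)))
# raw/clickstream/dt=2024-06-01/
```

Keeping the date in the key (`dt=...`) is what later enables partition pruning and incremental loads.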
Key Components & Patterns
Components
- Object Storage: S3, ADLS Gen2
- Compute Engines: Spark, Presto/Trino (often run on managed platforms such as EMR)
- Metadata Management: Automated crawling & profiling
- Security & Governance: ACLs, encryption, IAM
- Table Formats: Delta Lake, Iceberg, Hudi
Patterns
- Bronze→Silver→Gold medallion layers
- Schema-on-Read flexibility
- Batch & Streaming integration
- Incremental Loads via partitions
- Sandboxing for prototyping
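The incremental-load pattern reduces to selecting only partition folders newer than the last processed watermark. A sketch with hypothetical folder names:

```python
# Incremental load sketch: load only partitions past the watermark.
# Partition names are illustrative; lexicographic comparison works here
# because dt= encodes ISO dates.
partitions = ["dt=2024-06-01", "dt=2024-06-02", "dt=2024-06-03"]
watermark = "dt=2024-06-01"  # last partition already processed

to_load = [p for p in partitions if p > watermark]
# to_load -> ["dt=2024-06-02", "dt=2024-06-03"]
```

After a successful run, the job persists the newest loaded partition as the next watermark, so reruns are idempotent.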
Use Cases
- Log Analytics: Fraud detection (clickstreams → ADLS Gen2)
- IoT Telemetry: Predictive maintenance on sensor data
- Data Science Sandboxes: Exploratory ML on raw datasets
Pros & Cons
Pros
- ✅ Ultra-low cost storage & massive scale
- ✅ Rapid ingestion & experimentation
Cons
- ⚠️ Requires strong governance and a data catalog, or it degrades into a data swamp
- ⚠️ Query performance over raw files needs optimization (partitioning, compaction, open table formats)
Day-to-Day Operations
- Staging Offload: ETL staging on object storage
- Sandbox Provisioning: Quick Spark/Presto clusters
- Batch Window: Scheduled off-hours loads
- Raw Data Retention: low-cost storage allows long-term (effectively unlimited) retention of raw history
- Incremental Loads: Partitioned folders
- Swamp Prevention: Metadata tagging & retention
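Swamp prevention via metadata tagging can be as simple as a periodic check that every dataset carries required tags. A sketch, with hypothetical dataset names and tag keys:

```python
# Swamp-prevention sketch: flag datasets missing ownership or retention
# metadata (dataset names and tag keys are illustrative).
catalog = {
    "raw/clickstream": {"owner": "web-team", "retention_days": 365},
    "raw/tmp_export": {},  # landed without metadata -- a swamp candidate
}

REQUIRED_TAGS = {"owner", "retention_days"}

untagged = [name for name, tags in catalog.items()
            if not REQUIRED_TAGS <= tags.keys()]
# untagged -> ["raw/tmp_export"]
```

Datasets that stay on the untagged list past a grace period are candidates for quarantine or deletion under the retention policy.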