Overview
A data lake ingests raw data in any format into low-cost object storage and applies schema-on-read: structure is imposed at query time rather than at ingestion. It serves as the central repository for both batch and streaming data.
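Schema-on-read can be sketched in a few lines: the landing zone accepts whatever records arrive, and the reader supplies the schema later. This is a minimal illustration; the file contents, field names, and defaulting policy are all assumptions, not part of any lake product.

```python
import io
import json

# Hypothetical raw landing-zone file: heterogeneous JSON lines, nothing
# enforced at write time (records and fields here are illustrative).
raw_file = io.StringIO(
    '{"user": "a1", "event": "click", "ts": "2024-01-01T00:00:00"}\n'
    '{"user": "b2", "event": "view"}\n'  # missing ts -- still accepted
)

# Schema-on-read: the consumer decides which fields matter and fills in
# None for anything the raw record lacks.
schema = {"user": str, "event": str, "ts": str}

def read_with_schema(line, schema):
    record = json.loads(line)
    return {field: ftype(record[field]) if field in record else None
            for field, ftype in schema.items()}

rows = [read_with_schema(line, schema) for line in raw_file]
```

A warehouse would have rejected the second record at load time; here it lands intact and the gap surfaces only when a reader asks for `ts`.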
Data Organization
- Landing Zone: entry point for batch ingestion (ADF, AWS Glue) and streaming ingestion (Kafka, Event Hubs)
- Raw Zone (Bronze): Immutable files (Parquet, ORC, JSON, CSV, multimedia)
- Curated Zone (Silver): Cleaned, conformed views/tables
- Business Zone (Gold): Aggregated, business-ready datasets
- Data Catalog: Hive Metastore, Glue Catalog, Azure Purview
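The zones above typically map to object-store prefixes. A minimal sketch, assuming a simple `<zone>/<dataset>/dt=<date>/` key convention (the zone names and layout are illustrative, not a standard):

```python
from datetime import date

# Illustrative prefix mapping for the medallion zones; real layouts vary
# by team (these folder names are assumptions).
ZONES = {"bronze": "raw", "silver": "curated", "gold": "business"}

def zone_path(zone: str, dataset: str, dt: date) -> str:
    """Build an object-store key prefix like raw/clickstream/dt=2024-06-01/."""
    return f"{ZONES[zone]}/{dataset}/dt={dt.isoformat()}/"

print(zone_path("bronze", "clickstream", date(2024, 6, 1)))
# raw/clickstream/dt=2024-06-01/
```

Keeping the date in the key (`dt=...`) is what later enables partition pruning and incremental loads.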
Key Components & Patterns
Components
- Object Storage: S3, ADLS Gen2
- Compute Engines: Spark, Presto/Trino (often run on managed platforms such as EMR)
- Metadata Management: Automated crawling & profiling
- Security & Governance: ACLs, encryption, IAM
- Table Formats: Delta Lake, Iceberg, Hudi
Patterns
- Bronze→Silver→Gold medallion layers
- Schema-on-Read flexibility
- Batch & Streaming integration
- Incremental Loads via partitions
- Sandboxing for prototyping
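The incremental-load pattern reduces to selecting only partition folders newer than the last processed watermark. A sketch with hypothetical folder names:

```python
# Incremental load sketch: load only partitions past the watermark.
# Partition names are illustrative; lexicographic comparison works here
# because dt= encodes ISO dates.
partitions = ["dt=2024-06-01", "dt=2024-06-02", "dt=2024-06-03"]
watermark = "dt=2024-06-01"  # last partition already processed

to_load = [p for p in partitions if p > watermark]
# to_load -> ["dt=2024-06-02", "dt=2024-06-03"]
```

After a successful run, the job persists the newest loaded partition as the next watermark, so reruns are idempotent.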
Use Cases
- Log Analytics: Fraud detection (clickstreams → ADLS Gen2)
- IoT Telemetry: Predictive maintenance on sensor data
- Data Science Sandboxes: Exploratory ML on raw datasets
Pros & Cons
Pros
- ✅ Ultra-low cost storage & massive scale
- ✅ Rapid ingestion & experimentation
Cons
- ⚠️ Requires strong governance and a data catalog, or it degrades into a data swamp
- ⚠️ Query performance over raw files needs optimization (partitioning, compaction, open table formats)
Day-to-Day Operations
- Staging Offload: ETL staging on object storage
- Sandbox Provisioning: Quick Spark/Presto clusters
- Batch Window: Scheduled off-hours loads
- Raw Data Retention: low-cost storage allows long-term (effectively unlimited) retention of raw history
- Incremental Loads: Partitioned folders
- Swamp Prevention: Metadata tagging & retention
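Swamp prevention via metadata tagging can be as simple as a periodic check that every dataset carries required tags. A sketch, with hypothetical dataset names and tag keys:

```python
# Swamp-prevention sketch: flag datasets missing ownership or retention
# metadata (dataset names and tag keys are illustrative).
catalog = {
    "raw/clickstream": {"owner": "web-team", "retention_days": 365},
    "raw/tmp_export": {},  # landed without metadata -- a swamp candidate
}

REQUIRED_TAGS = {"owner", "retention_days"}

untagged = [name for name, tags in catalog.items()
            if not REQUIRED_TAGS <= tags.keys()]
# untagged -> ["raw/tmp_export"]
```

Datasets that stay on the untagged list past a grace period are candidates for quarantine or deletion under the retention policy.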