Overview

Modern data architectures use specific optimization patterns to ensure performance, cost efficiency, and maintainability.

1. Medallion Architecture (Bronze→Silver→Gold)

What it is: A data organization pattern that progressively refines data through three layers.

🥉 Bronze Layer
  • Raw, unprocessed data
  • Exact copy from source
  • Append-only, immutable
  • All formats (JSON, CSV, Parquet)
🥈 Silver Layer
  • Cleaned & validated data
  • Standardized schemas
  • Deduplication applied
  • Business rules enforced
🥇 Gold Layer
  • Business-ready datasets
  • Aggregated metrics
  • Optimized for consumption
  • BI & ML ready
Why use it: Provides clear data lineage, enables incremental processing, and separates concerns between data engineering and business logic.

2. Compaction Jobs

What it is: Background processes that merge small files into larger, optimally-sized files.

The Problem
  • Streaming creates many small files
  • Small files = poor query performance
  • Metadata overhead increases
  • Storage inefficiency
The Solution
  • Merge 1000s of small files → few large files
  • Target: 128MB - 1GB per file
  • Run during low-usage windows
  • Automatic in modern table formats
Example:
Before: 10,000 files × 1MB = 10GB (slow queries)
After: 10 files × 1GB = 10GB (fast queries)

3. Vacuum Operations

What it is: Cleanup processes that remove old file versions and unused data files.

What Gets Cleaned
  • Old file versions (after updates/deletes)
  • Uncommitted transaction files
  • Temporary processing files
  • Expired time-travel snapshots
Configuration
  • Retention: Keep 7-30 days of history
  • Schedule: Daily/weekly cleanup
  • Safety: Never delete active files
  • Cost: Reduces storage costs significantly
Important: Vacuum operations are irreversible. Once old versions are deleted, time-travel queries to those points will fail.

4. Z-Ordering (Data Clustering)

What it is: A technique that physically reorganizes data to co-locate related information, dramatically improving query performance.

How It Works
  • Maps multi-dimensional data to 1D
  • Uses space-filling Z-curve algorithm
  • Co-locates related data in same files
  • Reduces data scanning during queries
When to Use
  • Queries filter on multiple columns
  • Range queries (date ranges, numeric ranges)
  • Join operations on clustered columns
  • Large tables with selective queries
Example:
Table: customer_orders (customer_id, order_date, amount)
Z-Order by: customer_id, order_date
Result: Orders for same customer + date range stored together
Query: "Orders for customer 123 in March" → scans 1 file instead of 100
Performance Impact: Can improve query performance by 10-100x for selective queries on large tables.

5. Liquid Clustering (Auto-Optimization)

What it is: Next-generation clustering that automatically optimizes data layout without manual intervention.

Key Features
  • Automatic: No manual OPTIMIZE commands
  • Incremental: Only processes changed data
  • Adaptive: Learns from query patterns
  • Cost-Effective: Optimizes during writes
Advantages over Z-Order
  • No expensive OPTIMIZE operations
  • Handles changing query patterns
  • Better for streaming workloads
  • Supports unlimited clustering keys
Setup Example:
CREATE TABLE sales (
  customer_id INT,
  order_date DATE,
  amount DECIMAL
) CLUSTER BY (customer_id, order_date);

// Data automatically clusters as it's written
Best Practice: Choose clustering keys based on your most common query filters. Start with 2-3 columns that appear frequently in WHERE clauses.

Pattern Selection Guide

Use Case Recommended Pattern Why
New data lake setup Medallion Architecture Provides structure and governance
Streaming data ingestion Compaction + Liquid Clustering Handles small files automatically
Large analytical tables Z-Ordering Dramatic query performance improvement
Cost optimization Vacuum Operations Reduces storage costs by 30-70%
Modern lakehouse All patterns combined Maximum performance and efficiency
🏠 Back to Home