Architecture Patterns Explained

Overview

Modern data architectures use specific optimization patterns to ensure performance, cost efficiency, and maintainability.

1. Medallion Architecture (Bronze→Silver→Gold)

What it is: A data organization pattern that progressively refines data through three layers.

🥉 Bronze Layer

Raw, unprocessed data
Exact copy from source
Append-only, immutable
All formats (JSON, CSV, Parquet)

🥈 Silver Layer

Cleaned & validated data
Standardized schemas
Deduplication applied
Business rules enforced

🥇 Gold Layer

Business-ready datasets
Aggregated metrics
Optimized for consumption
BI & ML ready

Why use it: Provides clear data lineage, enables incremental processing, and separates concerns between data engineering and business logic.

2. Compaction Jobs

What it is: Background processes that merge small files into larger, optimally-sized files.

The Problem

Streaming creates many small files
Small files = poor query performance
Metadata overhead increases
Storage inefficiency

The Solution

Merge 1000s of small files → few large files
Target: 128MB - 1GB per file
Run during low-usage windows
Automatic in modern table formats

                    Example:

                    Before: 10,000 files × 1MB = 10GB (slow queries)

                    After: 10 files × 1GB = 10GB (fast queries)

3. Vacuum Operations

What it is: Cleanup processes that remove old file versions and unused data files.

What Gets Cleaned

Old file versions (after updates/deletes)
Uncommitted transaction files
Temporary processing files
Expired time-travel snapshots

Configuration

Retention: Keep 7-30 days of history
Schedule: Daily/weekly cleanup
Safety: Never delete active files
Cost: Reduces storage costs significantly

Important: Vacuum operations are irreversible. Once old versions are deleted, time-travel queries to those points will fail.

4. Z-Ordering (Data Clustering)

What it is: A technique that physically reorganizes data to co-locate related information, dramatically improving query performance.

How It Works

Maps multi-dimensional data to 1D
Uses space-filling Z-curve algorithm
Co-locates related data in same files
Reduces data scanning during queries

When to Use

Queries filter on multiple columns
Range queries (date ranges, numeric ranges)
Join operations on clustered columns
Large tables with selective queries

                    Example:

                    Table: customer_orders (customer_id, order_date, amount)

                    Z-Order by: customer_id, order_date

                    Result: Orders for same customer + date range stored together

                    Query: "Orders for customer 123 in March" → scans 1 file instead of 100

Performance Impact: Can improve query performance by 10-100x for selective queries on large tables.

5. Liquid Clustering (Auto-Optimization)

What it is: Next-generation clustering that automatically optimizes data layout without manual intervention.

Key Features

Automatic: No manual OPTIMIZE commands
Incremental: Only processes changed data
Adaptive: Learns from query patterns
Cost-Effective: Optimizes during writes

Advantages over Z-Order

No expensive OPTIMIZE operations
Handles changing query patterns
Better for streaming workloads
Supports unlimited clustering keys

                    Setup Example:

                    CREATE TABLE sales (

                      customer_id INT,

                      order_date DATE,

                      amount DECIMAL

                    ) CLUSTER BY (customer_id, order_date);

                    // Data automatically clusters as it's written

Best Practice: Choose clustering keys based on your most common query filters. Start with 2-3 columns that appear frequently in WHERE clauses.

Pattern Selection Guide

Use Case	Recommended Pattern	Why
New data lake setup	Medallion Architecture	Provides structure and governance
Streaming data ingestion	Compaction + Liquid Clustering	Handles small files automatically
Large analytical tables	Z-Ordering	Dramatic query performance improvement
Cost optimization	Vacuum Operations	Reduces storage costs by 30-70%
Modern lakehouse	All patterns combined	Maximum performance and efficiency

🏠 Back to Home

🏗️ Architecture Patterns Explained