Apache Iceberg with Python
A comprehensive hands-on learning repository for Apache Iceberg, designed for data engineers and professionals working with modern data lake architectures.
Overview
Apache Iceberg is a table format for large analytical datasets. Think of it as a specification for organizing collections of files in object storage so that they behave like a proper database table, with ACID transactions, schema evolution, and time travel capabilities.
Key Learning Objectives
- Understand table formats vs file formats (Iceberg vs Parquet)
- Master ACID transactions in data lakes
- Implement schema evolution without breaking existing queries
- Use time travel for data versioning and rollback
- Integrate with MinIO object storage for production patterns
Prerequisites
- Python 3.12+
- Basic SQL knowledge
- Understanding of data formats (CSV, JSON, Parquet)
- Familiarity with object storage concepts
Quick Start
# Clone the repository
git clone https://github.com/hardwaylabs/learn-iceberg-python.git
cd learn-iceberg-python
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Start with the ETL demo
cd iceberg-etl-demo
uv sync
# Generate sample data and run first tutorial
uv run src/generate_logs.py
uv run src/01_create_table.py
Learning Path
1. ETL Demo
Focus: Fundamentals of Iceberg table operations
- Create your first Iceberg table
- Load data from CSV to Parquet
- Schema evolution in practice
- Time travel queries
- CLI tools for exploration
Time commitment: 2-4 hours
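As a preview of the load step above, writes in PyIceberg go through Arrow. A minimal sketch of the core write path (file and table names are illustrative; the repo's scripts wrap this):
import pyarrow.csv as pv
from pyiceberg.catalog import load_catalog
# Read the generated CSV into an Arrow table, then append it;
# PyIceberg writes Parquet data files and commits a new snapshot
# (the Arrow schema must match the Iceberg table schema)
arrow_table = pv.read_csv("data/access_logs.csv")
table = load_catalog("local").load_table("logs.access_logs")
table.append(arrow_table)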
2. MinIO Integration
Focus: Production deployment patterns
- Connect Iceberg to MinIO object storage
- S3-compatible configuration
- Local development vs production patterns
- Performance optimization
- Monitoring and debugging
Time commitment: 4-6 hours
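To make the S3-compatible configuration concrete, here is a minimal sketch of a PyIceberg catalog pointed at a local MinIO instance. The endpoint, bucket, credentials, and SQLite catalog backend are illustrative defaults, not necessarily this repo's exact setup:
from pyiceberg.catalog import load_catalog
# SQLite-backed catalog; table data and metadata live in a MinIO bucket
catalog = load_catalog(
    "minio_demo",
    type="sql",
    uri="sqlite:///catalog.db",
    warehouse="s3://warehouse/",
    **{
        "s3.endpoint": "http://localhost:9000",  # MinIO's default API port
        "s3.access-key-id": "minioadmin",        # illustrative credentials
        "s3.secret-access-key": "minioadmin",
    },
)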
3. Real-time Streaming (Coming Soon)
Focus: Modern data pipeline architectures
- Stream processing with Iceberg
- Late-arriving data handling
- Exactly-once semantics
- Integration with Kafka/Kinesis
4. Analytics Workbench (Coming Soon)
Focus: Multi-engine data analysis
- Query same data with DuckDB, Spark, Trino
- Performance comparisons
- Query optimization techniques
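Even before that lab lands, PyIceberg can hand a table's current snapshot straight to DuckDB; a small sketch (catalog and table names follow the examples below):
from pyiceberg.catalog import load_catalog
# Materialize the current snapshot and register it as a DuckDB view
table = load_catalog("local").load_table("logs.access_logs")
con = table.scan().to_duckdb(table_name="access_logs")
print(con.execute("SELECT COUNT(*) FROM access_logs").fetchone())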
Key Concepts
Iceberg vs Parquet
Common confusion: Are they competing formats?
Reality: They work at different layers:
- Parquet = File format (how individual files store data)
- Iceberg = Table format (how collections of files are organized)
Think of it this way: Parquet is the individual books; Iceberg is the library's catalog system.
Core Architecture
Iceberg’s architecture is similar to Docker container images:
Docker Image │ Iceberg Table
├── manifest.json │ ├── metadata.json
├── config.json │ ├── version-hint.text
└── layers/ (tar.gz) │ └── data/ (parquet files)
Both use:
- Layered architecture with file reuse
- Immutable artifacts
- Metadata-driven assembly
- Content addressing
- Incremental updates
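This layering is visible directly from PyIceberg (a sketch, assuming the table created in the tutorials):
from pyiceberg.catalog import load_catalog
# The current metadata.json pins the table state; each snapshot points
# at a manifest list, which in turn points at the data files
table = load_catalog("local").load_table("logs.access_logs")
print(table.metadata_location)
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.manifest_list)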
Why Object Storage + Iceberg?
The combination provides:
From Object Storage:
- Massive scalability
- Decoupled compute and storage
- Multi-engine access
- Cloud-native design
From Iceberg:
- ACID transactions
- Schema evolution
- Time travel
- Snapshot isolation
- Performance optimizations
Real-World Use Cases
Data Lake Modernization
Transform existing data lakes into reliable, ACID-compliant systems without vendor lock-in.
Financial Data
Handle complex audit requirements with immutable snapshots and complete change history.
IoT and Time-Series
Efficiently manage high-volume sensor data with hidden partitioning and metadata-driven file pruning.
Data Science Workflows
Enable reproducible analysis with time travel queries and schema evolution.
Project Structure
learn-iceberg-python/
├── iceberg-etl-demo/ # Basic Iceberg operations
│ ├── src/
│ │ ├── generate_logs.py
│ │ ├── 01_create_table.py
│ │ ├── 02_load_data.py
│ │ └── 03_time_travel.py
│ └── data/
├── iceberg-minio-demo/ # MinIO integration
│ ├── docker-compose.yml
│ ├── src/
│ └── config/
└── docs/ # Additional documentation
Common Operations
Creating Tables
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestampType

schema = Schema(
    NestedField(field_id=1, name="timestamp", field_type=TimestampType(), required=True),
    NestedField(field_id=2, name="message", field_type=StringType(), required=False),
)
catalog = load_catalog("local")
table = catalog.create_table(
    "logs.access_logs",
    schema=schema,
    # Partition by day of the "timestamp" column (source_id=1 above)
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
    ),
)
Schema Evolution
from pyiceberg.types import StringType

# Add a new column without breaking existing queries
table.update_schema().add_column(
    "user_agent", StringType(), doc="User agent string"
).commit()
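Other evolution operations use the same builder, which also works as a context manager that commits on exit; for example (column name is illustrative):
# Rename is a metadata-only change; existing data files are untouched
with table.update_schema() as update:
    update.rename_column("user_agent", "agent")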
Time Travel
# Query the table as of an earlier snapshot
# (here the one before the latest; assumes at least two snapshots exist)
snapshot_id = table.history()[-2].snapshot_id
df = table.scan(snapshot_id=snapshot_id).to_pandas()
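To see which snapshots exist before picking one, print the snapshot log:
# Each history entry records a snapshot id and when it became current
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)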
Troubleshooting
Common Issues
Import errors: Ensure you're using Python 3.12+ and have run uv sync in the demo directory
MinIO connection: Check that Docker is running and the MinIO ports are not already in use
Performance: Use partition pruning and file statistics, as sketched below
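For the performance point, the key is to express filters in the scan itself so Iceberg can skip whole partitions and files; a sketch using the tutorial table:
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual
# Filters passed to scan() are checked against partition values and
# file-level column statistics, so non-matching files are never opened
table = load_catalog("local").load_table("logs.access_logs")
scan = table.scan(row_filter=GreaterThanOrEqual("timestamp", "2024-01-01T00:00:00"))
df = scan.to_pandas()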
Next Steps
After completing these labs, you'll be ready to:
- Design data lake architectures with Iceberg
- Migrate existing data lakes to Iceberg
- Build real-time data pipelines
- Implement data governance with time travel
Resources
- Apache Iceberg documentation: https://iceberg.apache.org/
- PyIceberg documentation: https://py.iceberg.apache.org/
Support
- Open issues on GitHub
- Join the Apache Iceberg Slack community