Apache Iceberg with Python

A comprehensive hands-on learning repository for Apache Iceberg, designed for data engineers and professionals working with modern data lake architectures.

Overview

Apache Iceberg is a table format for large analytical datasets. Think of it as a specification for organizing collections of files in object storage to behave like a proper database table with ACID transactions, schema evolution, and time travel capabilities.

Key Learning Objectives

  • Understand table formats vs file formats (Iceberg vs Parquet)
  • Master ACID transactions in data lakes
  • Implement schema evolution without breaking existing queries
  • Use time travel for data versioning and rollback
  • Integrate with MinIO object storage for production patterns

Prerequisites

  • Python 3.12+
  • Basic SQL knowledge
  • Understanding of data formats (CSV, JSON, Parquet)
  • Familiarity with object storage concepts

Quick Start

# Clone the repository
git clone https://github.com/hardwaylabs/learn-iceberg-python.git
cd learn-iceberg-python

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Start with the ETL demo
cd iceberg-etl-demo
uv sync

# Generate sample data and run first tutorial
uv run src/generate_logs.py
uv run src/01_create_table.py

Learning Path

1. ETL Demo

Focus: Fundamentals of Iceberg table operations

  • Create your first Iceberg table
  • Load data from CSV to Parquet
  • Schema evolution in practice
  • Time travel queries
  • CLI tools for exploration

Time commitment: 2-4 hours

2. MinIO Integration

Focus: Production deployment patterns

  • Connect Iceberg to MinIO object storage
  • S3-compatible configuration
  • Local development vs production patterns
  • Performance optimization
  • Monitoring and debugging

Time commitment: 4-6 hours

3. Real-time Streaming (Coming Soon)

Focus: Modern data pipeline architectures

  • Stream processing with Iceberg
  • Late-arriving data handling
  • Exactly-once semantics
  • Integration with Kafka/Kinesis

4. Analytics Workbench (Coming Soon)

Focus: Multi-engine data analysis

  • Query same data with DuckDB, Spark, Trino
  • Performance comparisons
  • Query optimization techniques

Key Concepts

Iceberg vs Parquet

Common confusion: Are they competing formats?

Reality: They work at different layers:

  • Parquet = File format (how individual files store data)
  • Iceberg = Table format (how collections of files are organized)

Think of it this way: Parquet files are the individual books; Iceberg is the library's catalog system. The sketch below makes this concrete by listing the Parquet files a single Iceberg table manages.
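
A minimal PyIceberg sketch of the split in practice. It assumes a catalog named "local" and the logs.access_logs table from the tutorials; adjust both to your setup.

from pyiceberg.catalog import load_catalog

catalog = load_catalog("local")
table = catalog.load_table("logs.access_logs")

# Iceberg (the table format) reports which Parquet files (the file format)
# currently make up the table, along with per-file metadata it tracks
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)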

Core Architecture

Iceberg’s table layout is structured much like a Docker container image:

Docker Image                    │  Iceberg Table
├── manifest.json              │  ├── metadata.json
├── config.json                │  ├── version-hint.text
└── layers/ (tar.gz)           │  └── data/ (parquet files)

Both use:

  • Layered architecture with file reuse
  • Immutable artifacts
  • Metadata-driven assembly
  • Content addressing
  • Incremental updates
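
PyIceberg exposes this layering directly. A short sketch (same assumed catalog and table as above) that walks the snapshot log, where each entry is an immutable, metadata-described version of the table:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("local")
table = catalog.load_table("logs.access_logs")

# Each snapshot points at a manifest list, which points at manifests,
# which point at data files: layered, immutable, metadata-driven
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.manifest_list)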

Why Object Storage + Iceberg?

The combination provides:

From Object Storage:

  • Massive scalability
  • Decoupled compute and storage
  • Multi-engine access
  • Cloud-native design

From Iceberg:

  • ACID transactions
  • Schema evolution
  • Time travel
  • Snapshot isolation
  • Performance optimizations
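
As a preview of the MinIO demo, here is a sketch of wiring PyIceberg to S3-compatible storage. The catalog backend, endpoint, credentials, and warehouse path are illustrative placeholders, not this repository's actual configuration:

from pyiceberg.catalog import load_catalog

# Sketch only: a SQLite-backed catalog with data files in S3-compatible
# storage. Endpoint and credentials are MinIO-style defaults, not real secrets.
catalog = load_catalog(
    "minio",
    **{
        "type": "sql",
        "uri": "sqlite:///iceberg_catalog.db",
        "warehouse": "s3://warehouse/",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "minioadmin",
        "s3.secret-access-key": "minioadmin",
    },
)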

Real-World Use Cases

Data Lake Modernization

Transform existing data lakes into reliable, ACID-compliant systems without vendor lock-in.

Financial Data

Handle complex audit requirements with immutable snapshots and complete change history.

IoT and Time-Series

Efficiently manage high-volume sensor data with automatic file organization.

Data Science Workflows

Enable reproducible analysis with time travel queries and schema evolution.

Project Structure

learn-iceberg-python/
├── iceberg-etl-demo/       # Basic Iceberg operations
│   ├── src/
│   │   ├── generate_logs.py
│   │   ├── 01_create_table.py
│   │   ├── 02_load_data.py
│   │   └── 03_time_travel.py
│   └── data/
├── iceberg-minio-demo/     # MinIO integration
│   ├── docker-compose.yml
│   ├── src/
│   └── config/
└── docs/                   # Additional documentation

Common Operations

Creating Tables

from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestampType

# Every Iceberg column carries a unique field ID
schema = Schema(
    NestedField(1, "timestamp", TimestampType(), required=False),
    NestedField(2, "ip_address", StringType(), required=False),
)

catalog = load_catalog("local")
table = catalog.create_table(
    "logs.access_logs",
    schema=schema,
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
    ),
)
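
Note that Iceberg tracks columns by these field IDs rather than by name or position, which is what makes schema evolution (shown next) safe for existing queries.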

Schema Evolution

from pyiceberg.types import StringType

# Add a new column without breaking existing queries
table.update_schema().add_column(
    "user_agent", StringType(), doc="User agent string"
).commit()

Time Travel

# Query the table as of its previous snapshot (second-to-last history entry)
snapshot_id = table.history()[-2].snapshot_id
df = table.scan(snapshot_id=snapshot_id).to_pandas()

Troubleshooting

Common Issues

Import errors: Ensure you’re using Python 3.12+ and have run uv sync

MinIO connection: Check that Docker is running and that MinIO's ports are not already in use

Performance: Use partition pruning and file statistics for optimization (see the sketch below)
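
For the performance point above, a sketch of a pruned scan (same assumed catalog and table as earlier): filtering on the partition column lets Iceberg skip whole data files using the min/max statistics in its manifests.

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("local")
table = catalog.load_table("logs.access_logs")

# Filtering on the day-partitioned timestamp column prunes whole data
# files from the scan before any Parquet is read
df = table.scan(
    row_filter=GreaterThanOrEqual("timestamp", "2024-01-01T00:00:00")
).to_pandas()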

Next Steps

After completing this lab, you’ll be ready to:

  • Design data lake architectures with Iceberg
  • Migrate existing data lakes to Iceberg
  • Build real-time data pipelines
  • Implement data governance with time travel

Resources

Support