Apache Iceberg with Python
A comprehensive hands-on learning repository for Apache Iceberg, designed for data engineers and professionals working with modern data lake architectures.
Overview
Apache Iceberg is a table format for large analytical datasets. Think of it as a specification for organizing collections of files in object storage so that they behave like a proper database table, with ACID transactions, schema evolution, and time travel capabilities.
Key Learning Objectives
- Understand table formats vs file formats (Iceberg vs Parquet)
- Master ACID transactions in data lakes
- Implement schema evolution without breaking existing queries
- Use time travel for data versioning and rollback
- Integrate with MinIO object storage for production patterns
Prerequisites
- Python 3.12+
- Basic SQL knowledge
- Understanding of data formats (CSV, JSON, Parquet)
- Familiarity with object storage concepts
Quick Start
# Clone the repository
git clone https://github.com/hardwaylabs/learn-iceberg-python.git
cd learn-iceberg-python
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Start with the ETL demo
cd iceberg-etl-demo
uv sync
# Generate sample data and run first tutorial
uv run src/generate_logs.py
uv run src/01_create_table.py
Learning Path
1. ETL Demo
Focus: Fundamentals of Iceberg table operations
- Create your first Iceberg table
- Load data from CSV to Parquet
- Schema evolution in practice
- Time travel queries
- CLI tools for exploration
Time commitment: 2-4 hours
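As a preview of the load step above, writes in PyIceberg go through Arrow. A minimal sketch of the core write path (file and table names are illustrative; the repo's scripts wrap this):
import pyarrow.csv as pv
from pyiceberg.catalog import load_catalog
# Read the generated CSV into an Arrow table, then append it;
# PyIceberg writes Parquet data files and commits a new snapshot
# (the Arrow schema must match the Iceberg table schema)
arrow_table = pv.read_csv("data/access_logs.csv")
table = load_catalog("local").load_table("logs.access_logs")
table.append(arrow_table)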
2. MinIO Integration
Focus: Production deployment patterns
- Connect Iceberg to MinIO object storage
- S3-compatible configuration
- Local development vs production patterns
- Performance optimization
- Monitoring and debugging
Time commitment: 4-6 hours
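To make the S3-compatible configuration concrete, here is a minimal sketch of a PyIceberg catalog pointed at a local MinIO instance. The endpoint, bucket, credentials, and SQLite catalog backend are illustrative defaults, not necessarily this repo's exact setup:
from pyiceberg.catalog import load_catalog
# SQLite-backed catalog; table data and metadata live in a MinIO bucket
catalog = load_catalog(
    "minio_demo",
    type="sql",
    uri="sqlite:///catalog.db",
    warehouse="s3://warehouse/",
    **{
        "s3.endpoint": "http://localhost:9000",  # MinIO's default API port
        "s3.access-key-id": "minioadmin",        # illustrative credentials
        "s3.secret-access-key": "minioadmin",
    },
)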
3. Real-time Streaming (Coming Soon)
Focus: Modern data pipeline architectures
- Stream processing with Iceberg
- Late-arriving data handling
- Exactly-once semantics
- Integration with Kafka/Kinesis
4. Analytics Workbench (Coming Soon)
Focus: Multi-engine data analysis
- Query same data with DuckDB, Spark, Trino
- Performance comparisons
- Query optimization techniques
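Even before that lab lands, PyIceberg can hand a table's current snapshot straight to DuckDB; a small sketch (catalog and table names follow the examples below):
from pyiceberg.catalog import load_catalog
# Materialize the current snapshot and register it as a DuckDB view
table = load_catalog("local").load_table("logs.access_logs")
con = table.scan().to_duckdb(table_name="access_logs")
print(con.execute("SELECT COUNT(*) FROM access_logs").fetchone())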
Key Concepts
Iceberg vs Parquet
Common confusion: Are they competing formats?
Reality: They work at different layers:
- Parquet = File format (how individual files store data)
- Iceberg = Table format (how collections of files are organized)
Think of it this way: Parquet is the individual books; Iceberg is the library's catalog system.
Core Architecture
Iceberg’s architecture is similar to Docker container images:
Docker Image │ Iceberg Table
├── manifest.json │ ├── metadata.json
├── config.json │ ├── version-hint.text
└── layers/ (tar.gz) │ └── data/ (parquet files)
Both use:
- Layered architecture with file reuse
- Immutable artifacts
- Metadata-driven assembly
- Content addressing
- Incremental updates
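This layering is visible directly from PyIceberg (a sketch, assuming the table created in the tutorials):
from pyiceberg.catalog import load_catalog
# The current metadata.json pins the table state; each snapshot points
# at a manifest list, which in turn points at the data files
table = load_catalog("local").load_table("logs.access_logs")
print(table.metadata_location)
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.manifest_list)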
Why Object Storage + Iceberg?
The combination provides:
From Object Storage:
- Massive scalability
- Decoupled compute and storage
- Multi-engine access
- Cloud-native design
From Iceberg:
- ACID transactions
- Schema evolution
- Time travel
- Snapshot isolation
- Performance optimizations
Real-World Use Cases
Data Lake Modernization
Transform existing data lakes into reliable, ACID-compliant systems without vendor lock-in.
Financial Data
Handle complex audit requirements with immutable snapshots and complete change history.
IoT and Time-Series
Efficiently manage high-volume sensor data with hidden partitioning and metadata-driven file pruning.
Data Science Workflows
Enable reproducible analysis with time travel queries and schema evolution.
Project Structure
learn-iceberg-python/
├── iceberg-etl-demo/ # Basic Iceberg operations
│ ├── src/
│ │ ├── generate_logs.py
│ │ ├── 01_create_table.py
│ │ ├── 02_load_data.py
│ │ └── 03_time_travel.py
│ └── data/
├── iceberg-minio-demo/ # MinIO integration
│ ├── docker-compose.yml
│ ├── src/
│ └── config/
└── docs/ # Additional documentation
Common Operations
Creating Tables
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestampType

schema = Schema(
    NestedField(field_id=1, name="timestamp", field_type=TimestampType(), required=True),
    NestedField(field_id=2, name="message", field_type=StringType(), required=False),
)
catalog = load_catalog("local")
table = catalog.create_table(
    "logs.access_logs",
    schema=schema,
    # Partition by day of the "timestamp" column (source_id=1 above)
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
    ),
)
Schema Evolution
from pyiceberg.types import StringType

# Add a new column without breaking existing queries
table.update_schema().add_column(
    "user_agent", StringType(), doc="User agent string"
).commit()
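Other evolution operations use the same builder, which also works as a context manager that commits on exit; for example (column name is illustrative):
# Rename is a metadata-only change; existing data files are untouched
with table.update_schema() as update:
    update.rename_column("user_agent", "agent")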
Time Travel
# Query the table as of an earlier snapshot
# (here the one before the latest; assumes at least two snapshots exist)
snapshot_id = table.history()[-2].snapshot_id
df = table.scan(snapshot_id=snapshot_id).to_pandas()
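To see which snapshots exist before picking one, print the snapshot log:
# Each history entry records a snapshot id and when it became current
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)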
Troubleshooting
Common Issues
Import errors: Ensure you're using Python 3.12+ and have run uv sync in the demo directory
MinIO connection: Check that Docker is running and the MinIO ports are not already in use
Performance: Use partition pruning and file statistics, as sketched below
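For the performance point, the key is to express filters in the scan itself so Iceberg can skip whole partitions and files; a sketch using the tutorial table:
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual
# Filters passed to scan() are checked against partition values and
# file-level column statistics, so non-matching files are never opened
table = load_catalog("local").load_table("logs.access_logs")
scan = table.scan(row_filter=GreaterThanOrEqual("timestamp", "2024-01-01T00:00:00"))
df = scan.to_pandas()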
Next Steps
After completing these labs, you'll be ready to:
- Design data lake architectures with Iceberg
- Migrate existing data lakes to Iceberg
- Build real-time data pipelines
- Implement data governance with time travel
Resources
- Apache Iceberg documentation: https://iceberg.apache.org/
- PyIceberg documentation: https://py.iceberg.apache.org/
Support
- Open issues on GitHub
- Join the Apache Iceberg Slack community