Iceberg 101: Working with Iceberg Tables
TLDR: Apache Iceberg is an open table format: a logical table whose underlying data is stored in columnar files on cloud object storage. When working with Iceberg tables, follow best practices such as choosing the right partitioning scheme, compacting small files, managing data retention, and managing schema evolution.
What is an Iceberg table?
An Apache Iceberg table is a logical table that references columnar data stored in a cloud object store like Amazon S3, alongside the metadata that describes it. The underlying data files are stored in a columnar format (Parquet, ORC) and organized according to a partitioning scheme defined in the table metadata. The metadata layer tracks the files in a table along with schemas, partitions, and other table properties. It starts with manifest files, which list data files along with each file’s partition data. Manifest files are in turn tracked by a manifest list file, one per snapshot, which is referenced by a table metadata file that maintains the table’s state across multiple versions, or snapshots.
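To make this layering concrete, here is a minimal sketch that uses PySpark to inspect the metadata tables Iceberg exposes alongside every table. It assumes a Spark session configured with the Iceberg runtime and a catalog named `demo`; the catalog and the table `db.events` are illustrative names, not part of any real deployment.

```python
# A minimal sketch of inspecting Iceberg's metadata layer with PySpark.
# Assumes Spark is configured with an Iceberg catalog named "demo" and
# that a table demo.db.events already exists -- both names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-metadata").getOrCreate()

# Snapshots: each one is a complete, versioned view of the table's state.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots").show()

# Manifests: each tracks a group of data files plus their partition ranges.
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# Data files: the individual Parquet/ORC files, with per-file partition data.
spark.sql("SELECT file_path, partition, record_count FROM demo.db.events.files").show()
```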
To learn more, read our previous articles on the Iceberg table format and the differences between Iceberg and Parquet.
Working with Iceberg Tables: Create, Read, and Update Tables
Creating Iceberg tables from raw data can be done using various tools and frameworks, such as Apache Spark, Apache Flink, or Upsolver’s data ingestion platform. These tools let you read data from various sources, apply any necessary transformations, and write the processed data into Iceberg tables. Upsolver offers both a zero-ETL approach that creates Iceberg tables directly from raw data sources like S3, Kafka, or databases, handling schema evolution and data partitioning automatically, and managed transformations for when they are needed.
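For a sense of what this looks like in code, here is a minimal Spark sketch that creates an Iceberg table and appends raw data to it. The schema, table name, and S3 path are all illustrative.

```python
# Sketch: creating an Iceberg table with Spark SQL and appending raw data.
# Reuses the `spark` session from the earlier sketch; all names are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id   BIGINT,
        user_id    BIGINT,
        event_type STRING,
        ts         TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))  -- hidden partitioning via a transform
""")

# Read raw JSON from object storage (hypothetical path) and append it.
# In practice the raw columns must line up with the table schema.
raw = spark.read.json("s3://my-bucket/raw/events/")
raw.writeTo("demo.db.events").append()
```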
Integrating Iceberg tables into your existing data lake or data warehouse flows is straightforward. Iceberg tables can be easily queried using various engines like Trino, Presto, Spark, Dremio, and cloud data warehouses like Snowflake and BigQuery. This allows you to leverage Iceberg’s performance optimizations and schema evolution capabilities while using the query engines you’re already familiar with. You can create ETL pipelines that read from Iceberg tables, perform transformations, and write the results back to Iceberg or other destinations, depending on your use case.
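As an illustration of such a pipeline, the sketch below reads from one Iceberg table, aggregates, and writes the result to another; the table names continue the illustrative `demo` catalog from the sketches above.

```python
# Sketch: a small ETL step reading from and writing back to Iceberg.
from pyspark.sql import functions as F

events = spark.table("demo.db.events")

daily_counts = (
    events
    .groupBy(F.to_date("ts").alias("day"), "event_type")
    .agg(F.count("*").alias("event_count"))
)

# createOrReplace() swaps in the new result atomically on each run.
daily_counts.writeTo("demo.db.daily_event_counts").createOrReplace()
```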
Updating and modifying Iceberg tables is supported through tools such as Upsolver and Spark SQL, which allow you to perform row-level updates and deletes efficiently, thanks to Iceberg’s merge-on-read capabilities. You can also evolve the schema of an Iceberg table over time by adding, removing, or renaming columns, without rewriting the entire table. Iceberg handles schema evolution by maintaining a history of schema changes in its metadata layer, making it possible to query across different schema versions.
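In Spark SQL (with Iceberg’s SQL extensions enabled), the row-level and schema operations look roughly like this. The `updates` view and all column names are illustrative:

```python
# Sketch: row-level upserts via MERGE INTO, plus schema evolution.
# Requires Iceberg's Spark SQL extensions; names are illustrative.
spark.sql("""
    MERGE INTO demo.db.events AS t
    USING updates AS u          -- a temp view holding changed rows
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Schema evolution is a metadata-only change: no data files are rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN event_type TO action")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN user_id")
```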
Best Practices for Working with Iceberg Tables
- Partitioning: Choosing the right partitioning scheme is crucial for query performance. Iceberg supports hidden partitioning: partition values are derived from column transforms defined in the table metadata, so queries simply filter on the source columns and Iceberg prunes partitions automatically, without partition columns ever appearing in the query. This also enables more flexible partition evolution than the explicit, Hive-style partitioning of traditional tables. When designing your partitioning scheme, consider the most common query patterns and partition by columns frequently used in filters to minimize the amount of data scanned (see the first sketch after this list).
- Compaction and optimization: Over time, Iceberg tables can accumulate many small files, especially when ingesting streaming data, which degrades query performance. To mitigate this, regularly compact your Iceberg tables, either with Upsolver’s built-in compaction or with Iceberg’s own rewrite procedure (sketched after this list). Compaction merges small files into larger ones, improving scan performance and reducing storage costs. Additionally, consider sorting the data within each partition by frequently queried columns to enable efficient range scans.
- Data retention: Managing data retention is essential for keeping storage costs under control and complying with data governance policies. Iceberg provides snapshot expiration and time travel capabilities, letting you delete old snapshots while maintaining a configurable history of table versions (sketched after this list). Implement a retention policy that aligns with your business requirements and automate snapshot expiration using tools like Upsolver or Iceberg’s own API.
- Schema evolution and query engine support: Iceberg enables schema evolution, allowing you to add, remove, or rename columns without the need to rewrite the entire table. However, it’s crucial to ensure that your query engines support Iceberg’s schema evolution features. Some query engines might have limited support for certain schema changes or require additional configuration. Test your schema evolution workflows with the query engines you plan to use and keep their Iceberg support in mind when designing your data architecture.
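On partitioning, the sketch below shows hidden partitioning in action: a query that filters on the raw timestamp column and lets Iceberg prune day partitions, followed by a partition-spec change that takes effect without rewriting existing data. It continues the illustrative `demo.db.events` table from the sketches above.

```python
# Sketch: hidden partitioning and partition evolution.
# The query filters on ts directly; Iceberg maps the filter onto the
# days(ts) partition transform and prunes partitions automatically.
spark.sql("""
    SELECT action, COUNT(*) AS n
    FROM demo.db.events
    WHERE ts >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY action
""").show()

# Partition evolution is a metadata-only change: existing files keep the
# old spec, and new writes use the new one.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD bucket(16, event_id)")
```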
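On compaction, this sketch uses Iceberg’s built-in `rewrite_data_files` procedure (as exposed through the Spark SQL extensions) rather than Upsolver’s managed compaction; the sort strategy also clusters rows within files for efficient range scans. The target file size and sort column are illustrative.

```python
# Sketch: compacting small files and sorting within partitions.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table      => 'db.events',
        strategy   => 'sort',
        sort_order => 'ts',
        options    => map('target-file-size-bytes', '536870912')  -- ~512 MB
    )
""")
```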
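On retention, this sketch expires snapshots older than a cutoff while keeping a minimum number of recent ones, then shows the kind of time travel read that remains possible for snapshots that have not yet expired. The cutoff and the time travel timestamp are placeholders.

```python
# Sketch: expiring old snapshots. Snapshots committed before `older_than`
# are removed, but at least `retain_last` recent snapshots are kept.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table       => 'db.events',
        older_than  => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")

# Until a snapshot expires, time travel can still read it.
spark.sql("""
    SELECT * FROM demo.db.events
    TIMESTAMP AS OF '2024-03-01 00:00:00'  -- placeholder point in time
""").show()
```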
Accelerate Your Iceberg Table Management with Upsolver
Upsolver simplifies the process of creating, managing, and optimizing Iceberg tables. With Upsolver, you can easily ingest data from various sources, such as S3, Kafka, or databases, and create Iceberg tables with just a few clicks. Upsolver automatically handles schema evolution, data partitioning, and small file compaction, ensuring optimal performance and storage efficiency.
You can use our open source Iceberg Table Analyzer to scan your tables and identify areas for improvement, such as repartitioning, sorting, or compaction. Additionally, Upsolver offers a fully managed Iceberg optimization service that continuously monitors and optimizes your tables based on best practices and your specific workload patterns. Using these automation capabilities, you can focus on deriving insights from your data while the platform takes care of the underlying Iceberg table management complexities.
>> Get a free trial of Upsolver