Iceberg 101: What is the Iceberg Table Format?

TLDR: The Apache Iceberg table format is a management layer over data files in cloud storage. Iceberg tracks table schema, partitioning, and file-level metadata to enable warehouse-like features such as schema evolution, hidden partitioning, time travel, ACID transactions, and query optimization on top of data lakes.


Choosing the right storage format is crucial for optimizing performance, cost, and flexibility when working with cloud data. While file formats like Parquet and Avro have been popular choices for storing data in data lakes, in recent years a new category has emerged to provide more management capabilities on top: table formats. Among these, Apache Iceberg has been gaining significant adoption. So what exactly is Iceberg and why does it matter? Let’s dive in.

What is a table format? How is it different from a file format?

File formats like Apache Parquet define how data is serialized and stored within individual files, focusing on storage efficiency and read performance. In contrast, table formats like Iceberg provide a management layer on top of file formats. They define how a logical table is mapped across many physical data files.

You can think of a table format as providing table semantics similar to a database's, but applied to files in cheap object storage. Table formats track schema, partitioning, and file-level metadata to optimize access and management of the underlying data files. Critically though, table formats themselves are not query engines. Rather, query engines leverage table formats to provide more optimized and feature-rich access to the data.

What is the Iceberg table format?

Apache Iceberg is an open source, high-performance table format designed for huge analytic tables. Iceberg tracks data in a table in two levels. First, a central metadata store tracks the table schema and partitioning. Second, Iceberg tracks every data file in a table, along with file-level stats and partition information. This detailed metadata powers Iceberg’s advanced features.
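To make the two levels concrete, here is a minimal conceptual sketch in plain Python. This is illustrative only: real Iceberg stores table metadata as JSON files and file-level entries in Avro manifest files, and the field names below are invented for the example.

```python
# Conceptual sketch of Iceberg's two metadata levels (illustrative only;
# real Iceberg uses JSON metadata files and Avro manifests, not dicts).

# Level 1: table-level metadata -- schema and partitioning.
table_metadata = {
    "schema": {"id": "long", "event_time": "timestamp", "country": "string"},
    "partition_spec": [{"source": "event_time", "transform": "day"}],
}

# Level 2: an entry per data file, with partition values and column stats.
manifest = [
    {
        "path": "s3://bucket/tbl/data/00000.parquet",
        "partition": {"event_time_day": "2024-01-01"},
        "record_count": 1_000_000,
        "stats": {"id": {"min": 1, "max": 1_000_000}},
    },
    {
        "path": "s3://bucket/tbl/data/00001.parquet",
        "partition": {"event_time_day": "2024-01-02"},
        "record_count": 800_000,
        "stats": {"id": {"min": 1_000_001, "max": 1_800_000}},
    },
]

# A query engine consults both levels before touching any data file,
# e.g. answering COUNT(*) from metadata alone.
total_rows = sum(f["record_count"] for f in manifest)
print(total_rows)  # 1800000
```

Because every file is tracked with its stats, many questions can be answered from metadata without scanning data at all.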

Iceberg is most commonly used to implement a “lakehouse” architecture. Lakehouses combine the key features of data warehouses, like ACID transactions and SQL queries, with the cost effectiveness, flexibility and scale of data lakes. Iceberg provides the metadata layer to enable warehouse-like semantics on top of data lake storage.

Apache Iceberg vs Alternatives

Iceberg is not the only table format vying for this space. Other contenders include Delta Lake, open sourced by Databricks, and Apache Hudi, originally developed at Uber. Many query engines have also implemented proprietary table formats. However, Iceberg’s open source approach and rapidly growing ecosystem have made it a leading standard.

How Iceberg Tables Work

Apache Iceberg tables are logical tables that reference columnar data stored in cloud object stores like Amazon S3, along with associated metadata. The underlying data is stored in columnar formats such as Parquet or ORC, organized according to a partitioning scheme defined in the table metadata. Iceberg uses a sophisticated metadata layer to track the files in a table, including schemas, partitions, and other properties.
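The "partitioning scheme defined in the table metadata" works through partition transforms: the partition value is derived from a source column rather than stored as an extra column. The following pure-Python sketch mimics the idea behind Iceberg's day() transform; the function and row layout are illustrative, not Iceberg's implementation.

```python
from datetime import datetime, timezone

# Illustrative sketch of a day() partition transform: the partition value
# is computed from a source column, so it need not be stored in the data.
def day_transform(ts: datetime) -> str:
    """Map a timestamp to its partition value (UTC date)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")

rows = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)},
    {"id": 2, "event_time": datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)},
]

# Route each row to its partition purely from table configuration.
partitions: dict[str, list[int]] = {}
for row in rows:
    partitions.setdefault(day_transform(row["event_time"]), []).append(row["id"])

print(partitions)  # {'2024-01-01': [1], '2024-01-02': [2]}
```

Because the transform lives in table metadata, changing how a table is partitioned (partition evolution) does not require rewriting existing data files.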

Iceberg tables can be created, read, and updated using various tools and frameworks such as Apache Spark, Apache Flink, or Upsolver’s data ingestion platform. These tables can be easily queried using engines like Trino, Presto, Spark, Dremio, and cloud data warehouses like Snowflake and BigQuery, allowing users to leverage Iceberg’s performance optimizations and schema evolution capabilities.

Learn more about working with Iceberg tables.

Benefits of the Iceberg Table Format

  • Open source and open standards promote flexibility and interoperability. Iceberg is supported by an expanding ecosystem of tools while avoiding vendor lock-in.
  • Support for schema evolution makes it easy to change the table schema while maintaining compatibility with existing data. Iceberg handles schema mapping and stores full schema history. 
  • Hidden partitioning automatically maps data to partition values based on the table configuration. Partition values don’t need to be stored in the data files themselves. This simplifies data layout and enables easy partition evolution.
  • Time travel allows querying historical table snapshots and rolling back tables. Iceberg’s metadata store tracks every version of a table, allowing recreation of the table at any point in time.
  • ACID transactions provide atomic, isolated table updates and deletes. Iceberg uses an optimistic concurrency model to implement transactions across concurrent engines.
  • Partition pruning and file-level stats dramatically reduce the amount of data scanned per query. Iceberg tracks min/max stats per column for each file, allowing filtering of unnecessary partitions and files.
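To see why file-level stats matter for the last point, here is a small illustrative pruning routine (not Iceberg's actual planner): given per-file min/max stats for a column, an engine can skip any file whose value range cannot possibly match the query predicate.

```python
# Illustrative min/max file pruning (not Iceberg's implementation):
# skip files whose stats prove they cannot contain matching rows.
files = [
    {"path": "data/a.parquet", "min_id": 1,    "max_id": 500},
    {"path": "data/b.parquet", "min_id": 501,  "max_id": 1000},
    {"path": "data/c.parquet", "min_id": 1001, "max_id": 1500},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi]."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] <= hi]

# Query: WHERE id BETWEEN 600 AND 700 -- only one file needs scanning.
print(prune(files, 600, 700))  # ['data/b.parquet']
```

With stats for every column of every file in the metadata layer, this kind of filtering happens at query planning time, before any data is read.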

Managing Iceberg Tables with Upsolver

While Iceberg provides a powerful foundation, managing Iceberg tables at scale still requires significant operational overhead. This is where Upsolver comes in. Upsolver is a fully managed data lake platform that automates the creation and optimization of Iceberg tables.

With Upsolver, you can easily ingest streaming and/or batch data from any source and land it in auto-optimized Iceberg tables. Upsolver automatically handles table creation, schema updates, partitioning, compaction, and more. It also provides a visual SQL interface for transforming and querying Iceberg tables without any infrastructure to manage.

In addition, Upsolver offers an Iceberg Table Optimizer which can analyze and optimize any existing Iceberg table, even those not created by Upsolver. The optimizer compacts small files, optimizes file sizes, and cleans up stale metadata to keep query performance high and costs low. There’s also a lightweight, open source diagnostics tool you can start using instantly.

By combining the power of Iceberg with Upsolver’s automation and management capabilities, organizations can implement a highly optimized lakehouse architecture with minimal operational overhead. This allows data teams to focus on deriving insights rather than managing infrastructure.

>> Try Upsolver for free

Published in: Blog, Cloud Architecture
Upsolver Team

Upsolver enables any data engineer to build continuous SQL data pipelines for cloud data lakes. Our team of expert solution architects is always available to chat about your next data project. Get in touch.
