Getting Started with Apache Iceberg on Amazon S3

Apache Iceberg is an open-source table format that’s revolutionizing the way organizations store and manage data on object stores like Amazon S3. Iceberg brings database-like features to large-scale data storage, making it an ideal solution for modern data architectures. When combined with Amazon S3, Iceberg creates a powerful foundation for building scalable, flexible, and cost-effective data lakes and lakehouses, enabling organizations to handle massive amounts of data while maintaining performance and data integrity.

This article covers the basics of working with Iceberg on S3, including the various tools and integrations you need to be aware of, along with some best practices for ensuring high performance.


Why Iceberg and Amazon S3?

Amazon S3 is the core building block of modern scalable architectures such as the data lakehouse. It provides virtually unlimited storage capacity, high durability, and low-latency access to data. Its ability to handle massive amounts of unstructured and structured data, coupled with its integration with various AWS services, makes it an ideal foundation for building flexible and scalable data solutions.

Storing data in the Iceberg table format offers multiple added benefits compared to storing plain Parquet files on S3, including the following (a short code sketch after this list illustrates a couple of them):

  • ACID transactions: Iceberg ensures data consistency and atomicity, which is not guaranteed with plain Parquet files on S3.
  • Schema evolution: Iceberg allows for easy changes to table schemas without requiring data migration, unlike traditional Parquet storage.
  • Partition evolution: Iceberg supports changing partition schemes without data rewrites, offering more flexibility than static partitioning in S3.
  • Time travel and rollback: Iceberg maintains snapshots, enabling point-in-time queries and easy rollbacks, features not available with plain S3 storage.
  • Improved query performance: Iceberg’s metadata management and partition pruning capabilities can significantly speed up queries compared to scanning raw Parquet files.
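
To make a couple of these features concrete, here is a minimal PySpark sketch of schema evolution and time travel. It assumes a SparkSession (`spark`) already configured with the Iceberg runtime and SQL extensions (a full catalog configuration example appears later in this article); the catalog, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: schema evolution and time travel on an Iceberg table.
# Assumes `spark` is a SparkSession configured with an Iceberg catalog named
# "my_catalog" and the Iceberg SQL extensions; names below are hypothetical.

# Schema evolution: add a column without rewriting any existing data files.
spark.sql("ALTER TABLE my_catalog.db.events ADD COLUMNS (source_ip string)")

# Time travel (Spark 3.3+ syntax): query the table as of an earlier point in time.
spark.sql("""
    SELECT * FROM my_catalog.db.events
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()

# Roll back to a previous snapshot by ID (snapshot IDs are listed in the
# my_catalog.db.events.snapshots metadata table).
spark.sql("CALL my_catalog.system.rollback_to_snapshot('db.events', 1234567890123)")
```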

An example scenario: using Iceberg on S3 for AI on security data

Let’s imagine we’re a cybersecurity company that uses AI to detect threat patterns in log data. We ingest terabytes of logs from various sources, including network devices, servers, and applications, and need to query this data across multiple tools, including Amazon Athena for ad-hoc analysis, Snowflake for customer-facing dashboards, Amazon SageMaker for ML model training, and Amazon Bedrock for generative AI workloads.

Iceberg and S3 would offer numerous advantages compared to alternative architectures, such as storing all the data in a high-performance data warehouse:

  • Building on the near-infinite scalability of S3, the company can ingest more security logs without worrying about storage costs
  • As new types of log data emerge, Iceberg’s support for schema evolution will allow the company to incorporate these sources without disrupting existing workflows
  • Data will be easily accessible by the various cloud services they are using for AI and analytics, without having to replicate the data

Managing Your Iceberg Tables and Underlying Data

Iceberg tables on S3 consist of data files, usually stored in the Apache Parquet format, containing the actual table data; and metadata files including manifest lists, manifest files, and a metadata file, which track the state and structure of the table. The metadata layer allows Iceberg to efficiently manage large datasets, enabling features like partition pruning and fast queries without scanning all data files.
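
To see this layout for yourself, here is a minimal sketch that lists the objects behind an Iceberg table with boto3; the bucket and prefix are hypothetical, and the comments show the rough shape of a typical table directory.

```python
import boto3

# Hypothetical bucket/prefix for an Iceberg table; adjust to your environment.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-data-lake", Prefix="warehouse/db/events/")

for obj in resp.get("Contents", []):
    print(obj["Key"])

# A typical Iceberg table layout on S3 looks roughly like this:
#   warehouse/db/events/data/...                    <- Parquet data files
#   warehouse/db/events/metadata/*.metadata.json    <- table metadata files
#   warehouse/db/events/metadata/snap-*.avro        <- manifest lists
#   warehouse/db/events/metadata/*.avro             <- manifest files
```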

Managing Iceberg tables involves maintaining the health, performance, and efficiency of your data storage and access patterns. Effective table management is crucial for maintaining query performance, controlling storage costs, and enabling seamless data access for various analytics and AI workloads. This is achieved via tools that offer data catalog integrations, with two main options being:

  • AWS Glue-managed tables: AWS Glue provides native support for Apache Iceberg, allowing you to create, manage, and query Iceberg tables entirely within the tightly integrated AWS ecosystem, with the AWS Glue Data Catalog acting as the metadata repository (see the configuration sketch below).
  • Snowflake-managed tables: Snowflake supports external Iceberg tables, allowing you to query and manage Iceberg data stored in Amazon S3 directly from Snowflake. In this pattern, Amazon S3 stores the Iceberg data files and metadata, while Snowflake manages table metadata and provides query capabilities. Snowflake External Tables can be used to reference and query Iceberg data stored in S3.

These patterns are covered in more depth on the AWS blog.
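
As an illustration of the Glue-managed pattern, here is a minimal sketch of a Spark session configured to use the AWS Glue Data Catalog as its Iceberg catalog, with data and metadata on S3. The warehouse path, database, table, and column names are placeholders, and it assumes the Iceberg Spark runtime and AWS bundle jars are on the classpath with appropriate IAM permissions.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Iceberg tables managed via the AWS Glue Data Catalog,
# stored on S3. Paths and names below are hypothetical.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-data-lake/warehouse/")
    .getOrCreate()
)

# Create an Iceberg table with hidden (transform-based) partitioning.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.security_logs.events (
        event_time TIMESTAMP,
        source_ip  STRING,
        message    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")
```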

While Iceberg tables offer many out-of-the-box improvements over non-Iceberg storage, it’s worth noting that using Iceberg isn’t a silver bullet when it comes to data management. As data volumes grow and query patterns evolve, tables require ongoing optimization to maintain performance and control costs, including data clustering, partition management, file compaction, and snapshot expiration.
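
Iceberg ships Spark maintenance procedures that cover some of these tasks. The sketch below assumes the `spark` session and `glue` catalog from the previous example; the table name and retention timestamp are illustrative only, and the right values depend on your workloads.

```python
# Routine Iceberg maintenance via Spark procedures (illustrative values).
# Assumes the `spark` session and `glue` catalog configured earlier.

# Compact small data files into larger ones to keep scans efficient.
spark.sql("CALL glue.system.rewrite_data_files(table => 'security_logs.events')")

# Expire old snapshots (and their unreferenced files) to control storage costs.
spark.sql("""
    CALL glue.system.expire_snapshots(
        table => 'security_logs.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Clean up files no longer referenced by any snapshot.
spark.sql("CALL glue.system.remove_orphan_files(table => 'security_logs.events')")
```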

Using Upsolver’s Adaptive Optimizer to improve performance and costs

Upsolver’s Adaptive Optimizer for Apache Iceberg (https://www.upsolver.com/adaptive-optimizer) is an intelligent agent that continuously audits and optimizes Iceberg tables, addressing the challenges of manual table management and optimization.

The Adaptive Optimizer works by analyzing the characteristics of your data and query patterns, automatically implementing optimizations to improve performance and reduce costs. It handles tasks such as file compaction, data clustering, and partition management without manual intervention. This means that as your data volumes grow and query patterns change, your Iceberg tables are continuously optimized to maintain peak performance.

By leveraging Upsolver’s Adaptive Optimizer, organizations can significantly reduce the engineering effort required to maintain their Iceberg-based data lakes. It allows data teams to focus on deriving insights from their data rather than managing the underlying infrastructure. The result is faster query performance, reduced storage costs, and a more efficient data management process overall.

Learn more about Adaptive Optimizer for Apache Iceberg

Integration with Other AWS Services

To fully leverage Apache Iceberg on Amazon S3, it’s crucial to understand how it integrates with various AWS services. These integrations allow you to build robust data pipelines, perform analytics, and manage your data lakehouse effectively. Let’s look at the key AWS services that work with Iceberg:

  • AWS Glue: AWS Glue supports native Iceberg table creation, reading, and writing, and can convert existing tables to Iceberg format. Glue simplifies ETL processes, provides a centralized metadata repository, and enables serverless data integration with Iceberg.
  • Amazon Athena: Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena supports querying Iceberg tables directly, taking full advantage of Iceberg’s built-in performance optimizations (as shown in the example below).
  • Amazon EMR: Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications. EMR supports Iceberg tables through Apache Spark and Hive integrations.
  • Amazon Redshift: Amazon Redshift supports querying Apache Iceberg tables. This integration enables you to leverage Redshift’s powerful query engine while benefiting from Iceberg’s ACID compliance and schema evolution capabilities.
  • Amazon SageMaker: The SageMaker Feature Store supports creating feature groups in the offline store using the Apache Iceberg table format, allowing for efficient management of ML features. 

You can learn more about interacting with various AWS services using Iceberg’s iceberg-aws module in the official documentation.
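
As a quick illustration of the Athena integration, here is a minimal sketch that runs a query against an Iceberg table via boto3, including Iceberg time travel syntax (supported on Athena engine version 3). The database, table, column names, and results location are hypothetical placeholders.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names; assumes an Iceberg table registered in the Glue Data
# Catalog and an Athena workgroup running engine version 3.
resp = athena.start_query_execution(
    QueryString="""
        SELECT source_ip, count(*) AS events
        FROM events FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
        GROUP BY source_ip
        ORDER BY events DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "security_logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```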

Data Ingestion Strategies for Iceberg on S3

Efficient data ingestion is crucial for maintaining a high-performing and cost-effective Iceberg-based data lakehouse on Amazon S3. We’ve covered this topic in more depth elsewhere.

Here are the key things you need to consider in your Iceberg ingestion layer:

  1. Batch vs. streaming: Depending on your use case, you may need to implement batch ingestion for large, periodic data loads, or streaming ingestion for real-time data processing. Iceberg supports both approaches, but each requires different tooling and considerations (a minimal streaming sketch follows this list).
  2. File size optimization: Ingesting many small files can lead to performance issues. Implementing strategies to optimize file sizes during ingestion, such as file compaction, is crucial for maintaining query performance.
  3. Schema evolution: Iceberg’s schema evolution capabilities allow for flexible data ingestion as your data structures change over time. You will need to ensure your ingestion process can handle schema changes gracefully.
  4. Partitioning strategy: Choosing the right partitioning strategy during ingestion can significantly impact query performance. Iceberg’s hidden partitioning feature allows for more flexible partition evolution compared to traditional Hive-style partitioning.
  5. Data quality and validation: Implementing data quality checks and validation during the ingestion process helps maintain the integrity of your data lake.
  6. Metadata management: Efficient management of Iceberg’s metadata during ingestion is crucial for maintaining performance, especially for high-volume or high-frequency ingestion scenarios.
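
To make the streaming side concrete, here is a minimal sketch of Spark Structured Streaming appending into the Iceberg table created earlier. The Kafka source, topic, checkpoint location, and field mapping are hypothetical, it assumes the Kafka connector is on the classpath, and the file-size property is just one example of tuning ingestion toward healthy file sizes.

```python
# Minimal sketch: streaming ingestion into an Iceberg table with Spark
# Structured Streaming. Assumes the `spark` session and `glue` catalog from
# the earlier sketches; source and paths below are hypothetical.
from pyspark.sql.functions import col, lit

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "security-logs")
    .load()
    .select(
        col("timestamp").alias("event_time"),
        lit(None).cast("string").alias("source_ip"),  # parse real fields from the payload in practice
        col("value").cast("string").alias("message"),
    )
)

# Set an explicit target data file size (512 MB here) to help avoid the
# small-files problem on high-frequency writes.
spark.sql("""
    ALTER TABLE glue.security_logs.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/events/")
    .toTable("glue.security_logs.events")
)
query.awaitTermination()
```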

Operationalize Iceberg on S3 with Upsolver

Unlike general-purpose data movement tools or manual coding approaches, Upsolver is specifically designed for high-performance data ingestion into Iceberg tables on S3. It eliminates the need for complex engineering work, allowing data teams to focus on analytics and AI rather than managing infrastructure.

Upsolver ensures that your Iceberg-based data lakehouse on S3 remains performant, cost-effective, and scalable as your data volumes and complexity grow, all while significantly reducing the engineering effort required for maintenance and optimization.

To learn more, try Upsolver for free or request a 1:1 demo.

Published in: Blog, Cloud Architecture
Eran Levy

As an SEO expert and content writer at Upsolver, Eran brings a wealth of knowledge from his ten-year career in the data industry. Throughout his professional journey, he has held pivotal positions at Sisense, Adaptavist, and Webz.io. Eran's written work has been showcased on well-respected platforms, including Dzone, Smart Data Collective, and Amazon Web Services' big data blog. Connect with Eran on LinkedIn
