Comparing the Top Iceberg Ingestion Tools – Upsolver, ELT, Open Source

Engineering teams are increasingly interested in Iceberg, an open table format that provides high-performance infrastructure for petabyte-scale tables in the cloud. With its hybrid architecture, Iceberg delivers the performance and management benefits of a data warehouse without the vendor lock-in, combined with the flexibility and low-cost storage of a data lake. The result is a high-performance, scalable foundation for accessible and portable data.

There’s just one caveat: data ingestion. Apache Iceberg is a table format, and it ships with no native ingestion capabilities – no direct connectors and no out-of-the-box loading. This means you’ll need external tools (or APIs) to write your data into Iceberg efficiently and to continuously optimize tables for the best possible query performance.
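
To make this concrete, here is a minimal PySpark sketch of what writing into an Iceberg table looks like when you bring your own engine. The catalog name, warehouse path, and table names are illustrative placeholders, and it assumes the Iceberg Spark runtime is available to the Spark session.

```python
# Minimal sketch: appending a DataFrame to an Iceberg table with PySpark.
# Catalog, bucket, and table names below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-ingest-sketch")
    # Register an Iceberg catalog backed by object storage.
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Any DataFrame your engine produces can be written; a tiny in-memory example here.
events = spark.createDataFrame(
    [(1, "signup"), (2, "purchase")],
    ["event_id", "event_type"],
)

# createOrReplace() for the initial load; append() for subsequent incremental batches.
events.writeTo("my_catalog.db.events").createOrReplace()
```

Everything beyond this single write – deduplication, schema evolution, compaction, retention – is left to you, which is exactly the gap ingestion tools aim to close.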

(Full disclosure: Upsolver is a vendor operating in this space and we are naturally biased towards our own solution. However, we have tried to include only accurate factual information below.)

Which tools support Iceberg ingestion?

While most popular data warehouses support dozens of tools and libraries for data ingestion, the Iceberg ingestion landscape is still emerging and there are relatively few tools currently available. Some of the main ones include:

1. Upsolver – lakehouse ingestion and management

Upsolver is a data ingestion platform built for operational and streaming data, including files, event streams, and database replication using CDC. Upsolver’s adaptive event processing architecture is designed to handle the most difficult data quality, consistency, and reliability challenges that arise.

  • Industry-leading capabilities for petabyte-scale and streaming ingestion
  • Built-in file optimization, retention management, and catalog integration
  • Built-in ETL capabilities using SQL supporting multiple writers
  • Efficient compute and scalable pricing

2. ELT tools (e.g. Fivetran, Airbyte)

Fivetran and Airbyte are popular ELT tools that support hundreds of data sources and targets. However, they are designed as general-purpose data movement tools and may not be optimized for the challenges of high-volume and low-latency ingestion. 

ELT tools are designed to extract data from various sources, load it into a target system, and then transform it within that system. While they support a wide range of connectors, they may not be optimized for ingesting high-volume, low-latency, or frequently changing data sources. Transformations are performed in the data warehouse, which can significantly increase compute costs. Additionally, ELT tools rely on the data warehouse’s native capabilities for storage and management, which can lead to vendor lock-in and make them less suitable for data lake and lakehouse architectures.

3. Open source tools (Spark, Flink)

Open source big data processing frameworks like Apache Spark and Apache Flink can be used to build custom data ingestion pipelines for Iceberg. They offer flexibility and control but require significant engineering effort to set up, maintain, and scale.

You’ll need to handle various aspects like consistency, exactly-once processing, schema evolution, and optimizations yourself. While open source provides cost benefits, the engineering time and resources required can be substantial.
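
As a rough illustration of that effort, the sketch below shows a Spark Structured Streaming job appending Kafka events to an Iceberg table. The broker, topic, paths, and table names are placeholders; it assumes the Iceberg runtime and the Spark Kafka connector are on the classpath and that a catalog named my_catalog is configured as in the earlier sketch.

```python
# Sketch: a long-running Spark Structured Streaming job that appends Kafka
# events to an Iceberg table. All connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-streaming-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS event_id", "CAST(value AS STRING) AS payload")
)

# The checkpoint location is what makes the job restartable after failure;
# provisioning it, sizing the cluster, and monitoring the query are on you.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .toTable("my_catalog.db.raw_events")
)
query.awaitTermination()
```

Even this happy path omits schema evolution, late or duplicate records, and the compaction and snapshot-expiration jobs discussed later in this article.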

Comparison matrix (summary of the information that appears above)

| | Upsolver | ELT tools (Fivetran, Airbyte) | Open source tools (Spark, Flink) |
|---|---|---|---|
| Big data ingestion capabilities | Industry-leading capabilities for petabyte-scale and streaming ingestion | Limited; focus is more on the long tail of API connectors than on higher data volumes | Can support very high-throughput use cases, but requires a significant amount of engineering |
| Lake management and optimization | Built-in file optimization, retention management, catalog integration | Separate effort | Custom development |
| Advanced data transformations | Built-in ETL capabilities, supports multiple writers and open formats | Relies on external compute engines for transformations; potential for vendor lock-in | Custom development |
| Pricing and total cost of ownership | Efficient compute and scalable pricing | Expensive compute; MAR pricing that grows steeply with higher-volume / streaming use cases | Requires tuning (engineering costs) to ensure efficiency |

What to look for in an Iceberg ingestion tool

Big data ingestion capabilities

TLDR: Iceberg is typically used to store terabytes of data across thousands of tables. You need an ingestion tool that can ensure high-quality data, reliable delivery, and ease of development and management as the number of tables and their size grow.

From the user’s perspective, writing data to Apache Iceberg looks much like writing to a data warehouse such as Snowflake. Under the hood, however, you’re ingesting data into object storage, so you can’t lean on a warehouse’s built-in capabilities for data quality and transformation. This means your Iceberg ingestion tool will need to guarantee:

  • Consistency: Ingestion tools should be able to guarantee that data is accurately and reliably ingested into Iceberg tables to provide a consistent view of the data to all users querying it. This includes maintaining data integrity, enforcing transactional boundaries, and effectively handling concurrency conflicts. Your choice of tool should support ACID properties, especially in distributed and parallel processing. 
  • Reliability: Ingestion tools must be reliable, resilient, and capable of handling failures. They should include mechanisms for fault tolerance, error handling, retry strategies, and recovery processes to ensure uninterrupted data ingestion operations. High availability and uptime are essential for continuous ingestion in production. 
  • Exactly-once: Exactly-once processing semantics prevent duplicate data and inconsistencies and are crucial for maintaining reliable ingestion workflows. They ensure that each data record is processed and ingested into Iceberg tables exactly once, without duplication or loss (a minimal idempotent-write sketch follows this list).
  • Ease of use: Ingestion tools should be relatively easy to use, democratizing access for both technical and non-technical data users. Intuitive interfaces, monitoring capabilities, and features such as declarative development, automatic optimization and tuning, and data observability all enhance the usability of ingestion tools.
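
As an illustration of the exactly-once point, the sketch below uses Iceberg’s MERGE INTO from Spark SQL to make a write idempotent, so replaying the same batch updates existing rows rather than inserting duplicates. Table and column names are illustrative, and it assumes a Spark session with an Iceberg catalog and the Iceberg SQL extensions enabled, plus a target table whose schema matches the staged records.

```python
# Sketch: an idempotent upsert into an Iceberg table. Assumes `spark` is a
# session configured with an Iceberg catalog and the Iceberg SQL extensions.

# A small batch of possibly-duplicated source records, registered as a view.
updates = spark.createDataFrame(
    [(1, "signup"), (1, "signup"), (3, "refund")],
    ["event_id", "event_type"],
)
updates.dropDuplicates(["event_id"]).createOrReplaceTempView("staged_events")

# MERGE INTO makes the write idempotent: replaying this batch updates the
# matching rows instead of creating duplicates.
spark.sql("""
    MERGE INTO my_catalog.db.events AS t
    USING staged_events AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```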

How do different tools compare in this category?

  • Upsolver is built for scale and is used by many customers to move petabytes of data into AWS data lakehouses (and data lakes). Reliability, consistency, and exactly-once processing are built into every Upsolver data pipeline, and Iceberg is no exception. Upsolver is easy to use because it automatically creates target tables mapped to the data source, keeps the schema in sync during incremental changes, and handles edge cases such as renamed tables.
  • ELT tools are typically less focused on high scale use cases, as their commercial model is focused on chasing the ‘long tail’ of connectors – i.e., automating API-based connectivity for a large number of sources.
  • Open source tools are a blank slate that can, of course, be designed to support almost any scale of data ingestion. However, doing so requires significant engineering effort; neither Spark nor Flink is by any means ‘plug and play’, and the effort required to manually tune them is significant.

Data lake management and optimization

TLDR: Effective data lake management and optimization are crucial for achieving the high performance, cost efficiency, and analytics-readiness associated with Iceberg-based data lakehouses. Key capabilities include compaction, cataloging, compression, and retention policies.

Robust data lake management and optimization capabilities are just as important as big data ingestion within an Iceberg-based data lakehouse architecture. They minimize the learning curve required to create, update, and use tables, resulting in faster time to insight and improved productivity.

Examples of lake management features include:

  • Compaction: Combining small data files into larger ones to reduce metadata overhead and improve compression and scan performance. This improves query performance and optimizes storage consumption to reduce storage costs and improve data processing efficiency in Iceberg tables.
  • Cataloging and metadata management: Maintaining metadata information about data assets, schemas, partitions, and data lineage within the Iceberg table format. This facilitates data discovery, lineage tracking, and data governance within the Iceberg environment. 
  • Compression: Reducing data storage requirements, improving data transfer speeds, and optimizing query performance by reducing disk I/O. This cuts storage costs, optimizes data access, and enhances overall system performance.
  • Retention policies: Rules for managing the data lifecycle from ingestion to deletion, along with data versioning strategies. This ensures compliance with increasingly complex and rigorous data regulations and governance frameworks (a maintenance sketch follows this list).
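
For a sense of what this looks like without a managed platform, the sketch below runs two of Iceberg’s built-in Spark maintenance procedures: one for compaction and one for snapshot expiration (retention). Catalog and table names are placeholders, the cutoff timestamp is illustrative, and the procedures require the Iceberg SQL extensions; in practice you would also need to schedule and monitor these jobs for every table.

```python
# Sketch: table maintenance that an ingestion platform would otherwise schedule
# for you. Assumes `spark` has an Iceberg catalog and SQL extensions configured.

# Compaction: rewrite small data files into larger ones for faster scans.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

# Retention: expire snapshots older than a cutoff, letting Iceberg clean up
# the data files they exclusively reference.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```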

How do different tools compare in this category?

  • In Upsolver, all of this functionality is unified and automated within the platform. This greatly simplifies day-to-day lakehouse management by streamlining workflows, particularly compared to open-source Iceberg solutions, which require separate third-party tools for processing, validation, and auditing.
  • ELT tools such as Fivetran and Airbyte are focused first and foremost on ingesting into data warehouses such as Snowflake and BigQuery, and hence will offer little in terms of data lakehouse management.
  • Lakehouse management is part of the Iceberg package available for Spark and Flink, but requires deep expertise to properly configure, as well as custom development to schedule and execute optimization and management tasks efficiently.

Advanced transformations

TLDR: When choosing an Iceberg ingestion tool, consider its ability to support transformations and concurrent incremental updates. These features ensure that the tool can adapt to evolving data processing needs and enable efficient data management within the lakehouse environment.

Although a tool’s ability to initially write data into Iceberg tables is, of course, important, you also need to anticipate evolving needs and ensure that your chosen tool can seamlessly adapt to future requirements, including:

  • Transformations: Although ELT tools like Airbyte and Fivetran are popular choices for transforming data periodically (every few hours or days), near real-time or real-time data processing requires a unified batch+streaming engine.
  • Lakehouse optimization: ELT tools aren’t designed for data lakes; they depend on warehouse-native capabilities to store and manage Iceberg data. To get the full benefit from Iceberg, you need a tool that can filter and process data as it is stored in the lake, so your warehouse costs remain low. Look for features that enable efficient data filtering, processing, and optimization directly within the data lake environment, minimizing reliance on costly data warehouse operations.
  • Multiple writers and concurrent access: Support for multiple writers and concurrent access to Iceberg tables from different query engines and processing pipelines ensures that data can be ingested, updated, and queried concurrently by various teams, applications, or data workflows without bottlenecks or conflicts (a short concurrency-tuning sketch follows this list).
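
As a small illustration of the multi-writer point, Iceberg coordinates concurrent writers with optimistic concurrency, and its commit-retry behavior can be tuned through table properties. The sketch below sets two standard properties on a placeholder table; the values are illustrative, and it assumes a Spark session with an Iceberg catalog configured.

```python
# Sketch: tuning how aggressively concurrent writers retry commits that hit
# optimistic-concurrency conflicts. Table name and values are placeholders.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',
        'commit.retry.min-wait-ms' = '100'
    )
""")
```

With retries configured, a streaming ingestion job, a batch backfill, and a compaction job can commit to the same table without hard failures, although heavy contention still costs retried work.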

How do different tools compare in this category?

  • Upsolver supports batch and stream transformations, storing transient and intermediate files in the lake and automatically optimizing them for ad-hoc analytics or for further loading into data warehouses for analysis and BI.
  • ELT tools rely on external compute engines and lack built-in transformation capabilities, often requiring additional tools like dbt Core. This complicates the architecture and increases costs. 
  • Open source solutions like Spark and Flink offer flexibility but require significant engineering effort to implement and maintain transformations and lakehouse management features.

Pricing and total cost of ownership

TLDR: Tools that charge based on events or monthly active rows can become prohibitively expensive when working with big, frequently changing data.

Last (but by no means least!), pricing needs to be considered in your decision-making. Evaluating pricing between different tools and providers can be difficult because there’s often a lot of nuance that goes into pricing models; more expensive doesn’t always mean better, and vice versa.

Upsolver’s high-scale data ingestion tool can cost 76% less than Fivetran for high-volume change data capture and streaming data movement, and more than 90% less for near real-time streaming workloads. This is because Fivetran is designed and priced for low-volume data movement: as data volume and change frequency increase in Fivetran, so too does the price.

With open source, it’s the usual story – software is free, but you pay in engineering costs, long lead times and diverted resources.

Get Started with Upsolver’s Iceberg ingestion

Still unsure which solution is best for your Iceberg ingestion needs? Take Upsolver for a spin and see how it works out for you! Start with a free trial or get a guided tour from our solution architects.
