The Iceberg Data Lakehouse Stack: Choose the Right Building Blocks 

Data warehousing was the go-to method for data analysis and reporting for a long time. In recent years, however, we’ve seen the emergence of a new type of open data management architecture: the data lakehouse.

The Data Lakehouse: Some Assembly Required

The data lakehouse represents a new paradigm in data management, combining the performance and structure of data warehouses with the flexibility and cost-efficiency of data lakes. By leveraging the open source Iceberg table format, the Iceberg lakehouse enables data teams to work with petabyte-scale datasets across multiple analytics engines, while enjoying features like ACID transactions, schema evolution, and partition evolution.
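
To give a flavor of what schema and partition evolution look like in practice, here is a minimal sketch using PySpark and Iceberg’s Spark SQL extensions. It assumes the Iceberg Spark runtime is available and a catalog is registered under the placeholder name "lakehouse"; the table and column names are also placeholders.

```python
# Sketch: Iceberg schema and partition evolution via Spark SQL.
# Assumes the Iceberg Spark runtime and SQL extensions are on the classpath
# and a catalog is registered under the placeholder name "lakehouse".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-evolution-demo").getOrCreate()

# Create a table partitioned by day.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMN region STRING")

# Partition evolution: switch new data to monthly partitioning; data written
# under the previous spec stays valid and queryable.
spark.sql("ALTER TABLE lakehouse.sales.orders DROP PARTITION FIELD days(order_ts)")
spark.sql("ALTER TABLE lakehouse.sales.orders ADD PARTITION FIELD months(order_ts)")
```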

Teams that wish to deploy their own data lakehouse will have to assemble it themselves rather than rely on a single vendor (as might be the case with a cloud data warehouse). In many senses this is a feature and not a bug, as it allows organizations to build on open architectures and best-in-class tooling while avoiding vendor lock-in.

There are two options for doing so: 

  1. DIY the process and custom build a lakehouse from the ground up.
  2. Use various third-party tools together to accelerate the time to deployment.

Although the first option sounds ideal on paper, it cannot be implemented quickly; plan for six months at a minimum. It will also be cost-prohibitive for all but the largest companies with established data and platform engineering teams. 

The second option is still a relatively involved process, but it becomes much easier when you reduce the number of tools, third-party or open source, that you implement within your lakehouse ecosystem. You should, therefore, look for and prioritize solutions that cover several bases rather than trying to meld together many different components.

Let’s take a look at the key, non-negotiable building blocks that are needed for any lakehouse deployment and explore some of the tools that you might consider implementing to cover them. 

Essential Elements of a Data Lakehouse

[Figure: Lakehouse architecture and tools diagram]

Data Sources

Your data sources are foundational to your lakehouse deployment because they provide the raw material for analytics, reporting, and other data-driven activities. It’s therefore important that you consider the type(s) of sources you will bring to your lakehouse. 

  • Batch Data: This data is collected and processed in predefined batches at scheduled intervals. It can come from various sources, such as flat files, log files, and data warehouses. Batch data is ideal for loading large volumes of historical data or data delivered from partners where real-time processing isn’t required. 
  • Streaming Data: This data is continuously collected, ingested, and processed in real-time or near real-time. It can come from sources such as IoT devices, online platforms, sensors, clickstream data, and real-time monitoring systems. Streaming data is important for applications that require rapid decision-making or immediate insights.
  • Databases: Relational and non-relational databases hold business-critical application data that can be mixed with other data in the lakehouse for improved analytics and AI. SQL databases, MongoDB, Redis, and others can be continuously ingested into the lakehouse and presented as tables for users to query and analyze.
  • Files: This data includes the various data formats stored on disk or in the cloud that can be structured or semi-structured. A ubiquitous form of data storage and exchange, file formats such as CSV, JSON, Parquet and XML can be used for archiving, data interchange between systems, data backups, and serving as inputs for batch processing pipelines.

In your data lakehouse, these data sources are integrated and ingested for storage and processing. By feeding into this architecture, your data sources ensure a continuous flow of data to power various applications, analytics and AI. 
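
As a rough illustration of how batch and streaming sources land in the lakehouse, the following sketch uses PySpark to append a partner-delivered file drop to one Iceberg table and continuously ingest a clickstream topic into another. The paths, Kafka topic, and table names are placeholders, and the target tables are assumed to already exist.

```python
# Sketch: landing a batch file drop and a streaming source in Iceberg tables.
# Paths, the Kafka topic, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-landing-demo").getOrCreate()

# Batch source: load a partner-delivered CSV drop and append it to a raw
# Iceberg table (table assumed to exist).
batch_df = spark.read.option("header", "true").csv("s3://landing/partner_feed/")
batch_df.writeTo("lakehouse.raw.partner_feed").append()

# Streaming source: continuously ingest clickstream events from Kafka.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)
query = (
    stream_df.selectExpr("CAST(key AS STRING) AS key",
                         "CAST(value AS STRING) AS value",
                         "timestamp")
    .writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://checkpoints/clickstream/")
    .toTable("lakehouse.raw.clickstream")
)
query.awaitTermination()
```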

Data Ingestion

Author’s note: we’ve covered this topic in more detail in our guide to Iceberg ingestion.

Iceberg, being a table format, lacks native data ingestion capabilities. This means that data teams need to look for external tools or APIs to write data into Iceberg efficiently and continuously optimize tables for the best possible query performance.

Tools for data ingestion into Iceberg include:

  • Upsolver: A fully-managed data ingestion platform built for operational and streaming data, including files, event streams, and database replication using change data capture (CDC). Upsolver offers industry-leading capabilities for petabyte-scale and streaming ingestion, built-in file optimization, retention management, catalog integration, and ETL using SQL.
  • ELT Tools (Fivetran, Airbyte): Popular ELT tools that support a wide range of data sources and targets. However, they are designed as general-purpose data movement tools and may not be optimized for the challenges of high-volume and low-latency ingestion. ELT tools rely on the data warehouse’s native capabilities for storage and management, which can lead to vendor lock-in and make them less suitable for lakehouse architectures.
  • Open Source Tools (Spark, Flink): Big data processing frameworks that can be used to build custom data ingestion pipelines for Iceberg (see the sketch after this list). While they offer flexibility and control, they require significant engineering effort to set up, maintain, and scale. Data teams need to handle various aspects like consistency, exactly-once processing, schema evolution, and optimizations themselves.
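
To give a sense of what the do-it-yourself route involves, here is a hedged sketch of a CDC apply step written directly in Spark SQL. The staging and target table names are placeholders, and concerns like checkpointing, deduplication, and retries are left to the pipeline author.

```python
# Sketch: a hand-rolled CDC apply step in Spark. Table names and the staged
# change feed are placeholders; dedup, checkpointing, and retries are up to
# the pipeline author.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-apply-demo").getOrCreate()

# A micro-batch of change events (inserts, updates, deletes) that the
# pipeline previously staged from a source database.
spark.table("lakehouse.staging.orders_changes").createOrReplaceTempView("changes")

# Apply the changes to the target Iceberg table.
spark.sql("""
    MERGE INTO lakehouse.curated.orders AS t
    USING changes AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op != 'delete' THEN INSERT *
""")
```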

When selecting an Iceberg ingestion tool, data teams should consider factors such as big data ingestion capabilities, lake management and optimization features, support for advanced transformations, and pricing models. Upsolver stands out in these areas, offering scalable ingestion, automated lake management, unified batch and streaming transformations, and cost-effective pricing compared to alternatives like ELT tools.

Transformations and ETL (Extract, Transform, Load) 

Data transformation is an important step in lakehouse workflows because the lakehouse itself has no inherent compute capabilities. Unlike traditional databases, which typically include built-in processing, a data lakehouse relies on separate compute engines or destination databases to execute transformations. This is because, in a data lakehouse environment, the bulk of the focus is (or should be) on the efficient storage of raw and curated data. 

If data transformation were tightly coupled with the lakehouse storage, compute-intensive operations such as data cleansing, normalization, and aggregation would erode the efficiency gains that the lakehouse enables, especially when dealing with large-scale datasets or real-time streaming data. Separating data transformation from the lakehouse storage itself makes it much easier to achieve scalable and efficient processing without overloading the storage layer. 

Tools for data lakehouse transformation management include:

  • Upsolver: A powerful and scalable cloud-managed stream and batch processing engine that supports SQL-based or visual transformations. This makes it easier for data teams to perform complex transformations without the need for extensive coding.
  • Apache Spark: An open-source tool that offers robust transformation capabilities but requires a significant amount of coding input along with separate infrastructure management for deployment and scaling. 
  • AWS Glue / Databricks: AWS and Databricks offer managed Spark environments that streamline workflows and reduce infrastructure overhead. However, complexities related to Spark programming and orchestration still exist. 
  • Dremio: A data lakehouse query solution that simplifies data transformation processes required to prepare data for analysis through the Dremio query engine. 

By implementing external compute engines like Spark or managed transformation tools like Upsolver, data teams can scale and optimize their data transformations while maintaining data quality. 
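
As a simple illustration of this decoupling, the sketch below runs a cleansing and aggregation step on an external Spark engine, reading one Iceberg table and writing a curated one. The table and column names are placeholders.

```python
# Sketch: a transformation step that runs on an external compute engine
# (Spark here) and reads/writes Iceberg tables, keeping compute decoupled
# from lakehouse storage. Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

# Read raw events, cleanse and aggregate, then write a curated table.
raw = spark.table("lakehouse.raw.clickstream")
daily_activity = (
    raw.filter(F.col("user_id").isNotNull())          # basic cleansing
    .withColumn("event_date", F.to_date("event_ts"))  # normalization
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("events"))                # aggregation
)
daily_activity.writeTo("lakehouse.curated.daily_activity").createOrReplace()
```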

Lake Management

Robust lake management practices help data teams ensure that their lakehouse environment delivers the best performance, optimized costs, and general readiness for high-volume data analytics tasks. 

In terms of performance optimization, a well-managed lakehouse organizes data efficiently and reduces the time and resources needed for discovery and access. Proper organization through structured folders, metadata tagging, and indexing enhances query performance and enables data teams to seamlessly work with high volumes of structured and unstructured data. 

In addition, compacting data files and partitioning data on relevant attributes such as date or region reduce I/O overhead and improve read/write performance. Both are lake management best practices and can have a significant impact on cost, analytics readiness, and query performance by reducing processing times.
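
For a concrete sense of what compaction and snapshot housekeeping involve, here is a minimal sketch using Iceberg’s built-in Spark stored procedures. The catalog and table names are placeholders, and the tools listed later in this section automate this kind of scheduling for you.

```python
# Sketch: routine Iceberg table maintenance via Spark stored procedures,
# compacting small files and expiring old snapshots. Catalog and table names
# are placeholders; the Iceberg Spark runtime must be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-maintenance-demo").getOrCreate()

# Compact small files into larger ones to cut I/O overhead on reads.
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'raw.clickstream',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire snapshots older than the retention window to control storage cost.
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'raw.clickstream',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```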

Proper lake management also lends itself to data analytics readiness. Data quality checks, validation rules, and cleansing all help to ensure data accuracy and consistency for meaningful analytics and decision-making. Alongside this, metadata and tag management aid discovery, access control, and compliance, making data more usable for analytics use cases. 

The various workflows and processes that form part of day-to-day lake management necessitate the use of third-party tools that can do all the above and more for you.

  • Upsolver: A best-in-class stream and batch processing engine that handles the ingestion, storage, and optimization of small and large Iceberg tables, including CDC from production databases. Designed to scale, Upsolver is the best option for teams that want a fully automated, intelligent solution for maintaining a high-performance, cost-efficient, and compliant lakehouse.
  • Tabular: Tabular, now part of Databricks, is built around Apache Iceberg and provides a simple-to-use unified catalog for Iceberg tables. In addition, Tabular performs basic optimization and maintenance tasks for the Iceberg tables it creates. 
  • Apache Spark / Hive: Apache Spark and Hive include Iceberg lake management functionality but require significant manual effort and expertise to properly configure and schedule maintenance tasks for the desired performance and cost savings. 

Data Catalogs

Data catalogs sit at the core of good data governance and are essential for data sharing and self-service. According to Gartner, demand for data catalogs is soaring as data teams struggle with finding, inventorying, and analyzing burgeoning data assets. 

A data catalog is essentially a centralized repository for metadata such as data schemas, lineage, and quality metrics that acts as an inventory of data assets across all sources. It helps users discover, understand, and consume data more productively while breaking down barriers to lakehouse adoption.

With a data catalog in place, data users can benefit from a unified view of diverse data sources, including structured and unstructured data, batch data, and streaming data, enabling them to easily find relevant data assets for a variety of use cases. Data catalogs can also be used to track lineage and show how data moves through the various stages of ingestion, transformation, and consumption in the lakehouse environment. 
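
As an illustration, the sketch below uses the pyiceberg client to connect to an Iceberg REST-style catalog, browse namespaces and tables, and inspect a table’s schema and current snapshot. The catalog URI, credentials, and table name are placeholders.

```python
# Sketch: discovering tables through an Iceberg REST catalog with pyiceberg.
# The catalog URI, credential, and table names are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "credential": "client_id:client_secret",  # placeholder credential
    },
)

# Browse the catalog: namespaces, their tables, then one table's schema and
# its current snapshot (useful for lineage and audit questions).
for namespace in catalog.list_namespaces():
    print(namespace, catalog.list_tables(namespace))

table = catalog.load_table("curated.daily_activity")
print(table.schema())
print(table.current_snapshot())
```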

Tools for implementing a data catalog in your lakehouse include:

  • Glue Data Catalog (AWS): Glue Data Catalog is a fully managed metadata repository and catalog service provided by AWS Glue. Part of the wider AWS product suite, it can integrate with various AWS services, data sources, and data processing tools for discovering, managing, and governing data assets within AWS environments. 
  • Unity Catalog (Databricks): Unity Catalog is a metadata management tool provided by Databricks that has been designed to work with the Databricks Unified Analytics Platform. Unity Catalog provides users with a unified overview of data and ML assets and supports lineage tracking and collaboration. Unity Catalog has recently been released as an open source project.
  • Polaris (Snowflake): Polaris is a new open source project from Snowflake that provides an Iceberg-compatible catalog, enabling producers and consumers of data to collaborate on and share Iceberg tables.

Analytics Engines

Analytics engines enable the rapid querying and processing of large data volumes. Designed to handle distributed computing and parallel processing, analytics engines optimize queries by using processes like query planning, indexing, and data partitioning. The result is faster query response times and improved overall performance.

In addition, some analytics engines support advanced analytics functionalities such as window functions, aggregations, joins, and statistical functions, enabling complex data analysis and reporting. In most cases, analytics engines integrate seamlessly with data lakes, thereby eliminating the need to move data into a separate warehouse. 
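
To make this concrete, here is the kind of windowed aggregation an analytics engine can run directly against an Iceberg table. Spark SQL is used here purely as a stand-in for the engines listed below, and the table and column names are placeholders.

```python
# Sketch: an analytical query with an aggregation and a window function run
# directly against an Iceberg table. Spark SQL is a stand-in here; Presto,
# Athena, and others accept very similar ANSI SQL. Names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

result = spark.sql("""
    SELECT
        region,
        DATE(order_ts)  AS order_date,
        SUM(amount)     AS revenue,
        RANK() OVER (
            PARTITION BY DATE(order_ts)
            ORDER BY SUM(amount) DESC
        )               AS region_rank
    FROM lakehouse.sales.orders
    GROUP BY region, DATE(order_ts)
""")
result.show()
```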

Examples of analytics engines include:

  • Presto: An open-source distributed SQL query engine developed by Facebook, Presto is designed for high-performance queries over various data sources, including data lakes and databases. Features include support for ANSI SQL, a connector ecosystem, and distributed query execution. 
  • Athena: Amazon Athena is a serverless interactive query service that allows users to query data stored in Amazon S3 using standard SQL. Features include a serverless architecture and tight integration with the broader suite of AWS services, including AWS Glue. 
  • ClickHouse: ClickHouse is an open-source columnar database management system for real-time analytics and big data processing. Features include column-oriented storage, real-time processing, and a distributed architecture across multiple nodes for horizontal scalability. 

Strengthen Your Lakehouse Foundations with Upsolver

Data lakehouses are quickly becoming the go-to solution for data teams that want to combine the tried-and-tested functionality of data warehouses with the reduced costs and scalability of data lakes. 

With such a high level of demand for what is still an emerging solution, there’s no shortage of third-party tools and solutions available that data teams can implement into their own lakehouse deployments. 

Although there’s nothing wrong with adopting a broad spectrum of tools to meet your lakehouse use case, keep in mind that less is often more, and you should look to implement dedicated lakehouse solutions like Upsolver that check multiple boxes. 

Doing so not only makes your lakehouse environment easier to manage in the long term, but also mitigates the risk of feature bloat and can reduce operational costs.

If you want to learn more about how the data lakehouse is transforming data analytics, check out our ultimate guide.

Want to see how your Iceberg tables stack up? Sign up (for free), add your catalog and see the results.
