Upsolver August 2024 Feature Summary
You’ll notice from the list of links below that our engineering team has well and truly smashed it out of the park this month! While we continue adding new features to support emerging technologies and ensure we have your technical requirements taken care of, we are always looking for ways to improve the Upsolver UI, simplify data ingestion processes, and ensure your lakehouse operates at optimal performance. And of course, the team is committed to crushing every bug as soon as they can.
Ok, I know I’m biased, but this month has been exceptional. We added support for the newly released Polaris Catalog from Snowflake, so you can now use Polaris as a target connection for your ingestion jobs. If, like the rest of us, you’re still trying to get your head around Iceberg and the multitude of catalogs now available, then we have some awesome resources to help. Our August webinar, Getting Started with Snowflake Polaris and Iceberg Tables, will build your knowledge and guide you through the first steps of working with these new technologies. Also, our Iceberg Academy continues to grow, and here you’ll find practical, in-depth training material that is delivered clearly and concisely.
Furthermore, we added the unified observability platform, Dynatrace, as an output for your monitoring jobs. We already offer built-in monitoring and observability in Upsolver; however, you can also roll your own jobs to extract the exact stats you want from our system tables and send them to Dynatrace. Our system catalog exposes a wealth of metadata for you to query at will, and I highly recommend you take a look at the documentation to see what is available – even if you don’t use third-party monitoring, the metadata will help you detect and mitigate issues.
We have also made great strides in enhancing jobs that ingest from database CDC sources. Replication jobs are a fantastic way to direct data from a single source into multiple destinations, and there are a few exciting (if you like that sort of thing as much as I do!) developments to be discovered below.
Phew, after all this, even if our engineers don’t need a vacation, I know I do! But, before I pour myself a mojito, can I recommend you check out Chill Data Summit on Tour? CDS is about to head out on the road and is coming to a city near you. This time around, the format is a little different, with a morning of keynotes and talks, followed by hands-on training to build your first Apache Iceberg lakehouse.
The first stop is San Francisco Bay Area on August 15, and the afternoon training will be led by none other than Holden Karau! Tickets are going fast, so don’t hesitate to sign up because it’s going to be awesome!! In the meantime, have a great summer, and I’ll be back in September with another update. Now it’s mojito time! Cheers!
This update includes the following releases:
Contents
- UI
- Jobs
- NEW! Snowflake Polaris Catalog Now Supported
- NEW! Dynatrace Added as Monitoring Output
- Table Retention Column Types Updated for Apache Iceberg Tables
- Fixed a Bug Causing Redundant Columns in Schema Evolution
- Apache Iceberg Dangling Files Issue Fixed
- Number of Catalogs Requests Reduced to Prevent Errors
- Fixed Internal Files Being Deleted Before Being Used
- Bug Fixed to Exclude System Columns
- Fixed Issue When Using System Columns in Ingestion Jobs
- Bug Resolved for Primary Key Numeric Type
- Fixed Scheduling Bug when Compacting Partitions
- Fixed Issue When Ingesting Large CSV Files from S3 to Iceberg
- Compaction Preventing Reading from Hive Tables Fixed
- Bug Fixed Where Data Retention Failed in Old Partitions
- Fixed Issue with Incorrect Job Statistics When Writing to Iceberg
- Join Expressions Issue Affecting Job Monitoring is Fixed
- CDC Jobs
- SQL Queries
- Enhancements
- Table Refresh Interval Added to Apache Iceberg Mirror Tables
- Support Added for Iceberg REST Catalog Connections and Tables
- Performance Improvements Made to High-Load Clusters
- Partition Evolution Supported for Iceberg Tables
- Sorting Evolution is Now Supported for Iceberg Tables
- Create Cluster Using VPC Connection Property
- Support Credential Vending from Iceberg REST Catalogs
- Reduced Overhead of Task Discovery
- Support Added for Creating a Cluster with a Startup Script
UI
Job Monitoring Page Redesigned for Better Execution Tracking
The job monitoring page has been thoughtfully redesigned, allowing you to track job status by execution, with each execution representing a data interval being processed.
In the example below, the job extracts data from Apache Kafka every minute, but writes to the target Snowflake database every six hours:
By splitting out each write operation into an individual, trackable row in the reporting table, you can now view the metrics for each job execution at a finer granularity. It is much easier to monitor the volume of data discovered at each execution interval, and to check that the number of messages discovered matches the number of rows parsed. Alerts are also displayed at the interval level, enabling you to troubleshoot problems much more quickly than before.
Configure Snapshot Parallelism for CDC Jobs from Wizard
Our zero-ETL ingestion tool makes it quick and simple to ingest data from source to target. For CDC sources, we support ingesting data from SQL Server, MongoDB, MySQL, and PostgreSQL in just a few easy-to-follow steps. Prior to this update, you needed to manually code an ingestion job to configure snapshot parallelism beyond the default settings.
Now, with this release, you can configure snapshot parallelism from within our data ingestion wizard, helping you to experience the best possible performance without needing to write a line of code.
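If you do prefer to define your jobs in SQL, the same behavior can be expressed with the SNAPSHOT_PARALLELISM job option. The following is a minimal sketch, assuming a PostgreSQL CDC source; the connection, database, and table names are placeholders, the option value is purely illustrative, and other source-specific options may be required but are omitted here:
CREATE SYNC JOB ingest_postgres_cdc
COMMENT = 'Ingest CDC data with a parallel initial snapshot'
SNAPSHOT_PARALLELISM = 4 -- illustrative: snapshot up to 4 tables in parallel
AS COPY FROM POSTGRES my_postgres_connection
INTO default_glue_catalog.my_database.orders_raw;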
Jobs
NEW! Snowflake Polaris Catalog Now Supported
Snowflake’s Polaris Catalog enables open, secure, lakehouse architectures with broad read-and-write interoperability and cross-engine access controls. Polaris Catalog is open source under the Apache 2.0 license and now available on GitHub, while Snowflake’s managed service for Polaris Catalog is now available in public preview.
I’m delighted to announce that you can now configure Polaris Catalog as your default Iceberg Lakehouse catalog, and ingest data from databases, streams, and files into your Iceberg lake with only a few clicks. You can of course take advantage of our Adaptive Optimizer to maximize Lakehouse query performance and cost savings.
Learn more about configuring Polaris Catalog with Upsolver.
NEW! Dynatrace Added as Monitoring Output
I’m excited to announce that we now support Dynatrace as a monitoring output, along with Amazon CloudWatch and Datadog.
This means you can create jobs in Upsolver that send metrics to your chosen reporting tool to monitor your Upsolver pipelines from a centralized dashboard.
In the example below, we create a job named send_monitoring_data_to_dynatrace that sends metrics from the Upsolver system.monitoring.jobs table to Dynatrace.
CREATE JOB send_monitoring_data_to_dynatrace
START_FROM = NOW
AS INSERT INTO my_dynatrace_connection
MAP_COLUMNS_BY_NAME
SELECT job_id AS tags.id,
job_name AS tags.job_name,
discovered_files_lifetime AS files,
avg_file_size_today AS avg_file_size,
avg_rows_scanned_per_execution_today AS avg_rows_scanned,
cluster_id AS tags.cluster_id,
cluster_name AS tags.cluster_name,
RUN_START_TIME() AS time
FROM system.monitoring.jobs;
Learn how to connect to your Dynatrace account, and create a job to continuously send metrics.
Table Retention Column Types Updated for Apache Iceberg Tables
When creating an Upsolver-managed Iceberg table – Upsolver manages the compaction and tuning of the data and metadata files for you – you can include table options to specify the column used for determining retention, alongside the number of days to retain the data.
Since we initially released this feature, we have expanded the table option to enable you to set the data retention based on any column of types date, timestamp, timestampz, long, or integer.
If you have already implemented data retention in your Iceberg tables, please note that we have also changed the configuration syntax. Please see the update below, which includes a code example. Alternatively, learn how easy it is to create an Iceberg table with a defined data retention.
Fixed a Bug Causing Redundant Columns in Schema Evolution
We have fixed a bug occurring in jobs writing to Amazon Redshift and Snowflake targets that created redundant columns when the schema evolved. If the job used SELECT * in the query, mapping a field to a different type, such as a field of type timestamp to a column of type timestamptz, would produce a redundant column. Customers should no longer experience this unwanted addition.
Apache Iceberg Dangling Files Issue Fixed
During Iceberg operations, files may become unreferenced, or what is known as “dangling”. If these files are left to accumulate, they can lead to increased storage costs. Upsolver performs a daily clean-up of any dangling files to help maintain a tidy and cost-effective storage environment.
A problem was recently discovered with these dangling files, which has now been fixed by moving the backup folder location into the table root directory.
Read more about Optimization Processes for Iceberg Tables in Upsolver.
Number of Catalogs Requests Reduced to Prevent Errors
We have reduced the number of requests to the catalog to prevent rate-exceeded errors when managing a large number of tables.
Fixed Internal Files Being Deleted Before Being Used
We fixed a bug that caused internal files to be deleted before they were used, resulting in stalled jobs.
Bug Fixed to Exclude System Columns
The bug that prevented system columns from being referenced in the EXCLUDE_COLUMNS property in ingestion jobs has been fixed.
Fixed Issue When Using System Columns in Ingestion Jobs
Sometimes it is necessary to transform data before it lands in your target, perhaps because data should be masked for privacy purposes or requires shaping before it reaches the destination. The COLUMN_TRANSFORMATIONS job option provides a flexible solution for transforming your in-flight data, but we recently encountered a bug that prevented system columns from being used with this option. I’m delighted to say that we have now fixed this issue.
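For context, here is a minimal sketch of how the option is typically shaped in an ingestion job; the connection, bucket, table, and column names are placeholders, and the system column shown is hypothetical, purely to illustrate that system columns can now be referenced:
CREATE SYNC JOB ingest_orders_masked
COLUMN_TRANSFORMATIONS = (
  customer_email = MD5(customer_email), -- mask PII before it lands in the target
  loaded_at = $commit_time              -- hypothetical system column mapped into a regular column
)
AS COPY FROM S3 my_s3_connection
LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.my_database.orders_masked;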
Bug Resolved for Primary Key Numeric Type
A bug occurred when the primary key was a numeric type and the table was large enough to require chunk splitting. This has now been resolved.
Fixed Scheduling Bug when Compacting Partitions
We fixed a bug where it was possible to schedule compactions for partitions in tables without a primary key, even after the retention period had elapsed, leading to errors.
Fixed Issue When Ingesting Large CSV Files from S3 to Iceberg
We fixed an issue that occurred in jobs ingesting large CSV files from Amazon S3 into Iceberg tables.
Compaction Preventing Reading from Hive Tables Fixed
We fixed an issue in jobs reading data from Hive tables, where data loss occurred because compacted files in the source table were committed at the wrong time.
Bug Fixed Where Data Retention Failed in Old Partitions
In rare conditions in tables with a data retention policy, background jobs failed to delete some of the files in old partitions. This is now fixed.
Fixed Issue with Incorrect Job Statistics When Writing to Iceberg
In jobs writing to Iceberg with an interval longer than 1 minute, job statistics were not being displayed in the monitoring page. This issue has been addressed and the statistics are now visible.
Join Expressions Issue Affecting Job Monitoring is Fixed
The job monitoring page was also affected by a bug in jobs writing to Iceberg tables that included a JOIN expression. This is now resolved.
CDC Jobs
Support Added to Re-Snapshot MongoDB Collections
If you have replication jobs reading changes (CDC) from a MongoDB database, you’ll be pleased to hear we now support the ability to re-snapshot a collection.
Replication jobs are a powerful solution for ingesting data from a CDC source to multiple targets. Each target, defined in a replication group, functions independently of the other groups within the job, despite sharing a dataset.
If you want to ingest your sales data from an operational, CDC-enabled database, you can send one set of data to your production target to be used by stakeholders for reporting, and the other set can be copied to a development target, to be used in a raw format by your data engineers.
Let’s look at an example of how this works. In the example below, we create a replication job named replicate_mongodb_to_snowflake, that outputs to two targets, each defined within its own replication group: replicate_to_snowflake_prod and replicate_to_snowflake_dev. Each group is configured to support a different use case, but both read from the orders schema.
CREATE REPLICATION JOB replicate_mongodb_to_snowflake
  COMMENT = 'Replicate MongoDB CDC data to Snowflake groups'
  COMPUTE_CLUSTER = "Default Compute (Free)"
  INTERMEDIATE_STORAGE_CONNECTION = s3_connection
  INTERMEDIATE_STORAGE_LOCATION = 's3://upsolver-integration-tests/test/'
FROM my_mongodb_connection
WITH REPLICATION GROUP replicate_to_snowflake_prod
  INCLUDED_TABLES_REGEX = ('orders\..*')
  EXCLUDED_COLUMNS_REGEX = ('.*\.creditcard') -- exclude creditcard columns
  COMMIT_INTERVAL = 5 MINUTES
  LOGICAL_DELETE_COLUMN = "is_deleted"
  REPLICATION_TARGET = my_snowflake_connection
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS'
  TARGET_TABLE_NAME_EXPRESSION = $table_name
  WRITE_MODE = MERGE
WITH REPLICATION GROUP replicate_to_snowflake_dev
  INCLUDED_TABLES_REGEX = ('orders\..*')
  COMMIT_INTERVAL = 1 HOUR
  REPLICATION_TARGET = my_snowflake_connection
  TARGET_TABLE_NAME_EXPRESSION = 'history_' || $table_name
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS_DEV'
  WRITE_MODE = APPEND;
You can then use the RESNAPSHOT COLLECTION command to re-take a snapshot of the orders_collection that is used by the replicate_mongodb_to_snowflake job.
ALTER JOB replicate_mongodb_to_snowflake
RESNAPSHOT COLLECTION orders_collection;
For more information, please read the article Performing Snapshots, which explains how the process works in Upsolver. This article is also applicable if you are replicating data from SQL Server, MySQL, or PostgreSQL.
Configure Primary Key Column in Replication Groups
As mentioned above, replication jobs utilize special replication groups to copy a single dataset to multiple locations, each using its own set of configurations. We have added a new property, PRIMARY_KEY_COLUMN, to replication groups, enabling you to configure the primary key column within each replication group, as shown in the sketch below.
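Here is a minimal sketch of the new property in context, reusing the replication group structure from the example above; the column name order_id and the other values are purely illustrative:
WITH REPLICATION GROUP replicate_to_snowflake_prod
  INCLUDED_TABLES_REGEX = ('orders\..*')
  PRIMARY_KEY_COLUMN = 'order_id' -- illustrative: the column used as the primary key for this group
  REPLICATION_TARGET = my_snowflake_connection
  WRITE_MODE = MERGE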
Fixed a Bug in CDC to Redshift Jobs Creating Columns of Incorrect Type
We fixed a bug occurring in jobs ingesting CDC data to Amazon Redshift, which caused the creation of columns with incorrect types.
Fixed the Bug Creating Additional Column in Redshift Target
We also fixed the bug that caused an additional column to be created when writing a field of type timestamp to a column of type timestampTZ in Amazon Redshift.
SQL Queries
Data Retention Syntax Properties Changed for Iceberg Tables
If you are using the retention options in your Upsolver-managed Apache Iceberg tables, please be aware that we have changed the configuration properties. As you may have read above in the Jobs section, since we first released the retention option, we have expanded the supported column types to include date, timestamp, timestampz, long, or integer. To enable this expansion, we have updated the property TABLE_DATA_RETENTION to RETENTION_DURATION, and RETENTION_DATE_PARTITION to RETENTION_COLUMN.
For example, to create a table named orders_data in Apache Iceberg, with a retention of 7 DAYS, based on the order_date column, use the properties as follows:
CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_data
(
order_id string,
order_date date,
customer_email string,
net_total bigint,
num_items bigint,
customer_address string,
customer_city string,
customer_state string,
customer_zipcode bigint
)
PARTITIONED BY order_date
RETENTION_COLUMN = order_date
RETENTION_DURATION = 7 DAYS
PRIMARY KEY order_id
ORDER BY net_total DESC, order_date ASC
STORAGE_CONNECTION = s3_connection
STORAGE_LOCATION = 's3://bucket/storage_location'
COMPUTE_CLUSTER = "my cluster"
DISABLE_COMPACTION = false
iceberg."read.split.planning-lookback" = '10'
COMMENT = 'Orders table';
Learn how to create Apache Iceberg tables with Upsolver.
START_FROM Syntax Update to Jobs Ingesting from Amazon S3
The START_FROM job option instructs Upsolver as to the point in time to begin ingesting data. You can use NOW, BEGINNING, or a timestamp for a specific date and time.
We have updated the syntax for jobs ingesting from Amazon S3, so you cannot include the START_FROM option unless the DATE_PATTERN option is also specified.
Please read the documentation for in-depth help on using DATE_PATTERN.
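To illustrate the combination, here is a minimal sketch of an S3 ingestion job that sets both options; the connection, bucket path, table names, content type, and date pattern are placeholders:
CREATE SYNC JOB load_orders_from_s3
CONTENT_TYPE = JSON -- format of the source files (placeholder)
DATE_PATTERN = 'yyyy/MM/dd/HH' -- folder pattern used to resolve event time
START_FROM = BEGINNING -- only valid here because DATE_PATTERN is also specified
AS COPY FROM S3 my_s3_connection
LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.my_database.orders_raw;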
Fixed Casing Issue in Mapped Columns for MERGE Jobs
Fixed a bug in MERGE jobs where the ON condition had a different case than the mapped column in the SELECT clause.
Fixed Bug in JOIN Expression Returning NULLs
Fixed a bug in jobs writing to a table where a JOIN expression with an uppercase alias caused the joined row to return NULLs in all fields.
Fixed Multi Fields Issue in Column Transformations to Iceberg
We fixed a problem in jobs writing to Iceberg that included the COLUMN_TRANSFORMATIONS option. When a column transformation overrode a field that appeared with a different case in the data, multiple fields of different types were created.
Enhancements
Table Refresh Interval Added to Apache Iceberg Mirror Tables
We recently announced support for Iceberg Table Mirroring to Snowflake, a new feature that ensures data engineers view the freshest data when querying their Iceberg tables from Snowflake. We automated a manual process that would otherwise demand ongoing effort to update the snapshot in Snowflake every time a change is made in Iceberg.
Our table mirroring feature mirrors data in Iceberg to a table in Snowflake, so that users work with the newest data, engineers no longer need to orchestrate tasks to update the snapshot in Snowflake, and the refresh operation is only performed when changes are made, thereby eliminating unnecessary costs.
We have added a new table option that defines or alters the mirror table refresh interval. Using MIRROR_INTERVAL, you can specify how frequently Upsolver refreshes the data.
The default mirror interval value is one minute, but you can set this according to your requirements. For example, we can create a table in Snowflake that is updated hourly:
CREATE MIRROR FOR default_glue_catalog.demo.my_iceberg_table
IN my_snowflake_connection.mirror_demo_schema.mirror_iceberg_tbl
CATALOG_INTEGRATION = my_catalog_int
EXTERNAL_VOLUME = iceberg_volume
MIRROR_INTERVAL = 1 HOUR;
Follow this guide for an end-to-end walkthrough on how to Query Upsolver Iceberg Tables from Snowflake.
Support Added for Iceberg REST Catalog Connections and Tables
Users can now create Iceberg REST catalog connections and add tables.
Performance Improvements Made to High-Load Clusters
We have implemented several performance improvements to high-load clusters to ensure your experience is optimal.
Partition Evolution Supported for Iceberg Tables
We now support the ability to dynamically change or disable the partition columns of your Iceberg tables. Changing the partition does not rebuild the existing data according to the new partitioning; the new partitioning is applied only to data written from that point forward.
For example, to change the partition column of the web_orders table to order_date, simply use the following command:
ALTER TABLE web_orders PARTITION BY order_date;
To stop partitioning the web_orders table, use the following syntax:
ALTER TABLE web_orders PARTITION BY ();
Sorting Evolution is Now Supported for Iceberg Tables
We also support the ability to change or disable the sort order of data in your Iceberg tables. This enables you to dynamically change the sorting columns of an existing table; the change affects future data that is written, and new compactions will write data using the new sort order.
To change the sort column of the web_orders table from shipped_date to order_date, simply use the following command:
ALTER TABLE web_orders ORDER BY order_date;
Optionally, you can re-sort the existing data that was sorted by shipped_date, to be sorted by order_date, using the OPTIMIZE function:
OPTIMIZE TABLE web_orders REWRITE DATA;
To remove all sorting from the web_orders table, run the following:
ALTER TABLE web_orders ORDER BY ();
Learn more about altering your Iceberg tables.
Create Cluster Using VPC Connection Property
If you have multiple VPC connections in your organization, you can now use the VPC_CONNECTION property with the CREATE CLUSTER syntax to specify which VPC to use. Running this command without the option will generate an error if you have multiple VPC connections.
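As a minimal sketch, assuming an illustrative cluster name and placeholder sizing options (only the VPC_CONNECTION property and the CREATE CLUSTER command come from this update):
CREATE CLUSTER my_processing_cluster
MIN_INSTANCES = 1 -- placeholder sizing options
MAX_INSTANCES = 4
VPC_CONNECTION = my_vpc_connection; -- specifies which VPC the cluster runs in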
Support Credential Vending from Iceberg REST Catalogs
Credential vending is a mechanism that provides dynamic and secure access to data storage systems, such as cloud object stores like Amazon S3. This process is crucial in environments where security and data access control are of paramount importance. Rather than using static credentials that are hard-coded into the application or configuration files, credential vending involves dynamically generating or retrieving credentials when needed. This reduces the risk of credential leakage and ensures that access can be controlled and audited more effectively.
In Apache Iceberg, credential vending can be used to securely access data stored in external storage systems. For example, when a user or application interacts with Iceberg tables stored in a cloud object store, the Iceberg system can use credential vending to obtain the necessary credentials at runtime. This allows Iceberg to manage data securely without embedding sensitive information in the application code or configuration. With this update, Upsolver supports credential vending when connecting to Iceberg REST catalogs.
Reduced Overhead of Task Discovery
We have reduced the overhead of task discovery, especially in compute clusters with a large number of assigned entities or shards.
Support Added for Creating a Cluster with a Startup Script
You can now create a cluster that uses a startup script. We added the STARTUP_SCRIPT option, which can be used with both CREATE CLUSTER and ALTER CLUSTER.
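As a minimal sketch, assuming an illustrative cluster name and script contents (only the STARTUP_SCRIPT option and the CREATE CLUSTER and ALTER CLUSTER commands come from this update; the value format shown is an assumption):
CREATE CLUSTER my_processing_cluster
STARTUP_SCRIPT = 'sudo yum install -y jq' -- placeholder script contents run when an instance starts
MIN_INSTANCES = 1 -- placeholder sizing options
MAX_INSTANCES = 2;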
Thanks for making it through this very long list of updates, I hope you found them useful. If you have any questions, please reach out to our friendly support team, who will be happy to help.
I’ll be back in the Fall, but in the meantime, please follow Upsolver on LinkedIn for the latest news, or book a demo if you fancy a test drive.