Upsolver July 2024 Feature Summary

Welcome to the July update from us here at Upsolver! As usual, it has been another busy month, as you'll notice from the long fix list below, and we also have two big and exciting feature announcements!

First up, you can now create ingestion jobs that write directly to Amazon Redshift databases, eliminating the need to write an ingestion job to bring the data into your lake before writing a transformation job to load it to Redshift. You can of course still do this, but having a single job is very efficient and requires less overhead and maintenance. Our zero-ETL Data Ingestion Wizard supports direct ingestion to Redshift, so you can achieve data movement without touching a line of code if that is the experience you want, and it's a great way to get started before building more complex solutions.

If you have tried querying externally managed Iceberg tables from Snowflake, you will no doubt be aware of the operational overhead of keeping the data refreshed so queries return timely results. The manual process requires writing and orchestrating tasks to ensure Snowflake is pointing to the latest snapshot. I'm delighted to share that we have taken this pain away: you can now mirror your Upsolver-managed Iceberg tables to Snowflake, and we do the refresh for you.

Once again, it's hats off to our engineering team for another month of awesome work to deliver these new features alongside a load of enhancements and bug fixes. I had a sneak peek at some of the features coming up next and am happy to report the excitement doesn't let up. If you have any questions about these release notes or want to give Upsolver a whirl, please reach out to our support team or book a no-obligation demo.


UI

NEW! Data Ingestion Wizard Support for Direct Ingestion to Amazon Redshift 

Great news! We have extended the capabilities of our zero-ETL Data Ingestion Wizard to enable you to create jobs that ingest directly to Amazon Redshift. This means you don't need to write any code if you want a hands-off experience, or simply want to use the wizard as a starting point for building new jobs.

All you need to do is follow the step-by-step instructions in the wizard to connect to your source and target, and map the data between the two platforms. Check out the Quickstart guide in our documentation for further instructions.

NEW! Data Ingestion Wizard Maps Source and Target Schemas for CDC Jobs

While we’re on the topic of our Data Ingestion Wizard, we have also added the ability to map selected source schemas to target schemas when creating replication (CDC) jobs. If target schemas do not already exist, they will be created automatically, meaning the source and target schemas will be identical.

This option was previously available only via SQL, and we have improved the experience of configuring source-to-target schema mapping in the UI to make this easier to achieve.

The wizard supports ingesting data from CDC-enabled databases, including Microsoft SQL Server, MongoDB, MySQL, and PostgreSQL.

Jobs

Fixed Issue with Multipart File Uploads to Amazon S3 Not Cleaned Up

We fixed an issue where incomplete multipart upload files were not cleaned up in jobs writing to Amazon S3.

Fixed Interval Issue in Non-Sync Jobs Reading from Apache Iceberg

Non-sync jobs reading from Apache Iceberg with an interval larger than one minute were not reading all data from the source table. This issue has now been resolved. 

Fixed Interval Issue in Jobs Writing to Hive Tables

Jobs writing to Hive tables with an interval equal to or larger than seven days could become stuck unless extra configuration was performed by the Upsolver support team. This is now fixed. 

Improvements to Iceberg Schema Evolution

We made schema evolution improvements for jobs writing to Apache Iceberg, so that new fields with a suffix and type matching existing fields in the dataset are assumed to match those fields.

Fixed Bug Preventing Jobs Writing to Tables with an End Time from Being Marked Completed

Fixed a bug that prevented a job writing to a table with an end time from being marked as Completed when the end time was reached.
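
For context, the end time here refers to jobs configured with a bounded window. Below is a minimal sketch of such a job, assuming an S3 source; the connection, location, and table names are placeholders, and the START_FROM/END_AT values and option placement are illustrative only, so check the documentation for the exact syntax. Once the end time passes, the job should now be marked as Completed.

-- Hypothetical sketch of a job with a bounded time range
CREATE SYNC JOB backfill_july_orders
  COMMENT = 'Backfill a bounded date range and then complete'
  START_FROM = TIMESTAMP '2024-07-01 00:00:00'
  END_AT = TIMESTAMP '2024-08-01 00:00:00'
AS COPY FROM S3 my_s3_connection
  LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.my_database.orders_raw;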

Fixed Issue with Jobs Writing to Iceberg Not Producing Rows 

An issue was discovered whereby jobs writing to Apache Iceberg tables did not produce any rows when the job’s source included a system table. We’re happy to report this has now been fixed. 

If you have not yet explored the wealth of metadata in the Upsolver system tables, please browse our documentation to learn more about the data we expose that you can utilize in your jobs.
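
As a quick illustration, you can preview a system table from a Worksheet; the query below uses the recent compactions table referenced later in these notes:

SELECT *
FROM system.monitoring.recent_compactions
LIMIT 10;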

Data Retention Support for External Iceberg Tables

As part of our continuous support and expansion of features for Apache Iceberg, we added support for setting the data retention policy on external Iceberg tables. External Iceberg tables are tables that are managed by Upsolver (we continuously compact and tune the data files) but that are not the target of an Upsolver data ingestion job.

Discover how to create external Iceberg tables in Upsolver.
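
As a rough sketch only, and assuming the retention property for external Iceberg tables mirrors the TABLE_DATA_RETENTION option used on Upsolver-managed tables (confirm the exact property name and ALTER syntax in the documentation), setting a 30-day policy might look like this:

-- Hypothetical sketch: the property name and ALTER syntax are assumptions
ALTER TABLE my_glue_catalog.my_database.my_external_iceberg_table
  SET TABLE_DATA_RETENTION = 30 DAYS;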

Fixed the Dangling Files Issue from Failed Iceberg Compaction

We fixed an issue where dangling files were created during failed Apache Iceberg compactions.

Fixed the Shard Number in the Recent Compactions System Table

We fixed the incorrect shard number displayed in the system.monitoring.recent_compactions system table. The shard number is used for compaction.

CDC Jobs

Added Support for Truncating Columns Written to Amazon Redshift

We now support using the TRUNCATE_COLUMNS job option for jobs writing to an Amazon Redshift database. This option specifies whether VARCHAR or CHARACTER VARYING values longer than the target column should be truncated to enable the insert.
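
As a minimal sketch, the option is set at the job level; the example below assumes a Kafka source, and the connection, topic, schema, and table names are placeholders:

-- Hypothetical sketch: truncate over-length string values when inserting into Redshift
CREATE SYNC JOB load_orders_to_redshift
  COMMENT = 'Ingest orders into Redshift, truncating over-length strings'
  TRUNCATE_COLUMNS = TRUE
AS COPY FROM KAFKA my_kafka_connection
  TOPIC = 'orders'
INTO my_redshift_connection.target_schema.orders;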

Learn more about writing data to Amazon Redshift.

Replication Jobs Support COMMIT_INTERVAL Option for Amazon Redshift Targets

Last month we introduced replication jobs for Amazon Redshift targets. Replication jobs ingest data from a CDC-enabled database, such as SQL Server or PostgreSQL. These special jobs comprise one or more replication groups, each of which can be individually configured to write to a different target. This release adds a custom commit interval option, applied at the replication group level, so targets can be updated only when required.

In the example below, the data is ingested from MongoDB once, but is written to two different schemas in Redshift. The inclusion of the COMMIT_INTERVAL option within the two replication groups enables the production schema to be updated every 5 minutes, while the UAT schema is refreshed hourly:

CREATE REPLICATION JOB mongodb_replication_to_redshift
  COMMENT = 'Replicate MongoDB CDC data to Redshift groups'
  COMPUTE_CLUSTER = "Default Compute (Free)"
  INTERMEDIATE_STORAGE_CONNECTION = s3_connection
  INTERMEDIATE_STORAGE_LOCATION = 's3://upsolver-integration-tests/test/' 
FROM my_mongodb_connection 
WITH REPLICATION GROUP replicate_to_redshift_prod 
  INCLUDED_TABLES_REGEX = ('orders\..*')
  EXCLUDED_COLUMNS_REGEX = ('.*\.creditcard') -- exclude creditcard columns 
  COMMIT_INTERVAL = 5 MINUTES
  LOGICAL_DELETE_COLUMN = "is_deleted"
  REPLICATION_TARGET = my_redshift_connection
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS'
  TARGET_TABLE_NAME_EXPRESSION = $table_name
  WRITE_MODE = MERGE 
WITH REPLICATION GROUP replicate_to_redshift_uat
  INCLUDED_TABLES_REGEX = ('orders\..*')
  COMMIT_INTERVAL = 1 HOUR
  REPLICATION_TARGET = my_redshift_connection
  TARGET_TABLE_NAME_EXPRESSION = 'history_' || $table_name
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS_UAT'
  WRITE_MODE = APPEND;

Check out the documentation to learn more about creating replication jobs.

Intermediate Files for CDC Metadata Now Deleted

When ingesting CDC from a database platform, Upsolver uses intermediate files to store the metadata from the source. We have enhanced this process to correctly delete intermediate files that are no longer required.

SQL Queries

Removed $commit_time Column from System Tables 

We have removed the $commit_time column from all Upsolver system tables, so please ensure you check your queries for references to this column.

Query Iceberg Tables for Preview within Upsolver

Customers can now query Iceberg tables using Upsolver’s query engine for previewing data. Simply open a Worksheet in Upsolver to run a query against your Iceberg tables.
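
For example, a quick preview query is as simple as the following (the table name is a placeholder, reusing the orders example from later in these notes):

SELECT order_id, order_date, net_total
FROM default_glue_catalog.my_database.orders_data
LIMIT 10;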

Enhancements

NEW! Mirror Iceberg Tables to Snowflake

We have added functionality to enable you to query your Upsolver-managed Iceberg tables from Snowflake, and it's super straightforward to set up. When you create a mirror of your Iceberg table, we maintain an external Iceberg table in Snowflake, keeping it up to date by pointing to the latest snapshot of the table. The steps are as follows:

  1. In Upsolver, create a connection to your Snowflake database
  2. Create a catalog integration in Snowflake
  3. In Snowflake, configure an Amazon S3 external volume
  4. Create your Upsolver-managed Iceberg table
  5. Create a mirror table in Snowflake
  6. Use Upsolver to create an ingestion job to populate your table

Follow the full guide on How to Query Upsolver Iceberg Tables from Snowflake, or check out the SQL reference to mirror Iceberg tables.
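
To give a flavor of the Snowflake-side setup in steps 2 and 3, the sketch below uses standard Snowflake DDL; the object names, bucket, and IAM role ARN are placeholders, and your account will need its own values as described in the guide:

-- Illustrative Snowflake objects for externally managed Iceberg tables
CREATE CATALOG INTEGRATION upsolver_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;

CREATE EXTERNAL VOLUME upsolver_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'iceberg-data'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
    )
  );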

Apache Iceberg Partition Improvements

This month, we made two improvements for working with Apache Iceberg partitions. First, we added support for creating Iceberg tables with hidden partitions using partition transforms. Second, we now support setting retention on partitions of type Long/Integer, enabling retention on hidden partitions that use time truncation transforms such as year/month/day/hour.
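
As a rough sketch of the first improvement, a hidden partition might be declared with a transform as shown below; note that the transform syntax Upsolver accepts may differ from the Iceberg-style form used here, and the table and column names are placeholders:

-- Hypothetical sketch: partition by day using an Iceberg partition transform
CREATE ICEBERG TABLE default_glue_catalog.my_database.events_by_day (
   event_id string,
   event_time timestamp
)
   PARTITIONED BY day(event_time);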

Default Value Updated for External Iceberg Tables

An external Iceberg table is a table that is not the target of an Upsolver ingestion job. When you create an external Iceberg table using Upsolver, you can optionally specify that we manage the optimization for you. When you request optimization, we manage file compaction and performance tuning to ensure the table size is as small as possible to reduce storage costs, and that query scans are fast and efficient.

We have updated the default value for the OPTIMIZE property to FALSE when creating external Iceberg tables, but you can change this as required. In the example below, the statement creates an external Iceberg table that Upsolver will compact and tune for you, until you set the value to FALSE.

CREATE EXTERNAL ICEBERG TABLE
    my_glue_catalog.my_database.my_external_iceberg_table
    OPTIMIZE = TRUE;

Alternatively, simply exclude the OPTIMIZE option when creating external Iceberg tables and we won’t optimize them.

Learn more about creating and maintaining Iceberg tables in Upsolver.

Extended Options Available When Creating Iceberg Tables

A couple of crucial changes were made to the syntax for creating Upsolver-managed Iceberg tables. These are tables within Iceberg that are a destination for an Upsolver ingestion job, and are automatically optimized by us to maintain an optimal table (file) size to reduce storage costs and ensure queries are performant.

This month we extended the syntax for creating Iceberg tables to enable customers to configure any table property documented by Apache Iceberg, simply by including the iceberg prefix in the syntax, followed by the property: 

CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_data
(
   order_id string,
   order_date date,
   customer_email string,
   net_total bigint, 
   num_items bigint,
   customer_address string,
   customer_city string,
   customer_state string,
   customer_zipcode bigint
)
   PARTITIONED BY order_date
   PRIMARY KEY order_id
   ORDER BY net_total DESC, order_date ASC
   STORAGE_CONNECTION = s3_connection
   STORAGE_LOCATION = 's3://bucket/storage_location'
   COMPUTE_CLUSTER = "my cluster"
   DISABLE_COMPACTION = false
   iceberg."read.split.planning-lookback" = '10'
   COMMENT = 'Orders table';

The other change is that we removed the COLUMN_DATA_RETENTION property when creating Iceberg tables, as this property is only supported in Upsolver-managed Hive tables.

Please see our documentation to learn more about creating and maintaining Iceberg tables.

Number of Upload File Part Requests Reduced for Amazon S3

To increase job performance, we reduced the number of upload file part requests to Amazon S3 by using larger parts. 

Temporary Files Created for Ingestion Jobs Now Deleted After Use

Temporary intermediate files created by ingestion jobs are now deleted after use. This feature is being enabled gradually.

Support Added for Splitting Files Across Shards Within a Partition

Output files can now be split across shards within a partition for efficiency and performance.  

Performance Improvements Made to Iceberg Table Compaction

Finally, we have improved the compaction speed for sorted Iceberg tables, so you should feel the benefits on your Upsolver-managed tables following our extensive tuning process.


So that’s everything for July! I’ll be back in August, but before then, please follow us on LinkedIn for all the latest news and announcements, or book a demo if you have not yet given Upsolver a test drive. 

Published in: Blog, Release Notes
Rachel Horder

Rachel spent the first half of her career building websites using .NET and later shifted her focus to SQL Server, honing her expertise in data migration and integration. Driven by her enthusiasm for data and ETL processes, she co-authored the book "What's New in SQL Server 2012" and contributed to writing the official training material for Microsoft SQL Server 2016 and Power BI. In her current role as Senior Technical Writer at Upsolver, Rachel continues to combine her passions for data and writing.
