Upsolver July 2024 Feature Summary
Welcome to the July update from us here at Upsolver! As usual, it has been another busy month, as you'll notice from the long fix list below, and we also have two big and exciting feature announcements!
First up, you can now create ingestion jobs that write directly to Amazon Redshift databases, eliminating the need to write an ingestion job to bring the data into your lake before writing a transformation job to load it to Redshift. You can of course still do this, but having a single job is very efficient and requires less overhead and maintenance. Our zero-ETL Data Ingestion Wizard supports direct ingestion to Redshift, so you can achieve data movement without touching a line of code if that is the experience you want – and it's a great way to get started before building more complex solutions.
If you have tried querying externally managed Iceberg tables from Snowflake, you will no doubt be aware of the operational overhead of keeping the data refreshed so queries return timely results. The manual process requires writing and orchestrating tasks to ensure Snowflake is pointing to the latest snapshot. I'm delighted to share that we have taken this pain away: you can now mirror your Upsolver-managed Iceberg tables to Snowflake and we handle the refresh for you.
Once again, it’s hats off to our engineering team for another month of awesome work to deliver these new features alongside a load of enhancements and bug fixes. I had a sneak peek at some of the features coming up next and am happy to report the excitement doesn’t let up. If you have any questions about these release notes or want to give Upsolver a whirl, please reach out to our support team or book a no-obligation demo.
The following releases are included in this update:
Contents
- UI
  - NEW! Data Ingestion Wizard Support for Direct Ingestion to Amazon Redshift
  - NEW! Data Ingestion Wizard Maps Source and Target Schemas for CDC Jobs
- Jobs
  - Fixed Issue with Multipart File Uploads to Amazon S3 Not Cleaned Up
  - Fixed Interval Issue in Non-Sync Jobs Reading from Apache Iceberg
  - Fixed Interval Issue in Jobs Writing to Hive Tables
  - Improvements to Iceberg Schema Evolution
  - Fixed the Bug Preventing Jobs Writing to Tables with End Time Marked Completed
  - Fixed Issue with Jobs Writing to Iceberg Not Producing Rows
  - Data Retention Support for External Iceberg Tables
  - Fixed the Dangling Files Issue from Failed Iceberg Compaction
  - Fixed the Shard Number in the Recent Compactions System Table
- CDC Jobs
  - Added Support for Truncating Columns Written to Amazon Redshift
  - Replication Jobs Support COMMIT_INTERVAL Option for Amazon Redshift Targets
  - Intermediate Files for CDC Metadata Now Deleted
- SQL Queries
  - Removed $commit_time Column from System Tables
  - Query Iceberg Tables for Preview within Upsolver
- Enhancements
  - NEW! Mirror Iceberg Tables to Snowflake
  - Apache Iceberg Partition Improvements
  - Default Value Updated for External Iceberg Tables
  - Extended Options Available When Creating Iceberg Tables
  - Number of Upload File Part Requests Reduced for Amazon S3
  - Temporary Files Created for Ingestion Jobs Now Deleted After Use
  - Support Added for Splitting Files Across Shards Within a Partition
  - Performance Improvements Made to Iceberg Table Compaction
UI
NEW! Data Ingestion Wizard Support for Direct Ingestion to Amazon Redshift
Great news! We have extended the capabilities of our zero-ETL Data Ingestion Wizard to enable you to create jobs ingesting directly to Amazon Redshift. This means you don't need to write any code if you want a hands-off experience, or you can simply use the wizard as a starting point for building new jobs.
All you need to do is follow the step-by-step instructions in the wizard to connect to your source and target, and map the data between the two platforms. Check out the Quickstart guide in our documentation for further instructions.
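If you prefer working in SQL rather than the wizard, a direct ingestion job to Redshift follows the same COPY FROM pattern as our other ingestion jobs. The snippet below is a minimal sketch only: the job, connection, bucket, schema, and table names are placeholders, and the exact source options and the form of the INTO target for your setup may differ, so treat it as a starting point rather than a finished job.

CREATE SYNC JOB load_orders_to_redshift
    CONTENT_TYPE = JSON                 -- format of the source files (placeholder)
AS COPY FROM S3 my_s3_connection        -- placeholder S3 connection
    BUCKET = 'my-bucket'
    PREFIX = 'orders/'
INTO my_redshift_connection.orders_schema.orders_raw;  -- placeholder Redshift connection.schema.table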
NEW! Data Ingestion Wizard Maps Source and Target Schemas for CDC Jobs
While we’re on the topic of our Data Ingestion Wizard, we have also added the ability to map selected source schemas to target schemas when creating replication (CDC) jobs. If target schemas do not already exist, they will be created automatically, meaning the source and target schemas will be identical.
This option was previously available only via SQL, and we have also improved the UI experience for configuring source-to-target schema mapping to make this easier to achieve.
The wizard supports ingesting data from CDC-enabled databases, including Microsoft SQL Server, MongoDB, MySQL, and PostgreSQL.
Jobs
Fixed Issue with Multipart File Uploads to Amazon S3 Not Cleaned Up
We fixed an issue where incomplete multipart upload files were not cleaned up in jobs writing to Amazon S3.
Fixed Interval Issue in Non-Sync Jobs Reading from Apache Iceberg
Non-sync jobs reading from Apache Iceberg with an interval larger than one minute were not reading all data from the source table. This issue has now been resolved.
Fixed Interval Issue in Jobs Writing to Hive Tables
Jobs writing to Hive tables with an interval equal to or larger than seven days could become stuck unless extra configuration was performed by the Upsolver support team. This is now fixed.
Improvements to Iceberg Schema Evolution
Schema evolution improvements have been made for jobs writing to Apache Iceberg, so that new fields arriving with a suffix and a matching type are assumed to match existing fields in the dataset.
Fixed the Bug Preventing Jobs Writing to Tables with End Time Marked Completed
Fixed a bug that prevented a job writing to a table with an end time from being marked as Completed when the end time was reached.
Fixed Issue with Jobs Writing to Iceberg Not Producing Rows
An issue was discovered whereby jobs writing to Apache Iceberg tables did not produce any rows when the job’s source included a system table. We’re happy to report this has now been fixed.
If you have not yet explored the wealth of metadata in the Upsolver system tables, please browse our documentation to learn more about the data we expose that you can utilize in your jobs.
Data Retention Support for External Iceberg Tables
As part of our continuous support and expansion of features for Apache Iceberg, we added support for setting the data retention policy on external Iceberg tables. External Iceberg tables are tables that Upsolver manages (we continuously compact and tune the data files) but that are not the target of an Upsolver data ingestion job.
Discover how to create external Iceberg tables in Upsolver.
Fixed the Dangling Files Issue from Failed Iceberg Compaction
We fixed an issue where dangling files were created during failed Apache Iceberg compactions.
Fixed the Shard Number in the Recent Compactions System Table
The incorrect shard number displayed in the system.monitoring.recent_compactions system table has now been resolved. The shard number is used for compaction.
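If you want to check the values now reported, a quick query against this system table from a Worksheet is enough (shown here purely as an illustration):

SELECT *
FROM system.monitoring.recent_compactions
LIMIT 10;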
CDC Jobs
Added Support for Truncating Columns Written to Amazon Redshift
We now support using the TRUNCATE_COLUMNS job option for jobs writing to an Amazon Redshift database. This option specifies whether VARCHAR or CHARACTER VARYING fields with a string length longer than the target column should be truncated to enable the insert.
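As a rough sketch of where the option sits (the job, connection, and table names below are placeholders, and your own job will need its source options configured), it is declared alongside the other job options:

CREATE SYNC JOB load_events_to_redshift
    TRUNCATE_COLUMNS = TRUE             -- truncate oversized string values so the insert succeeds
AS COPY FROM S3 my_s3_connection        -- placeholder source connection
    BUCKET = 'my-bucket'
    PREFIX = 'events/'
INTO my_redshift_connection.events_schema.events_raw;  -- placeholder Redshift target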
Learn more about writing data to Amazon Redshift.
Replication Jobs Support COMMIT_INTERVAL Option for Amazon Redshift Targets
Last month we introduced replication jobs for Amazon Redshift targets. Replication jobs ingest data from a CDC-enabled database, such as SQL Server or PostgreSQL. These special jobs comprise one or more replication groups, and each group can be individually configured to write to a different target. This release adds a custom commit interval option that is applied at the replication group level, so targets are updated only when required.
In the example below, the data is ingested from MongoDB once, but is written to two different schemas in Redshift. The inclusion of the COMMIT_INTERVAL option within the two replication groups enables the production schema to be updated every 5 minutes, while the UAT schema is refreshed hourly:
CREATE REPLICATION JOB mongodb_replication_to_redshift
  COMMENT = 'Replicate MongoDB CDC data to Redshift groups'
  COMPUTE_CLUSTER = "Default Compute (Free)"
  INTERMEDIATE_STORAGE_CONNECTION = s3_connection
  INTERMEDIATE_STORAGE_LOCATION = 's3://upsolver-integration-tests/test/'
FROM my_mongodb_connection
WITH REPLICATION GROUP replicate_to_redshift_prod
  INCLUDED_TABLES_REGEX = ('orders\..*')
  EXCLUDED_COLUMNS_REGEX = ('.*\.creditcard') -- exclude creditcard columns
  COMMIT_INTERVAL = 5 MINUTES
  LOGICAL_DELETE_COLUMN = "is_deleted"
  REPLICATION_TARGET = my_redshift_connection
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS'
  TARGET_TABLE_NAME_EXPRESSION = $table_name
  WRITE_MODE = MERGE
WITH REPLICATION GROUP replicate_to_redshift_uat
  INCLUDED_TABLES_REGEX = ('orders\..*')
  COMMIT_INTERVAL = 1 HOUR
  REPLICATION_TARGET = my_redshift_connection
  TARGET_TABLE_NAME_EXPRESSION = 'history_' || $table_name
  TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS_UAT'
  WRITE_MODE = APPEND;
Check out the documentation to learn more about creating replication jobs.
Intermediate Files for CDC Metadata Now Deleted
When ingesting CDC data from a database platform, Upsolver uses intermediate files to store the metadata from the source. We have enhanced this process to correctly delete intermediate files that are no longer required.
SQL Queries
Removed $commit_time Column from System Tables
We have removed the $commit_time column from all Upsolver system tables, so please check your queries for references to this column.
Query Iceberg Tables for Preview within Upsolver
Customers can now query Iceberg tables using Upsolver’s query engine for previewing data. Simply open a Worksheet in Upsolver to run a query against your Iceberg tables.
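For example, you could run an ad-hoc query like the one below from a Worksheet (it uses the orders table from the example later in these notes; substitute your own catalog, database, and table names):

SELECT order_id, order_date, net_total
FROM default_glue_catalog.my_database.orders_data
LIMIT 100;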
Enhancements
NEW! Mirror Iceberg Tables to Snowflake
We have added functionality to enable you to query your Upsolver-managed tables from Snowflake, and it’s super straightforward to set up. When you create a mirror of your Iceberg table in Snowflake, we maintain an external Iceberg table in Snowflake, keeping it up-to-date by pointing to the latest snapshot of the table. The steps are as follows:
- In Upsolver, create a connection to your Snowflake database
- Create a catalog integration in Snowflake
- In Snowflake, configure an Amazon S3 external volume
- Create your Upsolver-managed Iceberg table
- Create a mirror table in Snowflake
- Use Upsolver to create an ingestion job to populate your table
Follow the full guide on How to Query Upsolver Iceberg Tables from Snowflake, or check out the SQL reference to mirror Iceberg tables.
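Steps 2 and 3 take place on the Snowflake side. As a rough sketch of those two steps only (the integration, volume, bucket, and role names below are placeholders, and your account will also need the matching IAM trust configuration), the Snowflake DDL looks something like this:

-- Step 2: catalog integration for externally managed Iceberg tables (placeholder name)
CREATE CATALOG INTEGRATION upsolver_catalog_integration
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;

-- Step 3: external volume pointing at the S3 location holding the Iceberg data (placeholder values)
CREATE EXTERNAL VOLUME upsolver_iceberg_volume
  STORAGE_LOCATIONS = ((
    NAME = 'iceberg-data'
    STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://my-bucket/iceberg/'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-iceberg-role'
  ));

Refer to the guide above for the Upsolver-side connection, table, mirror, and ingestion job statements.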
Apache Iceberg Partition Improvements
This month, we made two improvements for working with Apache Iceberg partitions. First, we added support for creating Iceberg tables with hidden partitions using partition transforms. Second, we now support setting retention on partitions of type Long/Integer, enabling retention on hidden partitions that use time truncation transforms such as year/month/day/hour.
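As an illustrative sketch only (we are assuming here that the transform is written using Iceberg's standard transform functions inside the PARTITIONED BY clause; check the table documentation for the exact syntax Upsolver expects), a table with a hidden daily partition would look something like:

CREATE ICEBERG TABLE default_glue_catalog.my_database.events_by_day
(
  event_id string,
  event_time timestamp
)
PARTITIONED BY day(event_time);  -- hidden partition via an Iceberg partition transform (syntax assumed)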
Default Value Updated for External Iceberg Tables
An external Iceberg table is a table that is not the target of an Upsolver ingestion job. When you create an external Iceberg table using Upsolver, you can optionally specify that we manage the optimization for you. When you request optimization, we handle file compaction and performance tuning to keep the table size as small as possible to reduce storage costs, and to ensure query scans are fast and efficient.
We have updated the default value for the OPTIMIZE property to FALSE when creating external Iceberg tables, but you can change this as required. In the example below, the statement creates an Iceberg table that Upsolver will compact and tune for you until you change the value to FALSE.
CREATE EXTERNAL ICEBERG TABLE
my_glue_catalog.my_database.my_external_iceberg_table
OPTIMIZE = TRUE;
Alternatively, simply exclude the OPTIMIZE option when creating external Iceberg tables and we won't optimize them.
Learn more about creating and maintaining Iceberg tables in Upsolver.
Extended Options Available When Creating Iceberg Tables
A couple of crucial changes were made to the syntax for creating Upsolver-managed Iceberg tables. These are tables within Iceberg that are a destination for an Upsolver ingestion job, and are automatically optimized by us to maintain an optimal table (file) size to reduce storage costs and ensure queries are performant.
This month we extended the syntax for creating Iceberg tables to enable customers to configure any table property documented by Apache Iceberg, simply by including the iceberg prefix in the syntax, followed by the property:
CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_data
(
  order_id string,
  order_date date,
  customer_email string,
  net_total bigint,
  num_items bigint,
  customer_address string,
  customer_city string,
  customer_state string,
  customer_zipcode bigint
)
PARTITIONED BY order_date
PRIMARY KEY order_id
ORDER BY net_total DESC, order_date ASC
STORAGE_CONNECTION = s3_connection
STORAGE_LOCATION = 's3://bucket/storage_location'
COMPUTE_CLUSTER = "my cluster"
DISABLE_COMPACTION = false
iceberg."read.split.planning-lookback" = '10'
COMMENT = 'Orders table';
The other change is that we removed the COLUMN_DATA_RETENTION property when creating Iceberg tables, as this property is only supported in Upsolver-managed Hive tables.
Please see our documentation to learn more about creating and maintaining Iceberg tables.
Number of Upload File Part Requests Reduced for Amazon S3
To increase job performance, we reduced the number of upload file part requests to Amazon S3 by using larger parts.
Temporary Files Created for Ingestion Jobs Now Deleted After Use
Temporary intermediate files created by ingestion jobs are now deleted after use. This feature is being enabled gradually.
Support Added for Splitting Files Across Shards Within a Partition
Output files can now be split across shards within a partition for efficiency and performance.
Performance Improvements Made to Iceberg Table Compaction
Finally, we have improved the compaction speed for sorted Iceberg tables, so you should feel the benefits on your Upsolver-managed tables following our extensive tuning process.
So that’s everything for July! I’ll be back in August, but before then, please follow us on LinkedIn for all the latest news and announcements, or book a demo if you have not yet given Upsolver a test drive.