Upsolver June 2024 Feature Summary
Can you believe we’re halfway through 2024 and it’s June already? Having been with Upsolver just over a year now, I have observed the platform undergo a whole heap of improvements thanks to the dedication of our product and engineering teams. Over the last few weeks, our engineers have worked their usual magic to hunt down bugs and make enhancements that further solidify the Upsolver platform.
The exciting news this month is our support for creating replication jobs that write to Amazon Redshift. Having spent over half of my lifetime working with SQL Server, I’m always thrilled when the team releases capabilities that support CDC. If you haven’t discovered replication jobs yet and want to ingest from a database source (we support MongoDB, MySQL, PostgreSQL, and SQL Server), please check out the documentation to find out for yourself how easy this is to implement.
Replication jobs enable you to ingest CDC data from your database into multiple target schemas, and each schema can be configured with its own set of options. The SQL syntax for creating a replication job is straightforward and easy to understand (it’s so cool, I wish I’d designed and coded this feature!). As well as supporting Amazon Redshift, replication jobs can also write to Snowflake, Tabular, and AWS Glue Data Catalog targets.
This month’s update includes plenty of fixes and enhancements, ahead of new features that I look forward to sharing with you next month. While I polish my CV and apply for a job with our engineering team, check out the summary of this month’s changes. And if you want a guided tour of Upsolver – whether you have CDC, streaming, or file data to ingest – please book your free demo.
The following releases are included in this update:
- 2024.05.28-11.58
- 2024.05.23-08.45
- 2024.05.19-13.23
- 2024.05.12-10.26
- 2024.05.09-08.33
- 2024.05.02-15.37
Contents
- Jobs
- CDC Jobs
- SQL Queries
- Enhancements
- Upgraded the Amazon Redshift Driver
- Libraries Upgraded and Patched
- Enhanced Schema Evolution For Amazon Redshift Targets
- Support Casting to JSON for Amazon Redshift
- Snowflake Target Schema Created if Not Existing in Target
- Improvements made to the Handling of Apache Iceberg Files
- Flag Added to System Table to Denote Managed Iceberg Tables
- Fixed an Error While Reading Positional Delete Files
Jobs
Issue Fixed with Deleting Materialized View Files
There was an issue where 2-minute merge files were sometimes not fully deleted when a materialized view was deleted. This problem has now been resolved.
Fixed Temporary Files Not Being Deleted in Index Creation
A bug where temporary files were not deleted when an index creation task failed has now been fixed.
Support for Managed Iceberg Tables in Non-Sync Jobs
We have added support for using non-sync jobs to write to Upsolver-managed Iceberg tables. Please check out this article to understand the difference between sync and non-sync jobs.
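To illustrate, here is a minimal sketch of a non-sync transformation job (note the absence of the SYNC keyword) writing to an Upsolver-managed Iceberg table. All connection, catalog, and table names below are hypothetical, and the time filter expression can vary by job type, so treat this as a starting point rather than a copy-paste recipe:
-- Hypothetical non-sync job (no SYNC keyword) writing to a managed Iceberg table
CREATE JOB load_events_to_iceberg
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.demo.managed_iceberg_events MAP_COLUMNS_BY_NAME
SELECT event_id, event_time, event_type
FROM default_glue_catalog.demo.raw_events
WHERE $event_time BETWEEN run_start_time() AND run_end_time(); -- non-sync time filter; see the docs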
Snowflake Ingestion Jobs Now Use Commit Interval
I’m pleased to report that ingestion jobs writing to a Snowflake target now use the given commit interval. This change applies only to jobs created from now on, not to jobs created prior to this fix.
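For illustration, here’s a minimal sketch of a new ingestion job writing to Snowflake with a commit interval; the connection, bucket, and table names are made up:
-- Hypothetical ingestion job: data is committed to Snowflake every 5 minutes
CREATE SYNC JOB load_orders_to_snowflake
    CONTENT_TYPE = JSON
    COMMIT_INTERVAL = 5 MINUTES
AS COPY FROM S3 my_s3_connection
    LOCATION = 's3://my-bucket/orders/'
INTO SNOWFLAKE my_snowflake_connection.DEMO.ORDERS;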
Fixed the Bug When Casting to JSON in Iceberg Tables
A bug in jobs writing to Iceberg tables, where CAST to JSON expressions caused double encoding of fields inside the JSON string, has now been fixed.
CDC Jobs
NEW! Replication Jobs Support for Amazon Redshift Target
Replication jobs in Upsolver enable you to use one job to copy data from a single source into multiple target schemas. This super handy feature is also easy to set up and maintain. Each target schema is managed by a replication group, which has its own set of options that can be configured independently of other groups.
We are delighted to announce that we now support Amazon Redshift as a target. In the example below, a job is created to ingest data from a CDC-enabled Microsoft SQL Server database into a production schema and a development schema in the target Redshift database:
CREATE REPLICATION JOB mssql_replication_to_redshift
COMMENT = 'Replicate SQL Server CDC data to Redshift groups'
COMPUTE_CLUSTER = "Default Compute (Free)"
INTERMEDIATE_STORAGE_CONNECTION = s3_connection
INTERMEDIATE_STORAGE_LOCATION = 's3://upsolver-integration-tests/test/'
FROM my_mssql_connection
-- Production group: merge changes into live tables every 5 minutes
WITH REPLICATION GROUP replicate_to_redshift_prod
INCLUDED_TABLES_REGEX = ('orders\..*')
EXCLUDED_COLUMNS_REGEX = ('.*\.creditcard') -- exclude creditcard columns
COMMIT_INTERVAL = 5 MINUTES
LOGICAL_DELETE_COLUMN = "is_deleted"
REPLICATION_TARGET = my_redshift_connection
TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS'
TARGET_TABLE_NAME_EXPRESSION = $table_name
WRITE_MODE = MERGE
-- Development group: append hourly to history_* tables
WITH REPLICATION GROUP replicate_to_redshift_dev
INCLUDED_TABLES_REGEX = ('orders\..*')
COMMIT_INTERVAL = 1 HOUR
REPLICATION_TARGET = my_redshift_connection
TARGET_TABLE_NAME_EXPRESSION = 'history_' || $table_name
TARGET_SCHEMA_NAME_EXPRESSION = 'ORDERS_DEV'
WRITE_MODE = APPEND;
Each group can be configured to run at a different interval and use its own options. Explore the documentation to learn how to use replication jobs – my favorite feature in Upsolver!
Fixed Snapshotting Process Becoming Stuck
We discovered an issue whereby the process could become stuck during the table snapshotting phase. This has been fixed, and snapshotting now runs smoothly.
Read about performing snapshots in your CDC jobs.
Fixed CDC Client Becoming Stuck
In another case of things getting jammed, the CDC client could become temporarily stuck after a server replacement. I’m happy to report that this has now been fixed.
SQL Queries
Made GROUP BY Expressions Case Insensitive in Transformation Jobs
We have fixed a case sensitivity issue in transformation jobs so that GROUP BY expressions are now case insensitive relative to their corresponding SELECT expressions.
In the example below, we aggregate the orders based on the customer’s city and compute the total revenue per city. With the case sensitivity removed from the GROUP BY clause, we can specify CITY in uppercase, and it will be parsed without error, returning the correct aggregation on the city column.
CREATE JOB group_by_test
START_FROM = BEGINNING
ADD_MISSING_COLUMNS = TRUE
RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.upsolver_samples.orders_aggregated_data
MAP_COLUMNS_BY_NAME
SELECT customer.address.city AS customer_address_city,
SUM(nettotal) AS total_revenue
FROM default_glue_catalog.upsolver_samples.orders_raw_data
WHERE TIME_FILTER()
GROUP BY customer.address.CITY; -- uppercase CITY matches customer.address.city in the SELECT
Please see the documentation for more information on GROUP BY, and this guide that shows you how to aggregate data in a transformation job.
Enhancements
Upgraded the Amazon Redshift Driver
The driver for Amazon Redshift has been upgraded to the latest version.
Libraries Upgraded and Patched
Further to the above upgrade, we have also upgraded all libraries to include recent security patches, thereby enhancing system security and stability.
Enhanced Schema Evolution For Amazon Redshift Targets
Upsolver handles schema evolution for source data written to an Amazon Redshift target, whether you use an ingestion, replication, or transformation job.
We automatically create and modify tables in your Amazon Redshift database to accommodate changes in the data source. When a change in data type is encountered and the new type is wider than the original, Upsolver creates a new column in the target table. This prevents errors and data loss.
Learn how Upsolver handles schema evolution and resolves type casting when writing to Amazon Redshift.
Support Casting to JSON for Amazon Redshift
We have made enhancements to the Upsolver API to enable casting to JSON in jobs writing to Amazon Redshift.
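As a minimal sketch (the connection, schema, and column names are hypothetical), a transformation job writing to Redshift could cast a nested field to JSON like so:
-- Hypothetical job: cast a nested payload to JSON when writing to Redshift
CREATE SYNC JOB load_events_to_redshift
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO REDSHIFT my_redshift_connection.prod.events MAP_COLUMNS_BY_NAME
SELECT event_id,
       CAST(payload AS JSON) AS payload_json
FROM default_glue_catalog.demo.raw_events
WHERE TIME_FILTER();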
Snowflake Target Schema Created if Not Existing in Target
When writing to a Snowflake database, if the target schema does not exist, it will now be created automatically before the table is created.
Learn how we manage schema evolution in Snowflake.
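As a quick illustration (all names here are hypothetical), a job can now target a Snowflake schema that doesn’t yet exist, and Upsolver will create it before creating the table:
-- Hypothetical job: NEW_SCHEMA does not yet exist in Snowflake and is created automatically
CREATE SYNC JOB load_events_to_new_schema
    CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_connection
    LOCATION = 's3://my-bucket/events/'
INTO SNOWFLAKE my_snowflake_connection.NEW_SCHEMA.EVENTS;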
Improvements made to the Handling of Apache Iceberg Files
When you choose Upsolver to manage your Iceberg tables, we take care of compacting your data files so you experience consistently performant queries and reduced storage costs. We have made improvements to how we delete dangling files from Iceberg tables, and to how we expire old table snapshots.
Flag Added to System Table to Denote Managed Iceberg Tables
We have added the is_external column to the system table system.information_schema.tables to flag tables that are Iceberg tables managed by Upsolver but are not a target for an Upsolver ingestion job.
To query this system table, log in to Upsolver and click Worksheets in the main menu on the left-hand side. In a new worksheet, run the following query:
SELECT *
FROM system.information_schema.tables;
-- If you have a lot of tables, use LIMIT to reduce the result set
-- LIMIT 50;
Fixed an Error While Reading Positional Delete Files
During the process of compacting Iceberg data files, an error occurred while reading positional delete files. This has now been corrected.
That’s it for this month! I look forward to bringing you all the latest news in July. In the meantime, please follow us on LinkedIn for all the latest news and announcements, or book a demo if you have not yet given Upsolver a test drive.