A closer look at Azure Synapse Link


According to Mary Jo Foley, this year's digital edition of Microsoft Build marked a return to basics: a focus on developers. Scanning ZDNet's saturation coverage from last week, gone were the avatars from Microsoft 365, replaced by a renewed focus on the core developer audience. We won't attempt to cover the waterfront here; instead, we'd like to look at one highlight data announcement from last week: Microsoft's expansion of its still-new Azure Synapse Analytics platform with a new feature for bringing operational data in: Azure Synapse Link.

To summarize, the new Azure Synapse Link feature expands Synapse's footprint to Azure Cosmos DB; Microsoft intends to later extend the capability on the SQL side to Azure SQL Database and Azure Database for PostgreSQL. While Synapse's original mission was unifying the data warehouse and the data lake, the new Link feature extends its reach to operational data.

As Azure's newest data service (it was only unveiled for preview last fall), it now takes most of the limelight when it comes to new announcements. In fact, aside from core functions, such as the core data warehouse with provisioned compute; workload isolation within a single cluster; and materialized views, most of Azure Synapse Analytics' differentiating features are still in public preview. That group includes Azure Synapse workspaces; the new web-based development studio; SQL Serverless; Apache Spark for Synapse; and pipelines that draw on existing Azure Data Factory technology. This is by no means a finished production product yet.

While technically a rebranding of the old Azure SQL Data Warehouse, Azure Synapse Analytics is really a different platform, as it significantly reworks its predecessor's offering. As we've noted before, Azure Synapse Analytics is part of a broader trend of cloud data warehouses spreading their footprints both upstream, to ingest and transform data, and downstream, to deliver analytics. It's a pattern we also see SAP and Oracle following.

Big on Data colleague Andrew Brust delivered a good blow-by-blow description of the new release, but after reading the account, we were still nagged by a key question: How does Azure Synapse Link pull operational data without affecting the performance of Cosmos DB?

Initially, we thought Synapse Link connected to Cosmos DB through a federated query, where the query is submitted to Synapse and the processing is pushed down to the data at its source, with the tables in the external database represented as virtual (or external) tables. That's how a number of data stores work when they reach out to other data sources: it's how Amazon Redshift Spectrum queries data in S3, and how Microsoft SQL Server PolyBase, Oracle Big Data SQL, and IBM Big SQL connect to data in cloud storage or other data stores.
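To make the federated-query idea concrete, here is a minimal, purely illustrative sketch of predicate pushdown: the class names and the in-memory "remote" store are hypothetical, not any actual PolyBase or Redshift Spectrum API. The point is that the filter travels to the source, so only matching rows cross the wire.

```python
class RemoteSource:
    """Stands in for an external data store (e.g. files in object storage)."""
    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # how many rows crossed the "network"

    def scan(self, predicate):
        # The predicate is evaluated at the source, so only matching
        # rows are shipped back to the query engine.
        matches = [r for r in self.rows if predicate(r)]
        self.rows_shipped += len(matches)
        return matches


class ExternalTable:
    """A virtual table the warehouse exposes over a remote source."""
    def __init__(self, source):
        self.source = source

    def select(self, predicate):
        # Push the filter down instead of pulling the whole table over.
        return self.source.scan(predicate)


orders = RemoteSource([
    {"id": 1, "region": "EU", "total": 120},
    {"id": 2, "region": "US", "total": 80},
    {"id": 3, "region": "EU", "total": 45},
])
virtual_orders = ExternalTable(orders)

eu_orders = virtual_orders.select(lambda r: r["region"] == "EU")
print(len(eu_orders))       # 2
print(orders.rows_shipped)  # 2 -- only the filtered rows moved
```

The external table holds no data of its own; queries against it resolve at the source at run time, which is exactly what we initially assumed Synapse Link was doing.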

But Azure Synapse Link is a different creature – it extends the data warehouse to include data populated from Cosmos DB via the equivalent of a change data capture stream. To enable it, once you have Synapse operational, you go into Cosmos DB and check the option to enable the analytical store.

That is, as operations are performed on data in Cosmos DB, the changes are automatically pushed into a column-oriented format optimized for analytics, without affecting the performance of the source system. Synapse can then query this data entirely independently of Cosmos DB, yet effectively in near real time, as changed operational data is kept in sync with minimal delay.
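A toy sketch can show the shape of this change-feed pattern: writes land in a row store, each change is appended to a feed, and a sync step folds the feed into a column-oriented copy that analytics queries independently. All names here are illustrative; this is not the Cosmos DB or Synapse API, just the general mechanism under simplifying assumptions (append-only, single feed consumer).

```python
row_store = {}      # operational side: document id -> document
change_feed = []    # ordered log of changes
column_store = {}   # analytical side: field name -> list of values

def upsert(doc_id, doc):
    """Operational write path: update the row store, record the change."""
    row_store[doc_id] = doc
    change_feed.append(dict(doc, _id=doc_id))

def sync():
    """Drain the change feed into the column-oriented analytical copy."""
    while change_feed:
        change = change_feed.pop(0)
        for field, value in change.items():
            column_store.setdefault(field, []).append(value)

upsert("a1", {"city": "Paris", "amount": 10})
upsert("a2", {"city": "Oslo", "amount": 25})
sync()

# Analytics runs against the columnar copy, never the row store.
total = sum(column_store["amount"])
print(total)  # 35
```

Because the analytical query touches only the columnar copy, the operational store pays no cost at query time – the price is the storage and the small sync delay, which mirrors the trade-off Synapse Link makes.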

The automatic change data feed approach taken by Azure Synapse Link is not that unusual – for on-premises databases. For instance, MariaDB's X4 platform automatically replicates row-based transactions to its column store, while Oracle Database In-Memory and IBM Db2 BLU Acceleration let customers selectively replicate to in-memory column stores. But surprisingly, it is not yet widely implemented in the cloud; AWS and GCP require building data pipelines to get transaction data into their data warehouses.

So until AWS and Google respond, this is where Azure Synapse Link stands apart. The upshot is that Azure Synapse Link turns the data warehouse into an operational system that can generate analytics on both historical and current data. The flip side, though, is that you end up paying to store the same data at least twice – not counting the copies each service routinely maintains. That makes Link well suited to operational analytics, but not as cost-effective for ad hoc use, where you might prefer to pay only per query (the way Amazon Athena works when running SQL queries against S3). This is where we'd like to see Microsoft add an option down the road for ad hoc querying, for scenarios that don't require constant updating or where data changes are less frequent.
