About six months after it was unveiled at the Ignite conference last fall, Microsoft took a group of analysts and MVPs on a deep dive into Azure Synapse Analytics. As noted in Andrew's coverage last fall, Azure Synapse Analytics is a rebranding and expansion of Azure SQL Data Warehouse, extending its footprint to span data warehouses, data lakes and data integration within a single cloud service.
The guiding idea is to get as close as possible to a single source of truth, which in this case means converging the data warehouse, the data lake and data integration. This is far more difficult than it sounds, as it brings together not only highly curated relational data with a wider range of variable and semi-structured data, but also different groups of practitioners whose skills, methods and compute requirements are often diametrically opposed.
On one end, you have database developers who are skilled at working with SQL, while at the other end of the spectrum, data scientists and developers working off the data lake typically do programmatic analysis in languages like Python. Data warehouses, like any relational system, have typically been used for production and operational scenarios that require reliable performance, and often the ability to serve a large population of users, while data lakes are more associated with experimentation against widely varied data sets and less predictable workloads that serve a handful of end users.
The result is that you have different workload characteristics, different data types and different access patterns. It is the same rationale that gave rise to data warehouses many years ago, when query and reporting workloads began weighing down operational systems. But with Azure Synapse, Microsoft is looking, on the analytics side, to bring the two poles together. Although Azure Synapse is a generally available service today, the expanded platform is barely six months out of the gate. So while Azure Synapse has the capabilities to support business analysts and data scientists, there are still pieces that are falling into place.
Let’s start with keeping the lights on. Workload management has been a well-known challenge in data warehousing for years – demand patterns for ad hoc query, end-of-period reporting and complex analysis are well understood, and for years, turnkey data warehousing providers like Teradata have offered families of models optimized, respectively, for data-intensive, compute-intensive, high- or low-concurrency, and “balanced” workloads to maximize the yield of compute resources.
When Hadoop came along, it was assumed that the bulk of workloads would be data-intensive, and so compute was moved to the data. Enter cloud-native architecture, and the pendulum swung back toward separating compute from data for economic reasons (analytic workloads tend to be spiky, so why pay for compute you don’t always use), with the high bandwidth of modern cloud backbones addressing the data movement problem. Then came AI, which, depending on whether it is machine learning or deep learning, has yet different resource needs.
So bringing the data warehouse together with the data lake is no mean feat. Azure Synapse has tackled the workload problem with a cloud-native architecture that builds on the separation of compute from storage in SQL Data Warehouse Gen 2 and extends that concept to heterogeneous SQL and Spark compute within a single service. It currently uses Azure Data Lake Storage (ADLS) Gen 2, which is designed to deliver the economics of cloud object storage with better performance by exposing the data through a POSIX-compliant file system API. Azure Synapse Analytics also offers a multi-level hierarchical cache in the SQL engine that automatically moves data between performance tiers (which include disk storage and an NVMe SSD cache) depending on the user’s workload, while Spark analysis runs on high-memory (8 GByte/node) instances.
Functionally, Azure Synapse Analytics starts by combining Azure Data Factory with Azure SQL Data Warehouse – the former remains available as a standalone service, while Azure Synapse replaces the latter. And while it doesn’t bundle Power BI or Azure Machine Learning directly into the same service, integrations are built in at the metadata and user interface levels, so the flow is natural.
Azure Synapse uses the concept of a workspace to organize data and code or query artifacts. The workspace can surface as a low-code/no-code tool for business analysts, or as a Jupyter-like notebook for data engineers and data scientists to work in Spark or invoke machine learning models. In the demos, Microsoft showed how the same data transformation task could be developed via either path. There will be some differences in the experience – for instance, while Synapse inherits Azure SQL Data Warehouse’s ability to support high concurrency, Spark environments have typically involved lone-wolf data scientists or data engineers. There are also differences in the maturity of data security – practices are far more mature on the relational database side, with table-, column- and built-in row-level security, but not so mature on the data lake side. This is an area where Cloudera differentiates itself with SDX, which is available as part of its platform offering.
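To make the two paths concrete, here is a minimal, hypothetical sketch of the kind of "clean and aggregate" transformation a data engineer might build in the notebook path. All names here (`normalize_region`, the `sales_raw` path, the column names) are invented for illustration; the core logic is shown as plain Python, with the equivalent Spark DataFrame calls sketched in comments since they require a live Synapse Spark session.

```python
# Hypothetical example: the core of a simple data-cleaning transformation,
# written as a plain function so the logic is clear independent of Spark.
def normalize_region(region):
    """Trim whitespace and uppercase a region name; map empty/missing to UNKNOWN."""
    if not region or not region.strip():
        return "UNKNOWN"
    return region.strip().upper()

# In a Synapse Spark notebook, the same logic would typically be expressed
# with the DataFrame API against files in the lake, e.g. (sketch only --
# the storage path and table name below are placeholders):
#
#   from pyspark.sql import functions as F
#   df = spark.read.parquet("abfss://lake@myaccount.dfs.core.windows.net/sales_raw")
#   cleaned = (df.withColumn("region", F.upper(F.trim(F.col("region"))))
#                .groupBy("region")
#                .agg(F.sum("amount").alias("total")))
#   cleaned.write.saveAsTable("sales_by_region")

print(normalize_region("  west eu "))  # WEST EU
print(normalize_region(None))          # UNKNOWN
```

The same cleanup rule could also be assembled in the low-code mapping-data-flow path; the point of the workspace is that both artifacts live side by side against the same data.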
Owing to the early stage of the Spark implementation, Python is currently supported, but R is not there yet. Given Python’s momentum, that is probably not a show-stopper for most data scientists.
Since this is a highly optimized platform, it is not surprising that Microsoft has made some customizations to its Spark and Jupyter-style interactive notebook implementations, and that not all Spark libraries are currently supported. Without getting into the weeds, Microsoft is aiming for a more complete Spark implementation in Azure Synapse when Spark 3.0 comes out. Still, for data scientists and engineers who want the pure Spark experience, Azure Databricks will remain the better choice.
So what’s on our wish list?
Currently, Azure Synapse Analytics works on the notion of a single data lake composed of relational tables, folders, and files of various formats. Going forward, we would like to see it reach more of the data platforms in the Azure portfolio, as we view the data lake as the collection of data, wherever it lives, across the enterprise. To that end, for Spark practitioners, we would like to see first-party integration with Azure Databricks. There is room to extend the supported compute options, especially for AI workloads that require GPUs or ASICs. We would also like to see a hybrid strategy, where Microsoft already has a foot in the door with Azure Stack and Azure Arc. And we want to see an Azure Synapse partner program that provides tight integration and support for third-party tools that can plug into the workspaces.
Oh, and one more thing. Today, Power BI and Azure Machine Learning are treated as adjacent services – as mentioned above, they are integrated with Synapse, but they are not bundled into the service. In the longer term, we believe both should be packaged as integral parts of Azure Synapse. Today, virtually all customers using Synapse also use self-service visualization, whether with Power BI or third-party tools such as Tableau. The same is not quite true of machine learning yet, but we expect that to change fairly quickly – within the next few years or less – as internally developed or pre-built third-party models become ubiquitous. The handwriting is on the wall.
This is not Microsoft’s first attempt at bridging the data warehouse with the data lake. On premises, there was SQL Server 2019 Big Data Clusters, which placed a SQL Server engine on each node of a Hadoop cluster, making the data lake (as originally defined: clusters with data stored in HDFS) available for SQL query. But Azure Synapse is a complete rethink. More than just making big data available to SQL and Python developers alike, it also changes the development environment by introducing “workspaces.” It addresses a wider swath of the analytics lifecycle, from data capture, transformation and integration all the way through self-service visualization, and even collaboration via embedding Power BI reports into Microsoft Teams.
More broadly, Azure Synapse reflects the fact that cloud providers can break down silos in the tool chain to present more comprehensive offerings that cover more of the lifecycle. Microsoft is hardly the only provider going this route. SAP Data Warehouse Cloud takes a similar approach by integrating SAP Analytics Cloud to provide the last mile of self-service visualization, while Oracle has begun publicly talking about expanding Autonomous Data Warehouse into a broader platform offering that, like Azure Synapse, would encompass more of the lifecycle (we expect Oracle Analytics Cloud integration to be a core component). So now we wait for the next shoes to drop, from AWS and Google.