Enterprise software development and open source big data analytics technologies have largely existed in separate worlds. This is especially true for developers in the Microsoft .NET ecosystem. The reasons for this are many, including .NET's Windows legacy and the open source analytics stack's fidelity to Linux.
But Microsoft's .NET Core, already in its third major version, is cross-platform, running not only on Windows, but on Linux and macOS as well. And Apache Spark, which has largely eclipsed Hadoop as the open source analytics poster child, has made its way into numerous Microsoft platforms, including its flagship SQL Server database and Azure Synapse Analytics, Redmond's latest gambit in the cloud data warehouse wars. Despite these developments, coders on the Spark platform have largely stuck with Scala, Python, R and Java. What was missing was something to connect the dots between .NET and Spark.
Casting to .NET
All this changed a year ago when, at the Spark + AI Summit, Microsoft introduced a preview of its .NET for Apache Spark framework, which provides bindings for developers using the C# and F# languages on the .NET platform. And the plot thickened a couple of weeks ago, when Microsoft extended .NET for Apache Spark to support in-memory .NET DataFrames, something Brigit Murtaugh, a Program Manager for .NET for Apache Spark, announced in a blog post.
I've been involved with .NET since its alpha days 20 years ago, and I've been involved in the big data world for almost half that time. I have long wanted to see these two worlds converge, and have argued for such a union. Despite that, I hadn't really investigated the .NET for Apache Spark framework (hereafter, Spark.NET) until now, choosing instead to do most of my Spark work in Python. Now that I've examined the framework more carefully, I like what I see, and I'd like to report on it. The good news: Spark.NET works well and, beyond merely integrating the two technologies, it makes their respective programming paradigms dovetail very nicely.
Microsoft has worked hard to keep the barrier to entry for Spark.NET low. Case in point: the .NET for Apache Spark site features a large white "Get Started" button that guides developers through the process of installing the framework, building a WordCount sample program, and running it. It takes the developer through installing all the necessary dependencies, performing the configuration steps, installing .NET for Apache Spark itself, and creating and executing the sample program.
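For orientation, the heart of that WordCount sample looks roughly like the following C# sketch. This is my paraphrase, not the tutorial's exact code; the input file name and app name are illustrative:

```csharp
using Microsoft.Spark.Sql;

class WordCount
{
    static void Main(string[] args)
    {
        // Entry point into Spark.NET: build (or reuse) a SparkSession.
        SparkSession spark = SparkSession
            .Builder()
            .AppName("word_count_sample")
            .GetOrCreate();

        // Read the text file into a DataFrame with a single "value" column.
        DataFrame lines = spark.Read().Text("input.txt");

        // Split each line into words, count occurrences and sort descending.
        DataFrame words = lines
            .Select(Functions.Explode(
                    Functions.Split(lines["value"], " "))
                .Alias("word"))
            .GroupBy("word")
            .Count()
            .OrderBy(Functions.Col("count").Desc());

        words.Show();
        spark.Stop();
    }
}
```

Note how familiar this will look to anyone who has written the same program in PySpark or Scala: the DataFrame API surface carries over almost verbatim, just with .NET naming conventions.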
The entire guided procedure is designed to take 10 minutes and assumes little more than a clean machine as a starting environment. I pretty much succeeded (with the caveat that I had to do some research and manually set the environment variable SPARK_LOCAL_IP to localhost to get the sample running on my Windows machine), and I have to say it's pretty easy to get it working.
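In case you hit the same snag, this is the workaround I used; it's just an environment variable setting, shown here for both a Windows cmd session and a bash shell:

```shell
# Windows cmd, in the same session where you run the tutorial:
set SPARK_LOCAL_IP=localhost

# Or persist it for future sessions:
setx SPARK_LOCAL_IP localhost

# On Linux/macOS (bash), the equivalent would be:
export SPARK_LOCAL_IP=localhost
```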
Choose your environment
The tutorial is designed to do everything from the command line, including editing an input text file and the C# code, compiling the application, and running the resulting .NET console application by calling Spark's spark-submit tool. But experienced .NET developers who prefer to work in Visual Studio 2019 can use Spark.NET from there as well.
I verified this myself. After working through the Get Started tutorial, I created a new C# console project in Visual Studio 2019, used the NuGet package manager to add Spark.NET to my project, and then replicated the coding steps from Microsoft's command-line tutorial. After building everything in Visual Studio, I submitted the job to Spark and everything ran fine.
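For reference, the command-line flow looks roughly like this. The Spark worker jar's version numbers below are examples only; the actual file name must match the Microsoft.Spark package version and Spark/Scala versions you installed:

```shell
# Build the console app, then run it via spark-submit, which hands
# execution to Spark.NET's DotnetRunner class.
dotnet build
cd bin/Debug/netcoreapp3.1

spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  microsoft-spark-2-4_2.11-<version>.jar \
  dotnet MySparkApp.dll input.txt
```

The indirection through DotnetRunner is the key trick: Spark still launches a JVM process, which in turn starts your .NET executable and brokers calls between the two runtimes.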
Ready for Spark prime time
Once you've got things running locally on a dev machine, you'll want to try running on a full-fledged Spark cluster, which these days will probably be in the cloud. The tricky part is that you will need to make sure Spark.NET is installed on the cluster before your own code can run there. Microsoft says that Spark clusters on its own Azure HDInsight service, as well as Spark pools in Azure Synapse Analytics (currently in preview), already have Spark.NET on board.
Beyond that, though, Microsoft provides explicit instructions for deploying .NET for Apache Spark to Azure Databricks and to the Databricks Unified Analytics Platform service running on Amazon Web Services. Still not impressed? Microsoft also provides installation instructions for AWS' ubiquitous Elastic MapReduce (EMR) service.
Also read: Databricks comes to Microsoft Azure
You can deploy your .NET assembly to your Spark cluster and run it as a batch job from the command line if you wish. But for C# developers, Microsoft has also enabled the very common scenario of working interactively in a Jupyter notebook. This support includes a Jupyter kernel that utilizes C# REPL (read-eval-print loop) technology, which is quite innovative in its own right. Microsoft provides an F# kernel as well.
When you combine notebook support with Microsoft's enablement of Spark SQL UDFs (user-defined functions) written in Spark.NET, its support for .NET DataFrames, and its implementation abstraction over Apache Arrow RecordBatch objects, you can see that Microsoft has worked hard not only to bring Spark into the .NET world, but also to bring .NET into each of several modes of Spark programming. It makes things perform well, too: Apache Arrow supports sharing columnar data in memory, eliminating the cost of converting the data to and from different formats in order to process it.
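The UDF support in particular shows how naturally the paradigms blend: a Spark UDF is just a .NET lambda. A hedged sketch, with column names and data of my own invention rather than from Microsoft's samples:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class UdfSample
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();

        // A tiny illustrative DataFrame with one string column.
        DataFrame df = spark.Sql(
            "SELECT * FROM VALUES ('alice'), ('bob') AS t(name)");

        // A plain .NET lambda, wrapped as a Spark SQL UDF.
        Func<Column, Column> toUpper = Udf<string, string>(s => s.ToUpper());

        // Applied column-wise, like any built-in Spark function.
        df.Select(toUpper(df["name"]).Alias("name_upper")).Show();
    }
}
```

Under the covers, the worker serializes rows to the .NET process (via Arrow batches where applicable), runs your lambda, and streams results back; from the developer's seat it's just ordinary C#.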
What’s the point?
Experienced Spark developers are unlikely to switch from, say, Python to C# to do their jobs, and Microsoft is under no illusions about that. But the number of lines of .NET code out there, written over the last 20 years, is staggering. Bringing even a small fraction of that code into the open source big data world has a lot of value. So, too, does bringing legions of .NET developers into the world of analyzing high-volume data sitting in data lakes, as well as the streaming data and machine learning applications that Spark enables.
In other words, Microsoft's goal here is to bring the worlds of enterprise software development, analytics and data science together. Blending these communities, use cases and skill sets, rather than leaving them in separate silos, is logical and commendable in its own right. But more important, if we are serious about data-driven decision making, pervasive data culture and digital transformation, uniting these communities and sub-disciplines must happen; it is critical, not discretionary.
What's more, Microsoft is integrating the communities and technologies by elegantly blending their paradigms, rather than subordinating one to the other. That subtlety gives practitioners in each community a portal into the wonders of the other, rather than just smooshing the two together in a careless way that would produce a worst-of-both-worlds result. The pragmatism and platform openness that Satya Nadella has instilled at Microsoft has made its way down to the developer framework level. There is nothing but upside in it.