Volume 31 Number 12
Big Data Development Made Easy
By Omid Afnan; 2016
There’s a big buzz around the concept of Big Data in business discussions today. Organizations as diverse as power utilities, medical research firms, and charitable organizations are demonstrating the use of large-scale data to discover patterns, deduce relationships, and predict outcomes. Online services, and the businesses operating them, were the early development grounds for Big Data and remain its most eager adopters. Scenarios such as product recommendation, fraud detection, failure prediction and sentiment analysis have become mainstays of the Big Data revolution.
Big Data is often defined as having the characteristics of the “three V’s”—large volume, high velocity and variety. While massive volume is the common association for Big Data, the need to take action on data arriving in near-real time and dealing with a large and changing variety in format and structure are equally defining for Big Data. In fact, any one of these characteristics present in data can be enough to create a Big Data scenario with the others often following soon after. The desire—and increasingly the necessity—to derive value from data that has the three V’s is the motivation behind the increased interest in Big Data.
While data science, statistical analysis, and machine learning have all enjoyed renewed attention and growth, the use of Big Data has truly been unlocked by underlying advances in distributed computing. This includes the development of software stacks that can stitch together computing clusters from inexpensive commodity hardware. Associated compilers and job schedulers make it possible to distribute a computing workload across such clusters. This puts the computations close to where the data is stored and aggregates many servers to get higher performance. These big data platforms were developed by companies such as Microsoft, Yahoo and Google for internal use, but have become available for public use through platforms like Azure Data Lake (ADL), Hadoop and Spark.
It comes as no surprise, then, that getting started with Big Data means an investment in building up a data platform based on these new technologies. At the enterprise system level, this means the introduction of a “data lake” in addition to the traditional data warehouse. The data lake is based on the concept of storing a wide variety of data types on a very large scale and in their original formats. Unlike the data warehouse model, data is first captured without any cleaning or formatting, and is used with a late-binding schema. This kind of architecture requires the adoption of the new software platforms, ADL, Hadoop or Spark. For the data developer, it means having to learn new programming models and languages while dealing with more complex execution and debugging situations.
Fortunately, Big Data systems have already come a long way. Let’s say you want to build a product recommendation system with information collected from the online shopping site you operate. Your plan is to send e-mails to shoppers to promote new products based on previous shopping patterns. You might also want to recommend products while the customer is shopping on the site based on what other customers did. Both scenarios fit well with Big Data techniques. Until recently you’d have to start by selecting a particular Big Data stack, designing and procuring the cluster hardware, installing and tuning the software, and then you could start to develop the code for collecting, aggregating, and processing the data you have. Based on your investment, you would likely feel locked into the particular hardware and software stack you chose.
While stacks like Hadoop and Spark have become easier to install and manage, the coming of Big Data in cloud services is an even bigger game changer. As a cloud service, ADL provides both the storage and computation power needed to solve Big Data problems. More important, one of its key principles is that Big Data development needs to be easy.
Big Data Gets Easier
ADL is a cloud-based environment that offers a Big Data query service. That means you don’t have to set up and manage any hardware. When you’re ready to do Big Data development, you simply create ADL accounts on Azure and the service takes care of allocating storage, computation and networking resources. In fact, ADL goes further and abstracts away the hardware view almost completely. ADL Storage (ADLS) works as an elastic storage system, stretching to accommodate files of arbitrary size and number. To run queries and transformation on this data the ADL Analytics (ADLA) layer allocates servers dynamically as needed for the given computations. There’s no need to build out or manage any infrastructure—simply ingest data into ADLS and run various queries with ADLA. From a developer’s viewpoint, ADL creates something that looks and behaves a lot like a server with infinite storage and compute power. That’s a powerful simplification both for setting up the environment and understanding the execution model.
The next simplification comes from the U-SQL language. U-SQL is a unique combination of SQL and C#. A declarative SQL framework is used to build a query or script. This part hides the complexity that arises from the actual distributed, parallel execution environment under the covers. You, the developer, don’t have to say how to generate a result. You specify the desired outcome and the system figures out the query plan to do it. This is the same as what happens with SQL on an RDBMS but different than other Big Data stacks where you may have to define map and reduce stages, or manage the creation of various types of worker nodes. The compiler, runtime and optimizer system in ADL creates all the steps for generating the desired data, along with the possible parallelization of executing these steps across the data.
Note that a query can actually be a very complex set of data transformations and aggregations. In fact, the work of developing Big Data programs with U-SQL is writing potentially complex scripts that re-shape data, preparing it for user interaction in BI tools or further processing by other systems such as machine-learning platforms. While many transformations can be done with the built-in constructs of a SQL language, the variety of data formats and possible transformations requires the ability of add custom code. This is where C# comes into U-SQL, letting you create logic that is customized, but can still be scaled out across the underlying parallel processing environment. U-SQL extensibility is covered in-depth in Michael Rys’ online article, “Extensibility in U-SQL Big Data Applications,” (msdn.com/magazine/mt790200), which is part of this special issue.
The addition of support in development tools such as Visual Studio provides another critical simplification: the ability to use familiar tools and interaction models to build up your Big Data code base. In the remainder of this article I’ll cover the capabilities provided for U-SQL programming in Visual Studio and how, together with the previously mentioned capabilities, they provide a major shift in the ease of use for Big Data overall.
To make Big Data development easy, you start with being able to construct programs easily. The fact that U-SQL combines two very familiar languages as its starting point makes it quite easy to learn and removes one of the barriers for adopting Big Data. In the case of development tools such as Visual Studio, it also makes it easy to reuse familiar experiences such as the built-in support for C# development and debugging. The Azure Data Lake Tools for Visual Studio plug-in (bit.ly/2dymk1N) provides the expected behavior around IntelliSense, solution management, sample code and source control integration.
Because ADL is a cloud service, the Visual Studio tools draw on the integration with Azure through the server and cloud explorers. Other experiences must be extended or changed: IntelliSense for U-SQL includes an understanding of the tables and functions defined in the cloud within ADLS, providing appropriate completions based on them. Another key set of features is found in Figure 1, which is a representation of the query execution graph that results when a query is compiled for execution in the distributed, parallel cloud environment.
Figure 1 U-SQL Job Graph Viewed in Visual Studio
When you develop a U-SQL script, you start it as you do in other languages. You create a new project where you can add U-SQL type files. You can insert code snippets into query files to get you started, or access sample code from a sample project. Once you’ve written some code, it’s natural to compile and run your code to find and fix any errors and test your logic. This then becomes the tight loop for your work until the code reaches an overall functional level.
This work loop has some complications for Big Data. Big Data code runs in Big Data clusters, so the existing situation is for developers to run their code in a cluster. This often requires the setup and maintenance of a development cluster. It can also take longer to go through the full cycle of submitting, executing and receiving results. The tight loop in this case becomes too slow. In some technologies a “one box” install of a cluster is available. This probably requires special installation tools, but it might be installable on the development machine you use.
U-SQL simplifies this situation by providing a compiler and runtime that can execute on a local machine. While the execution plan (including such things as degrees of parallelism, partitioning and so on) for a local machine and the cloud-based environment are likely to be quite different, the computation and data flow graph will be the same. So, while the performance of the local run won’t be comparable to that in the cloud service, it provides a functional debugging experience. The necessary tools for local run are installed with the Azure Data Lake Tools for Visual Studio plug-in and are available immediately. They can also be installed as a NuGet package and run from the command line for automation purposes.
The local execution environment looks like another ADLA account. You can see it in the Server Explorer window under the Data Lake Analytics section, where your cloud-based accounts are listed. On submission dialogs it’s shown as an option on the list of target accounts with the label “(Local).” This local account can have all the child nodes that real accounts have. This means you can define metadata objects such as databases, tables and views. You can also register C# libraries here to support execution of U-SQL scripts that have user-defined code. This is necessary to provide parity with the cloud environment for full testability of your U-SQL code.
The U-SQL development loop then looks very much like other languages. You can create a U-SQL project and once enough code is ready, you can compile to find syntax errors. You can also run locally through the Debug (F5) and the Run without debug (Ctrl+F5) commands in Visual Studio. This will expose runtime errors such as data-parsing problems during file ingestions (the EXTRACT command in U-SQL), a very common debug case. At any point you can switch to submitting the code to run in your ADLA account in Azure. Note that this will incur charges based on the overall time for the query.
Given that datasets for Big Data scenarios are often too big to use on a development machine, it becomes necessary to manage data for test and debugging purposes. A common way to handle this is to refer to data files by relative paths. The U-SQL compiler interprets relative paths from the root of the default storage for a given execution environment. Each ADLA account has a default storage account associated with it (you can see this in the Server Explorer window). For execution in the cloud, file paths are found in this root. When executed locally, paths are searched under the global data root directory shown under Tools | Options | Azure Data Lake.
For development, a common practice is to specify a relative path to a file that exists both locally and in the cloud. The script can then be run without modification both locally or as submitted to ADLA. In the case that the input data already exists in Azure, you can download a portion of the file. The ADL experience in Visual Studio lets you do this by navigating to the file from the Job Graph or File Explorer (from the context menu on the storage accounts in Server Explorer) and selecting the download option. If the data doesn’t yet exist, then a test file must be created. With the data in place, the local development loop can proceed as before.
Debugging with User-Defined Code
The fact that U-SQL lets you use C# to define customer code introduces additional debugging capabilities. Briefly, C# extensions must be registered in a database in the ADLA account where the related query will be executed. As mentioned earlier, this can also be done in the local run scenario using the “(Local)” account. The general case is that you create a separate C# class library project (there’s a project type for this under the ADL area) and then register it.
There’s also another easy way to define user code and have Visual Studio manage the registration for you. You’ll notice that in a U-SQL project, each file automatically has a codebehind file associated with it. This is a C# file where you can add code for simple extensions that aren’t meant to be shared with other projects. Behind the scenes, Visual Studio will manage creating a library, registering and unregistering for you during submission of the query for execution. Again, this works against a “(Local)” account, as well.
Regardless of how the user-defined code is created, it can be debugged like other C# code during execution in the local environment. Breakpoints can be set in the C# code, stack traces examined, variables watched or inspected, and so on. The Start Debugging (F5) command kicks off this capability.
Debugging at Scale
Up to this point, I’ve discussed the capabilities that let you build U-SQL code in projects, specify data sources, compile and run locally, and step through C#. If you’re thinking, “That sounds like every other language I code in,” that’s great! I mentioned that the goal here is to take an inherently complex distributed, parallel-execution environment and make it look like you’re coding for a desktop app. I’ve also talked a bit about how to manage data sources and C# extensions in your code between local and cloud environments. Now, let’s talk about something unique to the debugging of Big Data jobs at scale (that means running in the cloud).
If you’ve followed the approach of developing your code and debugging on your local machine, then at this point you’ve probably figured out any logic errors and are getting the outputs you want. In other words, your business logic should be sound. With Big Data, though, you’ll have massive amounts of data and the format of the data is likely to change. Remember that in data lake architecture, you store data in its native format and specify structure later. This means data can’t be assumed to be well-formatted until after you’ve processed it to be that way.
Now, when you run your tested code at scale in the cloud and try to process all of your data, you’ll see new data and might start to hit problems you didn’t find before. Also, your query has been compiled, optimized and distributed across possibly hundreds or thousands of nodes, each of which execute a portion of the logic on a portion of the data. But if one of those chunks of work fails in an unrecoverable way, how can you figure out what happened?
The first way that U-SQL and ADLA help you with this is to manage error reporting as you’d expect a job service would. If an exception occurs outside of user code, then the error message, accompanying stack trace, and detailed data are collected from the offending node and stored against the original job (query submission). Now, when you view the job in Visual Studio or the Azure Portal, you’ll immediately be shown the error information. No need to parse through log or stdout files to try and decrypt the error location.
An even more interesting and common case is when you have custom code and the failure happens there. For example, you’re parsing a binary file format with your own Extractor and it fails on a particular input halfway through the job execution. In this case, ADLA again does a lot of work for you. For the parts of the query execution graph that succeed, neither the code image nor the intermediate data is kept in the system. However, if a vertex (an instance of a node in the execution graph) fails with an error in the user code portion, the binary executable and the input data are kept for a period of time to allow debugging. This feature is integrated with Visual Studio, as shown in Figure 2.
Figure 2 Debugging a Failed Vertex Locally
Clicking the debug button starts a copy of the binary, resource and data files from the failed vertex to your local machine. Once the copy is downloaded, a temporary project is created and loaded in a new instance of Visual Studio. You now have the executable version of the failed node and can do all the normal debugging you might expect, such as running to the exception, putting breakpoints in the C# code, and inspecting variables. Remember that the individual vertices in the clusters used to run Big Data jobs are actually commodity servers and are likely to be similar to your development machine. Because you also have a local U-SQL runtime on your machine, this capability becomes possible.
Once you have debugged the problem on your local machine, you’ll update your source code separately. The project and code shown in your debug instance are artifacts from a previously run query and any changes you make are local. If you have a registered library of C# code, then you’ll have to rebuild and update the library in ADLA. If you made changes to your U-SQL script or codebehind files, then you must update your project.
Big Data platforms have been highly specialized systems that required learning new concepts, models and technologies. Most early adopters had to go under the hood to learn the inner workings of these systems, first to set them up and then to be able to reason about programming in those environments. While the field has moved forward to the point that installing these platforms has become simpler, a bigger shift is underway. The opportunity today is to leapfrog to a Big Data service model in the cloud where the system abstraction is at a higher level. While this makes the setup of a Big Data environment trivial, an even more impactful outcome is that the development model is immensely simplified.
The combination of Azure Data Lake and U-SQL simplifies the execution model, programming paradigm, and tools used to develop Big Data queries and applications. This has the dual effect of enabling more developers to get started with Big Data, and for developers to build more complex Big Data solutions more quickly. ADL is supported by a large set of analytics services in Azure that support workflow management, data movement, business intelligence visualization and more. While U-SQL is the best place to start for Big Data application development, look to these other services as your needs grow.
Omid Afnan is a principal program manager in the Azure Big Data team working on implementations of distributed computation systems and related developer toolchains. He lives and works in China. Reach him at email@example.com.
Thanks to the following Microsoft technical expert for reviewing this article: Yifung Lin