Volume 32 Number 10
Exploring Azure Machine Learning Studio
By Frank La Vigne | October 2017
Longtime readers of my blog (franksworld.com) have noticed over the past 18 months a marked shift in content toward data science, artificial intelligence (AI) and machine learning (ML). So, it’s only fitting that this column also shifts gears and focuses on the revolution happening all around us: The AI Revolution. Not too long ago, AI was the stuff of science fiction. Now we can add intelligence to virtually any app or Web site. In fact, many of your favorite apps and Web sites already employ some form of AI. Cortana and many other voice assistants are obvious examples of AI in the UI layer. Less obvious, but no less important, are intelligent algorithms that optimize resources, recommend movies you might like and determine what you see in your social media feeds.
Over the last 10 years, the focus of many developer and IT organizations was the capture and storage of Big Data. During that time, the notion of what counts as a “large” database grew by orders of magnitude, from terabytes to petabytes. Now, in 2017, the rush is on to find insights, trends and predictions of the future based on the information buried in these large data stores. Combined with recent advancements in AI research, cloud-based analytics tools and ML algorithms, these large data stores can not only be mined, but monetized.
With the cloud providing affordable computing power and storage, even small businesses can predict the future by anticipating customer behavior and identifying trends at the individual level and at scale. Organizations that can discover and deploy actionable predictive models before the competition does will dominate their market segment. Properly leveraged, AI can add serious value to any business. As Peter Drucker put it, “The best way to predict the future is to create it.” In that spirit, here’s a deeper look at AI and ML.
Getting the Terms Right
Before getting into an AI project, it’s important to define exactly what “artificial intelligence” means. This matters because future columns will rely on a common set of meanings for the terms associated with this field. A quick Internet search for the term “artificial intelligence” yields a wide variety of results, from chatbots and computer vision systems to debates on the nature of consciousness itself. While there’s no firm consensus as to what the term means, most experts generally agree on the basic phrases listed in Figure 1.
Figure 1 Generally Accepted Artificial Intelligence Phrases
The Power of the Cloud in the Palm of Your Hand
Around the middle of the last decade, I was an early adopter and proponent of the Tablet PC platform. As I would deliver presentations to various user groups and speak at conferences, one criticism would inevitably come up: lack of high-performance hardware. The reason for the lack of serious computing power on these devices had more to do with the constraints of making a tablet device viable: namely weight, battery life and cost. Many would object that while they admired the tablet PC form factor, they needed a device with a more powerful CPU. Fast forward 10 years and the limitations of cost, battery life and network connectivity have largely gone away. Any device within range of Wi-Fi and 4G networks can now connect to limitless computing services and storage resources in the cloud.
AI in the Cloud
As a developer, you have choices in terms of what types of intelligent services to consume. If an app or Web site requires image recognition or natural language processing, then Microsoft has made several services available as part of the Microsoft Cognitive Services, a set of APIs, SDKs and services that expands on Microsoft’s evolving portfolio of ML APIs. They enable you to easily add intelligent features—such as emotion and video detection; facial, speech and vision recognition; and speech and language understanding—to your applications. Microsoft’s vision is for more personal computing experiences and enhanced productivity aided by systems that increasingly can see, hear, speak, understand and even begin to reason.
These services expose models that have been trained on millions of sample images. In a video on Channel9, Microsoft’s Anna Roth briefly explains the process of training the algorithms with a wide variety of sample data (bit.ly/2x7u1D4). The models exposed by the many Cognitive Services APIs were trained over years and millions of data points by dozens of engineers and researchers. That’s why they’re so good at what they do. When your app or Web site has a problem that one of the Cognitive Services APIs can solve, use that API. The list of Microsoft Cognitive Services offerings continues to grow. For an updated list, go to bit.ly/2vGWcuN.
However, when your data has a more limited scope inside a specific domain, then you’ll have to create your own models from your own data. While that might sound intimidating or impractical, the process is quite straightforward with another cloud service provided by Microsoft: Azure ML Studio.
Azure ML Studio
Azure ML Studio is an online service that makes ML and building predictive models approachable and straightforward. For the most part, there’s no code involved. Users drag around various modules representing actions and algorithms in an interface resembling Visio. For maximum flexibility and extensibility, there are modules for inserting R and Python code for cases where the built-in models don’t suffice or for using existing code.
Getting Started Open a browser and head over to studio.azureml.net. If you’ve never used Azure ML Studio before, click on the Sign Up button. You may choose either the Guest Workspace or Free Workspace option in the dialog that follows (see Figure 2). For the purposes of this article, I recommend using the free workspace, as you’ll have the chance to save your projects and expose your models via Web services.
Figure 2 Pricing Tiers of Azure Machine Learning Studio
If you already have a Microsoft account, click on Sign In under the Free Workspace option. If this is the first time you’ve logged into Azure ML Studio, you’ll see an empty list of experiments. As ML is considered a subset of data science, the term “experiment” is used here.
Creating an Experiment The best way to examine the power of Azure ML Studio is to start with a sample experiment. Fortunately, there are a number of pre-built samples provided by Microsoft. First, click on the New button on the lower-left corner of the browser window. In the resulting dialog, type flight into the textbox. The screen should look similar to Figure 3. Clicking on the View in Gallery link will bring up a page detailing information about the experiment (bit.ly/2i9Q61i). Move the mouse over the tile and click the Open in Studio button to open the experiment to start working on it.
Figure 3 The Flight Delay Prediction Sample Experiment
This experiment runs what’s known as a Binary Classification, which means the ML algorithm will place each record in the dataset into one of two categories. In this case, the two categories are whether or not the flight will be delayed.
Once the experiment loads, your screen will look something like Figure 4.
Figure 4 The Flight Delay Experiment Opened in Azure Machine Learning Studio
While this might look intimidating at first, what’s going on is actually quite simple. Zoom in using the scroll wheel on the mouse or using the zoom controller on the lower left of the workspace canvas.
Navigating the Workspace Canvas Azure ML Studio has built-in navigation controls to explore and manipulate the view of the workspace canvas. The navigation controls from left to right are: a Mini Map of the canvas, zoom slider control, zoom to actual size button, a zoom to fit button and a pan toggle button. You may have already noticed that clicking and dragging around the canvas selects modules and does not move the canvas around. Clicking the pan toggle button will toggle the mode from selecting to panning. When pan mode is activated, the button appears blue.
Modules The workspace canvas contains modules linked together. Each module represents either a data set, a manipulation of data or an algorithm. To get an idea of the contents of the source data set, select the Flight Delays Data module, right-click on the 1 (the module’s output port), and click Visualize on the context menu (see Figure 5).
Figure 5 Flight Delays Data Module Context Menu
In the resulting dialog, the contents of the data set appear in a grid. Click on one of the fields and expand the Statistics and Visualization panels. In Figure 6, the Carrier field is chosen and between the Statistics and Visualization panel, the basic shape of the data can be discovered. Click the X in the upper-right corner of the dialog to close out this view.
Figure 6 Visualizing the Raw Data
Repeat the previous steps to visualize the structure and content of the Weather Dataset.
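The kind of summary the Statistics panel shows for a field can be approximated in a few lines of Python. This is only a minimal sketch, not what Azure ML Studio runs internally; the `summarize_column` helper and the sample carrier codes are hypothetical and are not taken from the real data set.

```python
from collections import Counter

def summarize_column(values):
    """Return basic descriptive statistics for a categorical column."""
    # Treat None and empty strings as missing values.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical sample of carrier codes (assumption, for illustration only).
carriers = ["DL", "AA", "DL", "UA", "", "DL", "AA", None]
summary = summarize_column(carriers)
print(summary)
```

Counting distinct and missing values first is a cheap way to spot fields that will need cleaning before they reach an algorithm.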
Manipulating the Raw Data Sets Note that a number of modules modify the data, and that there are two branches: one for the Flight Delays data set and the other for the Weather data set. The data in each data set needs to be cleaned before it can be merged and analyzed by the ML algorithm. Notice that in the steps attached to the Weather data set, there’s even a module that executes R code. Select the Execute R Script module and, as before, right-click on the 1. The context menu has a Visualize option, yet it’s grayed out, as is every other option. This means that the experiment hasn’t been run. On the lower portion of the screen, click the Run button and choose Run to run the experiment. In a few moments, the experiment will finish. Depending on server load, this experiment may take longer when using the free service. Now select the Execute R Script module again, right-click the 1 and choose Visualize. The visualization dialog appears, displaying the output of the module. In fact, now that the entire experiment has run, every module’s data can be visualized. By exploring the modules and visualizing the data at each step, you can track the data transformations throughout the process. However, some modules’ visualizations look different from others.
Machine Learning As mentioned previously, this experiment classifies flights into one of two categories: delayed or not delayed. The experiment first cleans the data and shapes the data into a format and structure with which an ML algorithm can work. Data scientists often refer to this process as “data wrangling” and it can represent the majority of effort in any kind of data science project.
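To make “data wrangling” concrete, here is a rough Python sketch of the two things the branches of the experiment do before the algorithm sees anything: clean each data set, then merge them. The field names, the sample rows and the choice of join key (airport and hour) are assumptions made for illustration; they are not the sample experiment’s actual schema or logic.

```python
# Hypothetical rows; field names are illustrative, not the sample's real schema.
flights = [
    {"airport_id": 12264, "hour": 9, "dep_delay": "12"},
    {"airport_id": 12264, "hour": 10, "dep_delay": ""},   # missing value
    {"airport_id": 10397, "hour": 9, "dep_delay": "-3"},
]
weather = [
    {"airport_id": 12264, "hour": 9, "wind_speed": 14},
    {"airport_id": 10397, "hour": 9, "wind_speed": 5},
]

def clean_flights(rows):
    """Drop rows with a missing delay and convert the delay to an integer."""
    cleaned = []
    for row in rows:
        if row["dep_delay"] == "":
            continue
        cleaned.append({**row, "dep_delay": int(row["dep_delay"])})
    return cleaned

def join_on_airport_hour(flight_rows, weather_rows):
    """Inner join the two data sets on (airport_id, hour)."""
    weather_by_key = {(w["airport_id"], w["hour"]): w for w in weather_rows}
    joined = []
    for f in flight_rows:
        match = weather_by_key.get((f["airport_id"], f["hour"]))
        if match is not None:
            joined.append({**f, "wind_speed": match["wind_speed"]})
    return joined

merged = join_on_airport_hour(clean_flights(flights), weather)
for row in merged:
    print(row)
```

Even in this toy version, most of the code deals with missing values, type conversion and matching keys, which is why wrangling tends to dominate the effort.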
Generally speaking, once the data has been shaped and cleaned, an ML experiment follows these steps: split the data into a training set and a test set, pick an algorithm to examine the data, and score the results. This experiment runs the data through two algorithms: Two-Class Boosted Decision Tree and Two-Class Logistic Regression. Each algorithm processes the data in a different way, and certain algorithms are better suited to certain data sets and problems than others. This is where the experimentation comes into play.
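The split, train and score steps can be sketched in plain Python. The code below is an illustration under stated assumptions, not the implementation behind the Two-Class Logistic Regression module: it trains a small logistic regression with stochastic gradient descent on made-up data, and the feature names, the 70/30 split and the hyperparameters are all hypothetical.

```python
import math
import random

def make_rows(count, seed=0):
    """Generate hypothetical (features, label) rows; label 1 means 'delayed'."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        wind = rng.random()        # scaled wind speed, 0..1 (made up)
        congestion = rng.random()  # scaled airport congestion, 0..1 (made up)
        label = 1 if wind + congestion > 1.0 else 0
        rows.append(([wind, congestion], label))
    return rows

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows, then split them into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def predict_probability(model, features):
    """Return the model's estimated probability that the label is 1."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, learning_rate=0.5, epochs=100):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            error = predict_probability((weights, bias), features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the actual label."""
    correct = sum(
        1 for features, label in rows
        if (predict_probability(model, features) >= 0.5) == (label == 1)
    )
    return correct / len(rows)

rows = make_rows(300)
train_rows, test_rows = split_rows(rows)
model = train(train_rows)
test_accuracy = accuracy(model, test_rows)
print(f"training rows: {len(train_rows)}, test rows: {len(test_rows)}")
print(f"accuracy on held-out test rows: {test_accuracy:.3f}")
```

The model is scored only on rows it never saw during training, so the number measures how well it generalizes, not how well it memorized the training set.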
When there’s more than one algorithm in an experiment, then the models can be evaluated against one another with an Evaluate Model module. Select the Evaluate Model module, right-click on 1 and select Visualize in the context menu. The dialog will look something like Figure 7.
Figure 7 The Evaluate Model Visualization Dialog
The Evaluate Model visualization dialog contains information vital to understanding the performance of the ML models just created. The blue line represents the model created via the Two-Class Boosted Decision Tree algorithm and the red line represents the model created by the Two-Class Logistic Regression algorithm. The blue model, selected by default, has an accuracy rating of 0.806, meaning it was correct 80.6 percent of the time. Click on the red square in the chart legend to see the results from the Two-Class Logistic Regression model. Its accuracy was slightly better at 81.7 percent. Also note the confusion matrix numbers for both models. A confusion matrix is a measure of the quality of a classification model. It counts the number of times a record was correctly flagged as positive or negative, as well as how often the model was wrong with “false positives” and “false negatives.”
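The four counts described above, and the accuracy figure derived from them, are simple to compute by hand. The following sketch uses made-up labels (1 = delayed, 0 = not delayed), not the experiment’s actual results, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1   # predicted delayed, was delayed
        elif p == 1 and a == 0:
            counts["fp"] += 1   # predicted delayed, was on time
        elif p == 0 and a == 0:
            counts["tn"] += 1   # predicted on time, was on time
        else:
            counts["fn"] += 1   # predicted on time, was delayed
    return counts

def accuracy(counts):
    """Share of all records that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Made-up labels for ten flights (assumption, for illustration only).
actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_matrix(actual, predicted)
print(counts)            # {'tp': 3, 'fp': 1, 'tn': 5, 'fn': 1}
print(accuracy(counts))  # 0.8
```

Accuracy alone hides which kind of mistake a model makes; the separate false positive and false negative counts show whether it tends to raise false alarms or to miss real delays.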
The main graphical feature on this screen is the Receiver Operating Characteristic, or ROC, curve. A full explanation of this metric warrants an article of its own. More information about this metric can be found on Wikipedia at bit.ly/2fPKJnf. Assuming that random guessing would be correct about half of the time, the ROC chart would display such a model as a straight line at a 45-degree angle. Given that both models are more than 80 percent accurate, both perform significantly better than random guessing. In other words, the computer has learned to predict outcomes with a fair bit of accuracy.
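For readers who want to see where the curve comes from, here is a small sketch that builds ROC points from scored records and measures the area under them. It assumes every record has a distinct score and uses made-up scores; it is not how Azure ML Studio draws its chart, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one record at a time, from the highest score down.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC points (1.0 is perfect, 0.5 is the diagonal)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up scored records: 1 = delayed, score = predicted probability of delay.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(round(auc, 4))  # 0.8889
```

A model that ranks delayed flights no better than chance traces the 45-degree diagonal and has an area of about 0.5; the more the curve bows toward the upper-left corner, the closer the area gets to 1.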
More Data Manipulation The remainder of the workflow involves trimming down the number of columns in the dataset from 31 to six. In order to make the data more readable, the fields OriginAirportID and DestAirportID are joined to a table with city, state, and airport names. That way 12264 becomes more readable as Washington Dulles International.
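In code, this last stage amounts to a lookup join followed by a column selection. The sketch below assumes a tiny lookup table and a made-up scored row; only the OriginAirportID and DestAirportID field names and the 12264 entry come from the article, and every other name and value is hypothetical.

```python
# Hypothetical lookup table; only the 12264 entry is taken from the article.
AIRPORT_NAMES = {
    12264: "Washington Dulles International",
}

def add_airport_names(row, lookup):
    """Return a copy of the row with readable origin and destination names."""
    named = dict(row)
    named["OriginAirportName"] = lookup.get(row["OriginAirportID"], "Unknown")
    named["DestAirportName"] = lookup.get(row["DestAirportID"], "Unknown")
    return named

def select_columns(row, columns):
    """Keep only the listed columns, in the listed order."""
    return {name: row[name] for name in columns}

# A made-up scored row standing in for the much wider real one.
scored_row = {
    "OriginAirportID": 12264,
    "DestAirportID": 99999,      # placeholder ID with no lookup entry
    "Carrier": "DL",
    "Month": 10,
    "ScoredLabel": 1,
    "ScoredProbability": 0.83,
}

readable = select_columns(
    add_airport_names(scored_row, AIRPORT_NAMES),
    ["Carrier", "OriginAirportName", "DestAirportName",
     "ScoredLabel", "ScoredProbability"],
)
print(readable)
```

Falling back to "Unknown" keeps rows with unmatched IDs visible instead of silently dropping them, which is one reasonable choice among several.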
Some of the terms that Azure ML Studio uses come from statistics and are generally outside the usual vocabulary of most developers. In truth, that’s where the bulk of the learning curve of Azure ML Studio lies: learning the jargon of data science.
I’ve barely scratched the surface of what can be built with Azure ML Studio. The next step will be making this predictive model accessible to Web sites and apps using the built-in support for Web services. In future columns, I’ll explore other aspects of AI both inside and outside Azure ML Studio.
While machine learning, artificial intelligence, and data science in general might seem intimidating to the average developer or data engineer, the overall goal of this column is to help you discover that quite the opposite is true.
Frank La Vigne is a data scientist at Wintellect and co-host of the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following Microsoft technical experts for reviewing this article: Rachel Appel and Andy Leonard.