Volume 32 Number 12
Exploring the Azure Machine Learning Workbench
By Frank La Vigne | December 2017
In the last two columns, I explored the features and services provided by Azure Machine Learning Studio. In September 2017, Microsoft announced a new suite of tools for doing machine learning (ML) on Azure. The cornerstone of these new tools is Azure Machine Learning Workbench. However, what could be better for doing ML than the simple drag-and-drop interface of Machine Learning Studio?
Machine Learning Studio is an ideal tool for creating ML models without having to write code, but it falls short in several areas. First and foremost, the tool’s simplicity requires a “black box” approach. There’s no visibility into the algorithms being used, and only parameters that are exposed in the UI can be manipulated. Source control is effectively nonexistent. While the Run History makes it possible to access previous versions of a project, Machine Learning Studio lacks integration with conventional source control tools such as Git. Finally, managing the deployment of models exposed as Web services poses unique challenges at scale.
Since the Machine Learning Studio launch in 2015, the team has been gathering feedback from users large and small. They’ve taken that feedback and created something quite unique with Machine Learning Workbench—a tool that satisfies the needs of ML professionals of all levels.
Create Machine Learning Accounts
Machine Learning Workbench requires the provisioning of Machine Learning accounts in Azure. Go to the Azure Portal (portal.azure.com) and click the New button in the upper-left corner. In the textbox that appears type “Machine Learning Experimentation” and click on the first result. On the following blade, click on the Create button at the bottom of the screen.
Next, a blade appears that asks for some information used to create the service. You may choose any name that meets the validation requirements. Please refer to Figure 1 to see the values I chose. Currently, this service is only available in three geographies. As I’m located on the East Coast of the United States, I selected East US2. Ensure that the checkbox next to Create model management account is checked. Next, choose a pricing tier. For the purposes of this article, the free DevTest tier will suffice. Click on the DevTest tier and click Select. Last, click on the Create button.
Figure 1 Options Selected to Create a Machine Learning Experimentation Service
While the cloud services initialize, it’s a good time to set up your local computer.
Installing Machine Learning Workbench
While Machine Learning Workbench is available on Mac and Windows, this article follows the path for a Windows install. For further details and the system requirements to install on a Mac, please refer to Microsoft’s documentation on the topic at bit.ly/2zYrA7X.
If you haven’t already installed Docker on your system, now would be a good time to do so. While this article will not make use of it, future articles on Machine Learning Workbench will. Machine Learning Workbench uses Docker to run code locally in different frameworks. Some projects have Python and PySpark containers available by default, while others just have Python containers. Available container configurations can be managed by configuration files.
Machine Learning Workbench for Windows can be downloaded from aka.ms/azureml-wb-msi. Once the installer downloads, run it and follow the directions to install the tool. When the install is complete, launch the program. Enter the same account credentials used to create the Machine Learning Experimentation Service in Azure. After a moment, the screen updates to show the Get Started window. Note in the Projects area to the left that the name of the workspace should match the value entered previously into Azure. Also take note that in the bottom left-hand corner of the screen, a teal button appears with the initials of the active account. Click on that button to reveal a fly-out menu where you can see the current account and sign out.
Creating and Running a Project
To create a new project, either choose New Project from the File menu or click the plus sign in the Projects pane. In the following dialog box, shown in Figure 2, enter SimpleLinearRegression as the project name, choose a directory for the project files, and select Simple Linear Regression from the available project templates. Project templates are a great resource for learning how to create an ML project based on various types of models and external libraries. Click the Create button to create the project.
Figure 2 Create New Project Dialog
Once the project is created, the project dashboard opens. Each project in Machine Learning Workbench opens to its project dashboard. The dashboard contains instructions and information about the project and functions very much like a readme.md file in a GitHub repository. In fact, the contents of the dashboard file are located in the readme.md file in the root directory of the project.
Choose Open Command Prompt from the File menu. In the command prompt type:
conda install matplotlib
Follow the instructions on screen if prompted to install matplotlib, a graphing library for Python. Once the install completes, close the command prompt window to return to the Machine Learning Workbench. Take note of the Run button and the controls to its left. Make sure that local and linear_reg.py are selected in the two dropdowns and the Arguments text box is empty. Click Run.
Immediately after clicking Run, a Jobs pane appears with one entry in the list. After a few moments, the job will complete.
Exploring the Interface - Jobs
To the left of the screen, there’s a vertical toolbar. Click on the fourth icon from the top, which resembles a clock with a counterclockwise arrow around it. In this case, the linear_reg.py file failed to run twice before executing successfully, as seen in Figure 3. Note that Machine Learning Workbench tracks the jobs executed through it. At the bottom of the window under the STATUS header, click the word Completed to view the properties of the run.
Figure 3 The Jobs Run History
Take some time to explore the various portions of this screen: Run properties, Outputs, Visualization and Logs. Run properties displays data about the properties of this particular run. If the Python script creates output files and places them in an outputs subfolder, they’ll appear in the list under the Outputs section. In this run, there’s one file: lin.png.
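As a rough illustration of the outputs convention described above, the following sketch writes a small results file into an outputs subfolder using only the Python standard library. The function name save_result and the file name summary.txt are hypothetical and are not part of the project template, whose script produces lin.png instead.

```python
import os


def save_result(text, filename="summary.txt", root="."):
    """Write text to <root>/outputs/<filename> and return the file path.

    save_result is a hypothetical helper for illustration only.
    """
    out_dir = os.path.join(root, "outputs")
    # Create the folder if it is missing; do nothing if it already exists.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

Calling save_result("slope=2.0, intercept=1.0\n") from a script in the project root would place summary.txt in the outputs subfolder, which is where the article says Workbench looks for files to list under Outputs.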
Further down the screen is the Visualization section, which displays any images created by the Python script. For this script, there’s one that plots the linear regression from the data files included in the project. This is the same file shown in the Outputs section: lin.png. Last, the Logs section lists all the logs associated with this run.
Exploring the Interface - Files
So far, I’ve run a Python script to perform a linear regression and produce an output visualization, but I haven’t shown off any code. Let’s take a look at the code behind the job that was just run. On the toolbar on the left-hand side of the screen, click on the folder icon immediately below the Jobs icon. In the panel that appears, there’s a list of files associated with this project: three directories and five files. Click on linear_reg.py to view the contents of the script that was run earlier.
For readers unfamiliar with Python, the code first imports various libraries to do advanced mathematics with arrays (NumPy) and plot graphs (matplotlib). Next, the code loads the data from a local file, data.csv, into an array. Then the code converts the array into a NumPy array, which enables NumPy to perform matrix multiplication. After some calculations, the code computes a linear regression formula that approximates the data and prints out its findings. In lines 35 through 39, the code uses matplotlib to render a graph of the data and plot a line representing the linear regression. The rest of the code prints out information pertaining to the error and accuracy of the linear regression model.
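To make the idea concrete, here is a minimal sketch of a least-squares line fit written with only the Python standard library. It is not the template's linear_reg.py: it uses the scalar closed-form formulas rather than NumPy matrix operations, the function names fit_line and mean_squared_error are hypothetical, and the sample rows are made up to mimic the two-numbers-per-row shape of data.csv.

```python
import csv
import io


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is the sum of co-deviations of x and y divided by the
    # sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(points, slope, intercept):
    """Average squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Made-up sample rows (not the project's data): two numbers per row.
SAMPLE = "1,3\n2,5\n3,7\n4,9\n"

rows = csv.reader(io.StringIO(SAMPLE))
points = [(float(x), float(y)) for x, y in rows]
slope, intercept = fit_line(points)
print(slope, intercept, mean_squared_error(points, slope, intercept))
```

The sample points all lie on the line y = 2x + 1, so the fit recovers a slope of 2 and an intercept of 1 with zero error. Real data would leave a nonzero error, which is the kind of figure the script reports at the end.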
Once you’ve had a chance to explore the script file, click on the data.csv file to view the data file. Note that the file immediately loads and displays a .csv file with two numbers per row separated by a comma.
Data Munging Tools Built In
The second icon from the top on the toolbar brings up the data pane. Clicking on it reveals a tree control with two empty nodes. The linear_reg.py file managed the data on its own and didn’t make use of the Machine Learning Workbench advanced data munging tools. Now would be a good time to explore creating a data source through Machine Learning Workbench. Click on the plus sign in the data pane and click on Add New Data Source.
In the following dialog box, click on the box labelled File(s)/Directory and then click Next. Browse to the data.csv file in the project directory and click Open. Click Next once more. Note that Workbench has already detected that this is a comma-separated file. Leave all the settings as is and click Next. In this Data Types step, notice that Machine Learning Workbench automatically detected the formats of each data field. Click Next to view the Sampling step.
For this small data set, there’s no need to change anything. For larger data sets, this screen allows you to craft a custom strategy to load data selectively. This way, when dealing with a multi-petabyte data set, you can selectively deal with a small portion of the data to define rules on how to transform the data. Click Next and leave the default selection to not include the path column in the final data set. Click Finish. The screen should look like Figure 4.
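The general idea of working on a small portion of a very large data set can be sketched with reservoir sampling, which reads the input once and keeps only a fixed number of rows in memory. This is an illustration of the concept under my own assumptions, not a description of how Machine Learning Workbench implements its sampling strategies; the function name sample_rows is hypothetical.

```python
import random


def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so the input is read once and
    only k rows are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (index + 1),
            # replacing a randomly chosen existing entry.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = row
    return reservoir
```

Because the function accepts any iterable, it could be pointed at an open file, for example sample_rows(open("data.csv"), 100), to draw 100 lines without loading the whole file.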
Figure 4 The Resulting Data Set
Note that the pane on the right side displays the steps just taken in a list. Clicking on any one of the items will reveal an option to edit the action. In more advanced scenarios, with more steps taken to munge the data, this feature allows for changing and editing the steps.
At the top of the screen, there are two toolbar buttons: Metrics and Prepare. Click on Metrics. In a few seconds, Machine Learning Workbench automatically generates a series of data visualizations for each column. When working with a new data set, it can be helpful to gather some basic information about each field. The metrics view creates histograms and basic descriptive statistics about every field in the data set. You can customize which metrics are displayed by clicking on the Choose Metric dropdown list. By default, all metrics are selected. Click on the newly visible Data toolbar button to return to the previous screen.
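For a sense of what such per-column metrics involve, the following sketch computes a few descriptive statistics and an equal-width histogram for one numeric column using the standard library. It only approximates the kind of summary the Metrics view shows; the function names describe and histogram, the choice of statistics, and the binning rule are my assumptions rather than Workbench's actual behavior.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width buckets."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for value in values:
        # The maximum value belongs in the last bucket, not one past it.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts
```

Running describe and histogram on each column of a data set gives a quick first look at its range, center and spread before any modeling begins.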
Now, click on the Prepare button. In the Prepare dialog, leave the top dropdown list at its default of New Data Preparation Package and enter SimpleDataPrep into the Data Preparation Package Name. Click OK. Now the Data pane has two entries.
Right-click on the SimpleDataPrep entry and click Generate Data Access Code File. The newly created file opens, displaying stub code that loads a referenced package and returns a Pandas DataFrame. In a PySpark environment, this call returns a Spark DataFrame. DataFrames are a common data structure in Python. Using the DataFrame creation code generated by Machine Learning Workbench can save a great deal of time.
At first, I was skeptical that Machine Learning Studio could be improved upon. However, the more I use Machine Learning Workbench, the more impressed I am with it. It not only provides mechanisms to import data, it also auto-generates a package that can clean data and expose it as a DataFrame, a common data format for Python. By importing data this way, data scientists can save a great deal of time. Furthermore, as you saw, the data import tool is not difficult to use. Machine Learning Workbench also allows for the use of any Python library, such as matplotlib.
This article barely scratches the surface of what Machine Learning Workbench is capable of. In future articles, I’ll explore this great tool even more, for instance by configuring Machine Learning Workbench to work with virtual machines in Azure for fast processing, transforming data “by example,” and working with Jupyter Notebooks inside Machine Learning Workbench.
Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical experts for reviewing this article: Andy Leonard (EnterpriseDNA), Hai Ning (Microsoft), Jonathan Wood (Wintellect)
Andy Leonard is a self-described Data Philosopher who literally lives in Farmville. No, not the game, the actual town. He slings code and manages data. He has a rockin' beard and gets stuff done. Andy Leonard is an author and engineer who enjoys building and automating data integration solutions.
Jonathan has been interested in software and technology since his TechTV days in high school. He’s been working with .NET and C# for over eight years with a primary focus on Web and Mobile Computing. His most recent passion is anything data related. Jonathan is a software consultant and developer at Wintellect concentrating on Data Science, Data Visualizations, Machine Learning, Statistics, and any and all data topics he can get his hands on.