January 2018

Volume 33 Number 1

[Artificially Intelligent]

Creating Models in Azure ML Workbench

By Frank La Vigne | January 2018

In my last column, I introduced Azure Machine Learning Workbench (Azure ML Workbench), a new tool for professional data scientists and machine learning (ML) practitioners. This stands in stark contrast to Azure Machine Learning Studio (Azure ML Studio), which is a tool primarily geared toward beginners. However, that doesn’t mean Azure ML Workbench is only for experienced data scientists. Intermediate and even entry-level data scientists can also benefit from the tools provided in Azure ML Workbench.

Loading the Iris Classifier Project Template

As noted in my previous column, Azure ML Workbench provides numerous project templates (I used the Linear Regression template). This time I’ll utilize the Classifying Iris project template to demonstrate even more features of Azure ML Workbench. If you haven’t already installed Azure ML Workbench, please refer to the documentation at bit.ly/2j2NVdH.

The Iris data set is a multi-variate data set that consists of 50 samples from each of three species of Iris. Four features were measured from each sample: the length and the width of the sepals and petals. Based on the combination of these four features, the species of Iris can be determined. It is an oft-used sample data set in data science and ML. 
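For a quick feel of the data, scikit-learn ships its own copy of the Iris data set. The Workbench template uses a bundled iris.csv instead; loading from sklearn.datasets here is simply a convenient, self-contained way to inspect the same 150 samples:

```python
# Inspect the Iris data set via scikit-learn's bundled copy
# (the Workbench template reads its own iris.csv instead).
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # 150 samples, 4 features each
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # the three species
```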

Open Azure ML Workbench, select Projects and click on the plus sign. In the context menu that appears, choose New Project to create a new project. Name the project IrisClassifier. Look for Classifying Iris in the Project Templates, click on it and click on the Create button (see Figure 1).

Choosing the Classifying Iris Project Template
Figure 1 Choosing the Classifying Iris Project Template

Viewing the Code

Once the project loads into Azure ML Workbench, click on the folder icon on the left side to reveal all the files included. Click on the iris_sklearn.py file to view its contents in the editor. It should look similar to Figure 2.

The iris_sklearn.py File in the Azure ML Workbench Text Editor
Figure 2 The iris_sklearn.py File in the Azure ML Workbench Text Editor

In case you were wondering, the code is in Python, a language popular with data scientists and ML practitioners. Python enjoys a diverse set of ML, scientific and plotting libraries that give the language a rich ecosystem of tools and utilities. One of these is scikit-learn, a popular ML library. Various modules of scikit-learn, referred to as sklearn in the code, are imported into the project in lines 7 through 14. While a full tutorial on the Python language falls outside the scope of this article, the syntax should be familiar to any C# developer. The focus here will be on building models with scikit-learn.
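The template's exact import list may differ slightly, but a representative set of scikit-learn imports for a classification script like iris_sklearn.py looks something like this:

```python
# Representative imports for a scikit-learn classification script;
# the template's exact list (lines 7 through 14) may differ slightly.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# scikit-learn estimators share a common fit/predict/score interface
print(hasattr(LogisticRegression(), 'fit'))
```

That shared fit/predict/score interface is why swapping one algorithm for another in scikit-learn usually requires changing only a line or two.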

Workflow of an Azure ML Workbench Project

The first step in any ML project is loading the data. The second step is often the more laborious and time-consuming one: wrangling the data. This is where Azure ML Workbench really shines. Click on the iris.csv file to see what the raw data looks like. Note that this file lacks column names. Now, click on the iris.dprep file in the file list on the left-hand side of the screen. Note the steps taken to clean the data: they add names to the columns and remove rows where the Species column is null. Click on the down arrow to the right of the Filter Species step. In the context menu that appears, click Edit to display the Filter Column dialog shown in Figure 3. The rule is set up to remove any rows where the Species column is null, and there are additional options to add extra conditions. This dialog box will be an invaluable tool in your data science projects, as data rarely comes in a clean format that’s consumable by ML algorithms.
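The same two wrangling steps can also be expressed directly in pandas. The inline rows below are a stand-in for the template's header-less iris.csv (one row deliberately has a missing Species to show the filter working):

```python
# The .dprep steps -- naming the columns and dropping null-Species
# rows -- expressed in pandas. The inline CSV stands in for iris.csv.
import io
import pandas as pd

raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,\n"  # a row with a missing Species value
)
cols = ['Sepal Length', 'Sepal Width', 'Petal Length',
        'Petal Width', 'Species']
iris = pd.read_csv(raw, header=None, names=cols)
iris = iris[iris['Species'].notnull()]  # drop the null-Species row
print(len(iris))
```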

The Filter Column Dialog Window
Figure 3 The Filter Column Dialog Window

After referencing various libraries and initializing the local environment, the code executes the iris.dprep file on line 28, which loads the data file and performs all the wrangling steps defined in it:

iris = run('iris.dprep', dataflow_idx=0, spark=False)

The output is a pandas DataFrame with the cleaned data. A pandas DataFrame is a two-dimensional data structure similar to a table in a SQL database or in a spreadsheet. You can read more about DataFrames at bit.ly/2BlWl6K.

Now that the data has been cleaned and loaded, it’s time to separate it into features and labels. Features are the fields used to make a prediction; the label is the value being predicted. In this case, given the widths and lengths of the sepals and petals, the algorithm will predict which species of Iris a plant belongs to. The features, then, are Sepal Length, Sepal Width, Petal Length and Petal Width, and the label is the Species. Line 32 in the iris_sklearn.py file separates the DataFrame into two arrays: X for the features and Y for the label, as shown here (strictly speaking, X and Y are NumPy arrays, a data structure from the NumPy library):

X, Y = iris[['Sepal Length', 'Sepal Width', 'Petal Length', 
  'Petal Width']].values, iris['Species'].values

Once the data is separated into features and a label, it’s time to split it into a training set and a test set. The following line of code randomly reserves 35 percent of the rows for the test set (X_test and Y_test); the remaining 65 percent become the training set (X_train and Y_train):

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.35, random_state=0)

In supervised learning, the correct values to the label are known. The algorithm is trained by passing the features and label to it. The algorithm then discovers the relationships and patterns between the features and the correct label. The following line creates an ML model with a logistic regression algorithm against the training data:

clf1 = LogisticRegression(C=1/reg).fit(X_train, Y_train)

Logistic regression is a statistical method for analyzing data sets in which one or more independent variables determine an outcome (bit.ly/2zQ1hVe). In this case, the dimensions of the sepals and petals determine the Iris species.
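The whole train-and-score flow can be reproduced outside Workbench in a few lines. This sketch substitutes scikit-learn's bundled copy of the data set for the template's cleaned iris.csv, but otherwise mirrors the script's split, fit and score steps:

```python
# Minimal sketch of the template's train-and-score flow, using
# scikit-learn's bundled Iris data in place of the cleaned iris.csv.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.35, random_state=0)

reg = 0.01  # the template's default regularization rate
clf1 = LogisticRegression(C=1/reg, max_iter=1000).fit(X_train, Y_train)
print('Accuracy is {}'.format(clf1.score(X_test, Y_test)))
```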

Once trained, the algorithm is tested for accuracy by calling the score method on the model:

accuracy = clf1.score(X_test, Y_test)
print ("Accuracy is {}".format(accuracy))

The best way to understand this is to actually run the code. However, before that can be done, there’s one more step. This project uses matplotlib, a popular plotting library for Python. To install it, select Open Command Prompt from the File menu. At the command line, type the following:

pip install matplotlib

Once installed, type the following at the command line:

python iris_sklearn.py

In a few moments, the output should look like Figure 4.

Output of the iris_sklearn.py Program
Figure 4 Output of the iris_sklearn.py Program

As displayed in the command window, the accuracy of the model is 0.6792452830188679, meaning that it correctly guesses the species of Iris in the test data 67.92 percent of the time.
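Accuracy is a single number that can hide per-class behavior. Though not part of the template, a confusion matrix breaks the same test-set predictions down by species (this sketch again uses scikit-learn's bundled data as a stand-in for the template's iris.csv):

```python
# Not in the template: break accuracy down per species with a
# confusion matrix (rows = actual class, columns = predicted class).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.35, random_state=0)

clf1 = LogisticRegression(C=1/0.01, max_iter=1000).fit(X_train, Y_train)
cm = confusion_matrix(Y_test, clf1.predict(X_test))
print(cm)  # diagonal entries are correct predictions
```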

Executing the Code from Within Azure ML Workbench

While running the code in the command line is useful, Azure ML Workbench provides a way to make this simpler and capture information about the jobs that have run. Look for the Run button. To the immediate left of it, there are two dropdowns and a textbox. By default, it should look like Figure 5. Click Run.

Running Files in Azure ML Workbench
Figure 5 Running Files in Azure ML Workbench

This executes the script locally and tracks the execution of the script via the Jobs tab in Azure ML Workbench. After the program runs, output and details about the run will appear. To see that, click on the iris_sklearn.py entry in the run list. Choose the first record in the data grid under Runs. Review the Run Properties section to see basic performance statistics of the run. Scroll down to see the Metrics and Visualization sections to see the output from the script as shown in Figure 6.

Metrics and Visualizations Shown in Azure ML Workbench
Figure 6 Metrics and Visualizations Shown in Azure ML Workbench

In my previous article, I explained how to explore the results of a job and view the job history. Please refer to that for more details (msdn.com/magazine/mt814414).

Persisting a Trained Model

While running the iris_sklearn.py script either through Azure ML Workbench or the command line, you’ll likely notice that the process takes several seconds. On my Surface Book, it takes about nine seconds. While different hardware configurations will produce different results, the process is hardly instantaneous. Most of the processing time is devoted to training the model. Fortunately, there’s little need to continually train a model. Lines 79 through 82 take the trained model and persist it to disk using the Pickle library (bit.ly/2im9w3O):

print ("Export the model to model.pkl")
f = open('./outputs/model.pkl', 'wb')
pickle.dump(clf1, f)
f.close()

Lines 86 and 87 demonstrate how to restore the trained model from disk:

f2 = open('./outputs/model.pkl', 'rb')
clf2 = pickle.load(f2)
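The script opens and closes its file handles explicitly; an equivalent and slightly more idiomatic round trip uses with blocks, which close the files automatically. This sketch writes to a temporary directory rather than the script's ./outputs folder, and trains a small stand-in model just so there's something to pickle:

```python
# Idiomatic variant of the template's pickle round trip:
# with-blocks close the file handles automatically.
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, Y = load_iris(return_X_y=True)
clf1 = LogisticRegression(max_iter=1000).fit(X, Y)

path = os.path.join(tempfile.gettempdir(), 'model.pkl')
with open(path, 'wb') as f:        # closed automatically on exit
    pickle.dump(clf1, f)
with open(path, 'rb') as f2:
    clf2 = pickle.load(f2)

print(type(clf2).__name__)  # the restored estimator
```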

The next step is to create some sample data and use the model to predict the species, which is done on lines 89 through 98:

# Predict on a new sample
X_new = [[3.0, 3.6, 1.3, 0.25]]
print ('New sample: {}'.format(X_new))
# Add random features to match the training data
X_new_with_random_features = np.c_[X_new, random_state.randn(1, n)]
# Score on the new sample
pred = clf2.predict(X_new_with_random_features)
print('Predicted class is {}'.format(pred))

If you refer back to Figure 4, you can see toward the lower middle of the screenshot that the predicted class is [‘Iris-setosa’].
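The call to predict returns only the winning class. Though the template doesn't use it, LogisticRegression also exposes predict_proba, which reports the model's confidence in each species. This sketch trains on the four real features only (unlike the template's model, which appends random features), so the probabilities are illustrative rather than identical to the script's:

```python
# Beyond predict: per-class probabilities for the article's sample.
# This model is trained on the four real features only.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

X_new = [[3.0, 3.6, 1.3, 0.25]]  # the article's sample measurements
probs = clf.predict_proba(X_new)[0]
for name, p in zip(iris.target_names, probs):
    print('{}: {:.3f}'.format(name, p))
```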

Passing Parameters

You may have noticed the Arguments textbox next to the Run button. Earlier, I left this field blank.

In the iris_sklearn.py file, lines 47 and 48 check for the presence of a parameter, convert its value to a float and assign it to the reg variable. If no parameter is passed to the program, the variable retains the value it was initialized with on line 45, which is 0.01:

if len(sys.argv) > 1:
  reg = float(sys.argv[1])

The reg value sets the regularization rate; its inverse is passed as the C parameter to LogisticRegression, as seen earlier in the clf1 line. Regularization introduces additional information in order to avoid overfitting the model. Overfitting occurs when a model fits its training data too closely, capturing noise rather than the underlying pattern; a model that performs extremely well on the training data will often generalize poorly to data outside that set. More information about regularization and overfitting is available at bit.ly/2kfLU1f and bit.ly/2iatJpC, respectively.
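A small sweep makes the relationship concrete: a larger reg means a smaller C and therefore stronger regularization. This sketch uses scikit-learn's bundled data as a stand-in for the template's CSV and simply records the test accuracy for a few rates:

```python
# Sweep the regularization rate; larger reg -> smaller C -> stronger
# regularization. Uses scikit-learn's bundled Iris data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.35, random_state=0)

accs = {}
for reg in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=1/reg, max_iter=1000).fit(X_train, Y_train)
    accs[reg] = clf.score(X_test, Y_test)
    print('reg={} accuracy={:.4f}'.format(reg, accs[reg]))
```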

Enter the number 10 into the Arguments textbox and click Run once more, making sure that Local and iris_sklearn.py are both selected. When completed, click on the Jobs tab, browse through it and choose iris_sklearn.py. Note that the charts now have a second data point, as shown in Figure 7.

A Second Data Point Added to the iris_sklearn.py Jobs Tab Charts
Figure 7 A Second Data Point Added to the iris_sklearn.py Jobs Tab Charts

Now, click on the files icon and then on the run.py file. This program calls iris_sklearn.py repeatedly, halving the regularization parameter on each pass until the value drops below 0.005. Because I already ran the script with a parameter of 10, I’d like to change the starting value on line 7 to 5. However, as you may have noticed, the text isn’t editable. To edit it, click on the dropdown in the upper-left portion of the text area to switch to edit mode. There’s an option to edit in another program, such as Visual Studio Code, but in this case choose Edit in Workbench Text Editor (see Figure 8).

Switch to Edit Mode
Figure 8 Switch to Edit Mode

After changing line 7 to reg = 5, click on the save icon. Next, click on Open Command Prompt from the File menu and type the following command and hit enter:

python run.py

The program will run, passing a command to the underlying OS that executes the iris_sklearn.py file through the Azure ML Workbench system. Notice the program output in the command-line window and inside the Jobs pane within Azure ML Workbench:

os.system('az ml experiment submit -c local ./iris_sklearn.py {}'.format(reg))
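The loop around that call can be sketched as a dry run: it halves the regularization rate on every pass until the rate drops below 0.005. Here the az command is only built and printed; run.py passes each one to os.system instead:

```python
# Dry-run sketch of run.py's loop: halve reg until it falls below
# 0.005. run.py passes each command to os.system; here we just print.
reg = 5.0
cmds = []
while reg > 0.005:
    cmds.append(
        'az ml experiment submit -c local ./iris_sklearn.py {}'.format(reg))
    reg = reg / 2
print('\n'.join(cmds))
```

Starting from 5, the loop submits ten runs before the rate (0.0048828125 on the final halving) falls below the threshold.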

Click on the Jobs icon in the toolbar on the left-hand side of the screen and then click the entry for iris_sklearn.py. Notice how many more data points there are in the graphs. The run.py program executed the script a number of times with a different regularization rate each time. 

Wrapping Up

In this article, I explored a common sample data set in data science with Azure ML Workbench, demonstrating the power and flexibility of the program. While not as straightforward or approachable as Azure ML Studio, Azure ML Workbench opens up a lot more possibilities to the data scientist and ML practitioner. First and foremost is the ability to install and consume any Python library, including pickle, matplotlib and, of course, scikit-learn. There’s also a command-line interface that can accept commands to install Python libraries and run Python code as tracked jobs inside Azure ML Workbench. Tracked jobs have the added benefit of letting you explore the results graphically, making data experimentation faster.

Azure ML Workbench includes several more features that I will explore in future articles, such as support for Jupyter notebooks, source control integration and Docker images. The tool truly brings great capabilities to the data science field.

Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

Thanks to the following technical expert for reviewing this article: Andy Leonard
