Volume 33 Number 2
Using Jupyter Notebooks
By Frank La Vigne | February 2018
The Jupyter Notebook is an open source, browser-based tool that allows users to create and share documents that contain live code, visualizations and text. Jupyter Notebooks are not application development environments, per se. Rather, they provide an interactive “scratch pad” where data can be explored and experimented with. They offer a browser-based, interactive shell for various programming languages, such as Python or R, and give data scientists and engineers a way to quickly experiment with data and share code, observations and visualizations. Notebooks can be run locally on a PC or in the cloud through various services, and are a great way to explore data sets and get a feel for a new programming language. For developers accustomed to a more traditional IDE, however, they can be bewildering at first.
Structure of a Jupyter Notebook
Jupyter Notebooks consist of a series of “cells” arranged in a linear sequence. Each cell can be either text or code. Text cells are formatted in Markdown, a lightweight markup language with plain text formatting syntax. Code cells contain code in the language associated with the particular notebook. Python notebooks can execute only Python code and not R, while an R notebook can execute R and not Python.
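The cell structure described above is visible in the notebook file itself, which is stored as JSON. The following is a minimal sketch, assuming the version 4 notebook format; the cell contents are made up for illustration, and a real notebook carries more metadata than is shown here.

```python
import json

# A minimal notebook in the version 4 JSON layout: an ordered list of cells
# plus metadata naming the kernel. The cell contents are illustrative only.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A text cell\n", "Formatted with *Markdown*."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('A code cell')"],
        },
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2,
}

# Serialize the way a notebook file is stored on disk, then read it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

# The linear sequence of cells and the single language tied to the notebook.
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
kernel_language = loaded["metadata"]["kernelspec"]["language"]
print(cell_types, kernel_language)
```

The kernel named in the metadata is what ties a notebook to one language, which is why a Python notebook runs its code cells as Python and not R.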
Figure 1 shows the IntroToJupyterPython notebook available as a sample file on the Data Science Virtual Machine. Note the language indicator for the notebook in the upper right corner of the browser window, showing that the notebook is attached to the Python 2 runtime. The circle to the right of “Python 2” indicates the current state of the kernel. A filled circle indicates that the kernel is in use and a program is executing. A hollow circle indicates the kernel is idle. Also take note that the main body of the notebook contains text as well as code and a graphed plot.
Figure 1 Tutorial Notebook Introducing the Core Features of a Jupyter Notebook
Creating a Jupyter Notebook in Azure Notebooks
There are several options to run a Jupyter notebook. However, the fastest way to get started is by using the Azure Notebook service, which is in preview mode at the time of this writing. Browse over to notebooks.azure.com and sign in with your Microsoft ID credentials. If prompted, grant the application the permissions it asks for. For first-time users, the site will prompt you for a public user ID. This will create a URL to host your profile and to share notebooks. If you do not wish to set this up at this time, click “No Thanks.” Otherwise, enter a value and click Save.
The screen will now show your profile page. The Azure Notebook service stores Jupyter Notebooks in Libraries. In order to create a notebook, you must first create a library. Under Libraries, there’s a button to add a library. Click on it to create a new library. In the dialog box that follows, enter a name for the Library and an ID for the URL to share it. If you wish to make this library public, check the box next to “Public Library.” Checking the box next to “Create a README.md” will automatically insert a README.md file for documentation purposes. Click Create to create a new library.
Now, your profile page will have one library listed. Click on it to bring up the contents of the library. Right now, the only item is the README.md file. Click on the New button to add a new item. In the ensuing dialog, enter a name for the new item, choose Python 3.6 Notebook from the drop-down list next to Item Type, and click New.
Once the item is created, it will appear in the library with a .IPYNB file extension. Click on it to launch an instance of the Jupyter server in Azure. Note that a new browser tab or window will open and that the interface looks more like the screen in Figure 1. Click inside the text box and write the following code:
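A minimal cell consistent with the Figure 2 caption would be the following. The exact string and the variable name are assumptions, not the original listing.

```python
# Assumed first cell: print a greeting to confirm the kernel is working.
message = "Hello World!"
print(message)
```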
Choose Run Cells from the Cell menu at the top of the page. The screen should look like Figure 2.
Figure 2 Hello World! in a Jupyter Notebook
Click inside the blank cell that Jupyter added and choose Cell > Cell Type > Markdown from the menu bar. Then add the following text.
# This is markdown text
# Markdown is a versatile and lightweight markup language.
Click the save icon and close the browser tab. Back in the Library window, click on the notebook file again. The page will reload and the markdown formatting will take effect.
Next, add another code cell by clicking in the markdown cell and choosing Insert Cell Below from the Insert menu. Previously, I stated that only Python code could be executed in a Python notebook. That is not entirely true, as you can use the “!” command to issue shell commands. Enter the following command into this new cell.
! ls -l
Choose Run Cells from the Cell menu or click the icon with the Play/Pause symbol on it. The command returns a listing of the contents of the directory, which contains the notebook file and the README.md file. Once again, Jupyter added a blank cell after the response. Type the following code into the blank cell:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
plt.scatter(x, y)
plt.show()
Run the cell and after a moment a scatter plot will appear in the results. Even to those familiar with Python, the first line of code may look unfamiliar, as it is not standard Python but a command handled by the IPython kernel, which executes Python code in a Jupyter notebook. The command %matplotlib inline instructs the IPython runtime to display graphs generated by matplotlib in-line with the results. This type of command, known as a “magic” command, starts with “%”. A full exploration of magic commands warrants its own article. For further information on magic commands, refer to the IPython documentation at bit.ly/2CfiMvh.
For those not familiar with Python, the previous code segment imports two libraries, NumPy and Matplotlib. NumPy is a Python package for scientific computing (numpy.org) and Matplotlib is a popular 2D graph plotting library for Python (matplotlib.org). The code then generates two arrays of 100 random numbers and plots the results as a scatter plot. The final line of code displays the graph. As the numbers are randomly generated, the graph will change slightly each time the cell is executed.
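If a plot that stays the same between runs is preferable, the random number generator can be seeded. The sketch below is one way to do that, not part of the original walkthrough; the helper name random_points and the seed value 42 are arbitrary choices made for illustration.

```python
import numpy as np


def random_points(seed, n=100):
    """Return two arrays of n random numbers drawn from a seeded generator."""
    rng = np.random.RandomState(seed)
    return rng.rand(n), rng.rand(n)


# Two calls with the same seed produce identical arrays, so a scatter plot
# built from them would look the same every time the cell is executed.
x1, y1 = random_points(seed=42)
x2, y2 = random_points(seed=42)
same = bool(np.array_equal(x1, x2) and np.array_equal(y1, y2))
print(same)
```

Passing x1 and y1 to plt.scatter in place of x and y would then draw the same points on every run.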
Notebooks in ML Workbench
So far, I have demonstrated running Jupyter notebooks as part of the Azure Notebook service in the cloud. However, Jupyter notebooks can be run locally, as well. In fact, Jupyter notebooks are integrated into the Azure Machine Learning Workbench product. In my previous article, I demonstrated the sample Iris Classification project. While the article did not mention it, there is a notebook included with the project that details all the steps needed to create a model, as well as a 3D plot of the iris dataset.
To view it, open the Iris Classifier sample project from last month’s column, “Creating Models in Azure ML Workbench” (msdn.com/magazine/mt814992). If you did not create the project, follow the directions in the article to create a project from the template. Inside the project, shown in Figure 3, click on the third icon (1) from the top of the vertical toolbar on the left-hand side of the window, then click on iris (2) in the file list.
Figure 3 Viewing Notebooks in an Azure Machine Learning Workbench Project
The notebook file loads, but the notebook server is not running—the results shown are cached from a previous execution. To make the notebook interactive, click on the Start Notebook Server (3) button to activate the local notebook server.
Scroll down to the empty cell immediately following the 3D graph and enter the following code to view the first five records in the iris data set:
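Assuming the notebook holds the data in a pandas DataFrame, the call is most likely head(5) on that DataFrame. The sketch below uses the hypothetical name iris and a handful of illustrative rows so that it runs on its own; in the sample notebook, substitute whatever variable the earlier cells define.

```python
import pandas as pd

# Illustrative rows only; the sample notebook loads the full iris data set.
# The DataFrame name and column labels here are assumptions.
iris = pd.DataFrame(
    {
        "Sepal Length": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.3],
        "Sepal Width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.2, 3.3],
        "Petal Length": [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 6.0],
        "Petal Width": [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 2.5],
        "Species": ["setosa"] * 5 + ["versicolor", "virginica"],
    }
)

# head(5) returns the first five rows as a new DataFrame.
first_five = iris.head(5)
print(first_five)
```

In a notebook cell, leaving iris.head(5) as the last expression renders the rows as a formatted table without the print call.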
Choose Insert Cell Below from the Insert menu and enter the following code into the empty cell to display the correlation matrix:
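With pandas, a correlation matrix is typically produced by calling corr() on a DataFrame of numeric columns. The following sketch assumes that approach and uses a few illustrative rows under the hypothetical name measurements, so its numbers will not match Figure 4.

```python
import pandas as pd

# Illustrative numeric rows only; the name and values are assumptions.
measurements = pd.DataFrame(
    {
        "Sepal Length": [5.1, 4.9, 4.7, 4.6, 5.0, 7.0, 6.3],
        "Sepal Width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.2, 3.3],
        "Petal Length": [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 6.0],
        "Petal Width": [0.2, 0.2, 0.2, 0.2, 0.2, 1.4, 2.5],
    }
)

# corr() computes the pairwise Pearson correlation between the columns and
# returns a square DataFrame labeled by column name on both axes.
correlations = measurements.corr()
print(correlations.round(3))
```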
The output should look like Figure 4.
Figure 4 Correlation Matrix for the Iris Data Set
A correlation matrix displays the correlation coefficient between various fields in a data set. A correlation coefficient measures the linear dependence between two variables, with values closer to 1 indicating a positive correlation and values closer to -1 indicating a negative correlation. Values closer to 0 indicate a lack of correlation between the two fields. For example, there’s a strong correlation between Petal Width and Petal Length with a value of 0.962757. On the other hand, the correlation between Sepal Width and Sepal Length is much weaker with a value of -0.109369. Naturally, each field has a 1.0 correlation with itself.
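To make the coefficient concrete, here is a small sketch of the Pearson correlation computed by hand with only the standard library. The function name pearson and the sample lists are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations, and the squared deviations of each list.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


rising = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(rising, [2.0, 4.0, 6.0, 8.0])    # moves together with rising
r_neg = pearson(rising, [8.0, 6.0, 4.0, 2.0])    # moves opposite to rising
r_none = pearson(rising, [1.0, -1.0, -1.0, 1.0])  # no linear relationship
r_self = pearson(rising, rising)                  # a field against itself
print(r_pos, r_neg, r_none, r_self)
```

The four results illustrate the cases described above: a perfect positive correlation, a perfect negative correlation, no linear correlation, and a field correlated with itself.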
Thus far, I’ve only used Jupyter notebooks as part of either a Microsoft cloud service or locally using Microsoft software. However, Jupyter is open source and can run independent of the Microsoft ecosystem. One popular toolset is Anaconda (anaconda.com/download), an open source distribution of Python and R for Windows, Mac and Linux. Jupyter ships as part of this install. Running Jupyter locally starts a Web server on port 8888 by default. Note that, on my system, I can only create a Python 3 notebook as that is the only kernel I have installed on my PC.
Data Science Virtual Machines
Running a Jupyter notebook server locally is ideal for scenarios where Internet access isn’t reliable or guaranteed. For more compute-intensive tasks, it may be wiser to create a virtual machine and run Jupyter on more powerful hardware. To make this task easier, Azure offers the Data Science Virtual Machine image for both Windows and Linux, with the most popular data science tools already installed.
Creating a VM from this image is fast and simple. From the Azure Portal, click on the New icon and search for “Data Science Virtual Machine.” There are several options available. However, I’ve found that the Ubuntu image is the most feature-packed. Choose the Data Science Virtual Machine for Linux (Ubuntu) image and create a virtual machine by following the steps in the wizard. Once the machine is up and running, configure the VM for remote desktop access. Refer to documentation on how to connect to a Linux VM at bit.ly/2qgHOZo.
When connected to the machine, double-click on the Jupyter icon on the desktop. A terminal window will open, followed by a browser window a moment later. When clicking on the New button to create a new notebook, you have quite a few more choices of environments and languages, as demonstrated in Figure 5.
Figure 5 Runtimes Available for the Data Science Virtual Machine for Ubuntu
Along with the various runtime environments, the Data Science Virtual Machine for Ubuntu includes numerous sample notebooks. These notebooks provide guidance on everything from the basics of Azure ML to more advanced topics like CNTK and TensorFlow.
Jupyter notebooks are an essential tool for data science work, but they tend to confuse many developers because the platform lacks the basic features needed to develop software. This is by design. Jupyter notebooks are not intended for that task.
What notebooks do is provide a collaborative mechanism where data scientists can explore data sets, experiment with different hypotheses and share observations with colleagues. Jupyter notebooks can run locally on a PC, Mac or Linux machine. Azure ML Workbench even includes a notebook server embedded into the product for easier experimentation with data. Notebooks can also be run in the cloud as part of a service, such as Azure Notebooks, or on a VM with more capable hardware.
Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical experts for reviewing this article: Andy Leonard