Create a shared pool of Data Science Virtual Machines

This article discusses how you can create a shared pool of Data Science Virtual Machines (DSVMs) for a team to use. The benefits of using a shared pool are better resource utilization, facilitation of sharing and collaboration, and more effective management of DSVM resources.

You can use many methods and technologies to create a pool of DSVMs. This article focuses on pools for batch processing and interactive VMs.

Batch-processing pool

If you want to set up a pool of DSVMs mainly to run offline batch jobs, you can use either the Azure Batch AI service or the Azure Batch service. This article focuses on Azure Batch AI.

The Ubuntu edition of the DSVM is supported as one of the images in Azure Batch AI. When you create the Azure Batch AI cluster with Azure CLI or the Python SDK, you can set the image parameter to UbuntuDSVM. You can also choose the kind of processing nodes you want: GPU-based or CPU-only instances, the number of CPUs, and the amount of memory, from the wide range of VM instances available on Azure.

When you use the Ubuntu DSVM image in Batch AI with GPU-based nodes, all the necessary GPU drivers and deep learning frameworks are preinstalled. The preinstallation saves you considerable time in preparing the batch nodes. In fact, if you're developing interactively on an Ubuntu DSVM, you'll notice that the Batch AI nodes have exactly the same setup and configuration as that environment.

Typically, when you create a Batch AI cluster, you also create a file share that all the nodes mount. The file share is used for input and output data and for storing the batch job code and scripts.

After you create a Batch AI cluster, you can use the same CLI or Python SDK to submit jobs to be run. You pay for only the time that's used to run the batch jobs.

For more information, see:

  • Step-by-step walkthrough of using Azure CLI to manage Batch AI
  • Step-by-step walkthrough of using Python to manage Batch AI
  • Batch AI recipes that demonstrate how to use various AI and deep learning frameworks with Batch AI

Interactive VM pool

A pool of interactive VMs that is shared by the whole AI/data science team lets users log in to any available instance of the DSVM instead of keeping a dedicated instance for each set of users. This setup improves availability and makes more effective use of resources.

The technology that you use to create an interactive VM pool is Azure virtual machine scale sets. You can use scale sets to create and manage a group of identical, load-balanced, and autoscaling VMs.

The user logs in to the main pool's IP or DNS address. The scale set automatically routes the session to an available DSVM in the scale set. Because users want a similar environment regardless of the VM they're logging in to, all instances of the VM in the scale set mount a shared network drive, like an Azure Files share or an NFS share. The user's shared workspace is normally kept on the shared file store that's mounted on each of the instances.
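As a rough illustration of the shared mount described above, the following sketch builds an /etc/fstab entry for an Azure Files share mounted over SMB. The function name, the account and share names, and the credentials file path are assumptions made for this example and are not taken from the sample template. The wide-open file and directory modes are also only illustrative.

```shell
#!/usr/bin/env bash
# Sketch: build an /etc/fstab entry that mounts an Azure Files share over SMB (CIFS).
# All names used here are placeholders chosen for this example.

azure_files_fstab_line() {
  local account="$1"      # storage account name
  local share="$2"        # file share name
  local mount_point="$3"  # local directory; use the same path on every instance
  # Assumed location of a root-only file with "username=" and "password=" lines.
  local cred_file="/etc/smbcredentials/${account}.cred"

  printf '//%s.file.core.windows.net/%s %s cifs nofail,credentials=%s,dir_mode=0777,file_mode=0777,serverino\n' \
    "$account" "$share" "$mount_point" "$cred_file"
}

# Example (run as root on each instance; not executed here):
#   azure_files_fstab_line mystorageacct teamshare /mnt/teamshare >> /etc/fstab
#   mount -a
```

Because every instance uses the same mount point, a user's files appear at the same path no matter which VM the session lands on.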

You can find a sample Azure Resource Manager template that creates a scale set with Ubuntu DSVM instances on GitHub. A sample of the parameter file for the Azure Resource Manager template is in the same location.

You can create the scale set from the Azure Resource Manager template by specifying values for the parameter file in Azure CLI.

az group create --name [[NAME OF RESOURCE GROUP]] --location [[DATA CENTER, FOR EXAMPLE "West US 2"]]
az group deployment create --resource-group [[NAME OF RESOURCE GROUP ABOVE]] --template-uri [[URI OF THE RESOURCE MANAGER TEMPLATE]] --parameters @[[PARAMETER JSON FILE]]

The preceding commands assume you have a copy of the parameter file with the values specified for your instance of the scale set, including:

  • The number of VM instances.
  • Pointers to the Azure Files share.
  • Credentials for the storage account that will be mounted on each VM.

The parameter file is referenced locally in the commands. You can also pass parameters inline or prompt for them in your script.
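One possible way to pass parameters inline and prompt for a secret is sketched below. The parameter names instanceCount and storageAccountKey are hypothetical, as are the function name and the STORAGE_KEY environment variable. Substitute the parameter names that the template actually defines.

```shell
#!/usr/bin/env bash
# Sketch: deploy the template with inline parameters instead of a parameter file.
# "instanceCount" and "storageAccountKey" are HYPOTHETICAL parameter names.

deploy_scale_set() {
  local resource_group="$1"
  local template_uri="$2"
  local instance_count="$3"
  # Take the key from the environment if it is set; otherwise prompt for it.
  local storage_key="${STORAGE_KEY:-}"

  if [ -z "$storage_key" ]; then
    # -s keeps the secret from being echoed to the terminal (bash only).
    read -r -s -p "Storage account key: " storage_key
    echo
  fi

  az group deployment create \
    --resource-group "$resource_group" \
    --template-uri "$template_uri" \
    --parameters instanceCount="$instance_count" storageAccountKey="$storage_key"
}

# Example (not executed here):
#   deploy_scale_set my-dsvm-pool-rg "<template URI>" 3
```

Newer versions of Azure CLI expose the same operation as az deployment group create, which accepts the same --resource-group, --template-uri, and --parameters arguments.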

The preceding template opens the SSH and JupyterHub ports from the front end of the scale set to the back-end pool of Ubuntu DSVMs. As a user, you log in to the VM over SSH or JupyterHub in the normal way. Because VM instances can be added or removed dynamically, any state must be saved in the mounted Azure Files share. You can use the same approach to create a pool of Windows DSVMs.

The script that mounts the Azure Files share is also available in the Azure DataScienceVM repository on GitHub. The script mounts the Azure Files share at the mount point specified in the parameter file. It also creates soft links to the mounted drive in the initial user's home directory. A user-specific notebook directory within the Azure Files share is soft linked to the $HOME/notebooks/remote directory so that users can access, run, and save their Jupyter notebooks. You can use the same convention when you create additional users on the VM to point each user's Jupyter workspace to the Azure Files share.
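The soft-link convention for additional users could look roughly like the sketch below. This is not the repository's script. The function name and the per-user directory layout on the share (mount point, then user name, then notebooks) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: point a user's Jupyter workspace at a directory on the mounted share.
# The layout <mount point>/<user>/notebooks is an ASSUMED convention.

link_remote_notebooks() {
  local mount_point="$1"   # where the Azure Files share is mounted
  local user_name="$2"     # user whose workspace is being linked
  local home_dir="$3"      # that user's home directory
  local remote_dir="${mount_point}/${user_name}/notebooks"

  # Create the user's directory on the share and the local notebooks folder.
  mkdir -p "$remote_dir" "${home_dir}/notebooks"

  # -s creates a soft link, -f replaces an existing link, and -n treats an
  # existing link as a file so a rerun replaces it instead of nesting inside it.
  ln -sfn "$remote_dir" "${home_dir}/notebooks/remote"
}

# Example (not executed here):
#   link_remote_notebooks /mnt/teamshare alice /home/alice
```

Anything saved under notebooks/remote then lands on the share, so it remains available after an instance is removed from the scale set.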

Virtual machine scale sets support autoscaling. You can set rules on when to create additional instances and when to scale down instances. For example, you can scale down to zero instances to save on cloud hardware usage costs when the VMs are not used at all. The documentation pages of virtual machine scale sets provide detailed steps for autoscaling.
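A minimal sketch of such rules with Azure CLI follows. It uses the az monitor autoscale commands. The function name, the profile name, and the CPU thresholds are assumptions chosen for this example, not recommended values.

```shell
#!/usr/bin/env bash
# Sketch: attach CPU-based autoscale rules to an existing scale set.
# Thresholds and the profile name are illustrative assumptions.

configure_autoscale() {
  local resource_group="$1"
  local scale_set="$2"     # name of the existing scale set
  local min_count="$3"     # lowest number of instances to keep
  local max_count="$4"     # highest number of instances to allow
  local profile="${scale_set}-autoscale"

  # Create the autoscale profile with instance limits.
  az monitor autoscale create \
    --resource-group "$resource_group" \
    --resource "$scale_set" \
    --resource-type Microsoft.Compute/virtualMachineScaleSets \
    --name "$profile" \
    --min-count "$min_count" --max-count "$max_count" --count "$min_count"

  # Add one instance when average CPU stays above 70% for 5 minutes.
  az monitor autoscale rule create \
    --resource-group "$resource_group" \
    --autoscale-name "$profile" \
    --condition "Percentage CPU > 70 avg 5m" \
    --scale out 1

  # Remove one instance when average CPU stays below 25% for 5 minutes.
  az monitor autoscale rule create \
    --resource-group "$resource_group" \
    --autoscale-name "$profile" \
    --condition "Percentage CPU < 25 avg 5m" \
    --scale in 1
}

# Example (not executed here):
#   configure_autoscale my-dsvm-pool-rg my-dsvm-pool 1 10
```

Metric-based rules like these need at least one running instance to report CPU usage, so scaling all the way down to zero may call for a schedule-based profile or a manual step instead.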

Next steps