Introduction to Azure Data Science Virtual Machine for Linux and Windows
The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server and Linux. The Windows editions of the DSVM run on Windows Server 2016 and Windows Server 2012. The Linux editions of the DSVM run on Ubuntu 16.04 LTS and CentOS 7.4.
This article discusses what you can do with the Data Science VM. It outlines some of the key scenarios for using the VM and itemizes the key features available on the Windows and Linux versions. The article also provides instructions on how to get started using them.
What can I do with the Data Science Virtual Machine?
The goal of the Data Science Virtual Machine (DSVM) is to provide data professionals of all skill levels and across industries with a friction-free, pre-configured, and fully integrated data science environment. Instead of rolling out a comparable workspace on your own, you can provision a DSVM, which saves you days or even weeks of installation, configuration, and package management. After your DSVM has been allocated, you can immediately begin working on your data science project.
The Data Science VM is designed and configured for working with a broad range of usage scenarios. You can scale your environment up or down as your project requirements change. You can also use your preferred language to program data science tasks and install other tools to customize the system for your exact needs.
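As a rough illustration of scaling up or down, the sketch below assembles the Azure CLI commands that resize an existing VM. It only builds the argument lists and does not run them. The resource group, VM name, and target size are placeholder values, and `build_resize_commands` is a hypothetical helper written for this example, not part of any Azure SDK.

```python
from typing import List


def build_resize_commands(resource_group: str, vm_name: str, new_size: str) -> List[List[str]]:
    """Return Azure CLI argument lists that deallocate, resize, and restart a VM.

    Deallocating first is a conservative assumption made for this sketch:
    some target sizes can only be selected once the VM has been deallocated.
    """
    target = ["--resource-group", resource_group, "--name", vm_name]
    return [
        ["az", "vm", "deallocate", *target],
        ["az", "vm", "resize", *target, "--size", new_size],
        ["az", "vm", "start", *target],
    ]


if __name__ == "__main__":
    # Placeholder names; replace them with your own resource group, VM, and size.
    for command in build_resize_commands("my-resource-group", "my-dsvm", "Standard_NC6"):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run` once you have reviewed it. Keeping command construction separate from execution makes it easy to inspect what would be run before anything changes.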
This section suggests some key scenarios for which the Data Science VM can be deployed.
Preconfigured analytics desktop in the cloud
The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden, which saves the time needed to evaluate, install, and maintain the various software packages used for advanced analytics.
Data science training and education
Enterprise trainers and educators who teach data science classes usually provide a virtual machine image. They provide the image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases the support and incompatibility challenges. Classes where these environments need to be built frequently, especially shorter training classes, benefit substantially.
On-demand elastic capacity for large-scale projects
Data science hackathons and competitions, or large-scale data modeling and exploration, require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly and on demand, on scaled-out servers that allow experiments requiring high-powered computing resources to be run.
Custom compute power for Azure Notebooks
Azure Notebooks is a free hosted service to develop, run, and share Jupyter notebooks in the cloud with no installation. The free service tier, however, is limited to 4 GB of memory and 1 GB of data. To remove these limits, you can attach a Notebooks project to a Data Science VM or any other VM running a Jupyter server. If you sign in to Azure Notebooks with an account that uses Azure Active Directory (such as a corporate account), Notebooks automatically shows Data Science VMs in any subscriptions associated with that account. For more information, see Manage and configure projects - Compute tier.
Short-term experimentation and evaluation
The Data Science VM can be used to evaluate or learn tools such as Microsoft ML Server, SQL Server, Visual Studio tools, Jupyter, deep learning / ML toolkits, and new tools popular in the community with minimal setup effort. Since the Data Science VM can be set up quickly, it can be applied in other short-term usage scenarios. These scenarios include replicating published experiments, executing demos, and following walkthroughs in online sessions and conference tutorials.
Deep learning
The Data Science VM can be used to train models with deep learning algorithms on GPU (graphics processing unit) based hardware. By using the VM scaling capabilities of the Azure cloud, the DSVM lets you use GPU-based hardware in the cloud as needed. You can switch to a GPU-based VM when you train large models or need high-speed computations, while keeping the same OS disk. The Windows Server 2016 edition of the DSVM comes pre-installed with GPU drivers, frameworks, and GPU versions of deep learning frameworks. On Linux, deep learning on GPU is enabled on both the CentOS and Ubuntu DSVMs. You can also deploy the Ubuntu, CentOS, or Windows 2016 edition of the Data Science VM to an Azure virtual machine that has no GPU. In that case, all the deep learning frameworks fall back to CPU mode.
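To make the GPU-or-CPU choice concrete, here is a minimal sketch of how a script might check whether a GPU is visible before deciding where to train. It assumes that a successful `nvidia-smi -L` call that lists at least one device is a reasonable stand-in for "a GPU is available"; the deep learning frameworks perform their own, more reliable detection. The helper names `gpu_visible` and `pick_device` are hypothetical.

```python
import shutil
import subprocess
from typing import Callable


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    # The driver utility may be absent, or present without any GPU attached.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def pick_device(probe: Callable[[], bool] = gpu_visible) -> str:
    """Return "gpu" when the probe finds a GPU, otherwise fall back to "cpu"."""
    return "gpu" if probe() else "cpu"


if __name__ == "__main__":
    print(f"Training would run on: {pick_device()}")
```

Passing the probe as a parameter keeps the decision logic testable on machines that have no GPU.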
What's included in the Data Science VM?
The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products, such as Microsoft ML Server (R, Python) for building predictive models or SQL Server 2017 for large-scale data set exploration. The Data Science VM includes a host of other tools from the open-source community and from Microsoft, as well as sample code and notebooks. The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.
|Tool||Windows Edition||Linux Edition|
|Microsoft R Open with popular packages pre-installed||Y||Y|
|Microsoft ML Server (R, Python) Developer Edition, which includes: RevoScaleR/revoscalepy parallel and distributed high-performance framework (R & Python); MicrosoftML, new state-of-the-art ML algorithms from Microsoft; R and Python operationalization|
|Microsoft Office Pro-Plus with shared activation - Excel, Word, and PowerPoint||Y||N|
|Anaconda Python 2.7, 3.5 with popular packages pre-installed||Y||Y|
|JuliaPro with popular packages for Julia language pre-installed||Y||Y|
|Relational Databases||SQL Server 2017||SQL Server 2017 Developer Edition (Ubuntu)|
|Database tools||SQL Server Management Studio; SQL Server Integration Services; bcp, sqlcmd; ODBC/JDBC drivers||SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers|
|Scalable in-database analytics with SQL Server ML services (R, Python)||Y||N|
|Jupyter Notebook Server with the following kernels:||Y||Y|
|* Sparkmagic||N||Y (Ubuntu only)|
|JupyterHub (Multi-user notebook server)||N||Y|
|JupyterLab (next-generation notebook interface)||N||Y (Ubuntu only)|
|Development tools, IDEs, and Code editors|
|* Visual Studio 2019 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data tools, Node.js, Python, and R Tools for Visual Studio (RTVS)||Y||N|
|* Visual Studio Code||Y||Y|
|* RStudio Desktop||Y||Y|
|* RStudio Server||N||Y|
|* PyCharm Community Edition||N||Y|
|* Juno (Julia IDE)||Y||Y|
|* Vim and Emacs||Y||Y|
|* Git and GitBash||Y||Y|
|* .NET Framework||Y||N|
|Power BI Desktop||Y||N|
|SDKs to access Azure and Cortana Intelligence Suite of services||Y||Y|
|Data Movement and management Tools|
|* Azure Storage Explorer||Y||Y|
|* Azure CLI||Y||Y|
|* Azure PowerShell||Y||N|
|* Blob FUSE driver||N||Y|
|* AdlCopy (Azure Data Lake Storage)||Y||N|
|* DocDB Data Migration Tool||Y||N|
|* Microsoft Data Management Gateway: Move data between OnPrem and Cloud||Y||N|
|* Unix/Linux Command-Line Utilities||Y||Y|
|Apache Drill for Data exploration||Y||Y|
|Machine Learning Tools|
|* Integration with Azure Machine Learning (R, Python)||Y||Y|
|* Vowpal Wabbit||Y||Y|
|* LightGBM||N||Y (Ubuntu only)|
|* CatBoost||N||Y (Ubuntu only)|
|* H2O, Sparkling Water||N||Y (Ubuntu only)|
|Deep Learning Tools (all tools work on a GPU or CPU)|
|* Microsoft Cognitive Toolkit (CNTK)||Y (Windows 2016)||Y|
|* TensorFlow||Y (Windows 2016)||Y|
|* Horovod||N||Y (Ubuntu)|
|* MXNet||Y (Windows 2016)||Y|
|* Caffe & Caffe2||N||Y|
|* NVIDIA DIGITS||N||Y|
|* MXNet Model Server||N||Y|
|* TensorFlow Serving||N||Y|
|* CUDA, cuDNN, NVIDIA Driver||Y||Y|
|Big Data Platform (dev/test only)|
|* Local Spark Standalone||Y||Y|
|* Local Hadoop (HDFS, YARN)||N||Y|
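After provisioning a VM, one quick way to see which command-line tools from the table are actually reachable is to look them up on the `PATH`. The sketch below does that with the standard library only. The executable names in `DEFAULT_TOOLS` are assumptions about how a few of the listed tools are invoked, and `find_tools` is a hypothetical helper; GUI applications and tools installed outside the `PATH` will not be detected this way.

```python
import shutil
from typing import Callable, Dict, Iterable, Optional

# Assumed executable names for a few tools from the table above.
DEFAULT_TOOLS = ("git", "az", "jupyter", "vim", "emacs", "sqlcmd", "bcp")


def find_tools(
    names: Iterable[str] = DEFAULT_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10s} {path or 'not found'}")
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be replaced with a stub when testing.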
Windows Data Science VM
- For more information on how to create a Windows DSVM and use it, see Provision the Windows Data Science Virtual Machine. For more information on how to perform various tasks needed for your data science project on the Windows DSVM, see Ten things you can do on the Data Science Virtual Machine.
Linux Data Science VM
- For more information on how to create an Ubuntu DSVM and use it, see Provision the Data Science Virtual Machine for Linux (Ubuntu). For more information on how to create a CentOS DSVM and use it, see Provision a Linux CentOS Data Science Virtual Machine on Azure.
- For a walkthrough that shows you how to perform several common data science tasks with the Linux VM, both CentOS and Ubuntu, see Data science on the Linux Data Science Virtual Machine.