What is Azure Data Science Virtual Machine for Linux and Windows?

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics.

The tool configurations are rigorously tested by data scientists and developers at Microsoft and by the broader data science community to ensure stability and general viability.

The DSVM is available on:

  • Windows Server 2016, Windows Server 2012
  • Ubuntu 16.04 LTS and CentOS 7.4

All Deep Learning VM tools have been folded into Data Science VM.

What can I do with DSVM?

The goal of the Data Science Virtual Machine (DSVM) is to provide data professionals of all skill levels and across industries with a friction-free, pre-configured, and fully integrated data science environment. Instead of rolling out a comparable workspace on your own, you can provision a DSVM, saving days or even weeks of installation, configuration, and package management. After your DSVM has been allocated, you can immediately begin working on your data science project.

The Data Science VM is designed and configured for working with a broad range of usage scenarios. You can scale your environment up or down as your project requirements change. You can also use your preferred language to program data science tasks and install other tools to customize the system for your exact needs.
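
As a minimal sketch of scaling a VM up or down, assuming the Azure CLI (`az`) is installed and signed in, the following Python helper builds an `az vm resize` command. The resource group, VM name, and size shown are hypothetical placeholders; the size you choose must be available for your VM in its region.

```python
import subprocess


def build_resize_command(resource_group, vm_name, size):
    """Return the Azure CLI argument list that resizes an existing VM."""
    return [
        "az", "vm", "resize",
        "--resource-group", resource_group,
        "--name", vm_name,
        "--size", size,
    ]


def resize_vm(resource_group, vm_name, size, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    command = build_resize_command(resource_group, vm_name, size)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical names; the default dry run only prints the command.
    resize_vm("my-dsvm-rg", "my-dsvm", "Standard_DS3_v2")
```

The dry-run default is a deliberate choice so that the command can be reviewed before anything is changed in a subscription.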

Preconfigured analytics desktop in the cloud

The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden, saving the time needed to evaluate, install, and maintain the various software packages used for advanced analytics.
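
One way a team might confirm that a desktop still matches the shared baseline is to compare installed Python package versions against an agreed list. The sketch below uses only the Python standard library; the baseline packages and versions are hypothetical and are not taken from any DSVM image.

```python
from importlib import metadata


def installed_versions(packages):
    """Look up the installed version of each package (None if it is missing)."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions


def find_version_drift(baseline, installed):
    """Return {package: (expected, found)} for every package that differs."""
    return {
        package: (expected, installed.get(package))
        for package, expected in baseline.items()
        if installed.get(package) != expected
    }


if __name__ == "__main__":
    # Hypothetical team baseline, for illustration only.
    team_baseline = {"numpy": "1.16.0", "pandas": "0.24.0"}
    drift = find_version_drift(team_baseline, installed_versions(team_baseline))
    for package, (expected, found) in drift.items():
        print(f"{package}: expected {expected}, found {found}")
```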

Data science training and education

Enterprise trainers and educators who teach data science classes usually provide a virtual machine image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases support and incompatibility challenges. Cases where these environments need to be built frequently, especially for shorter training classes, benefit substantially.
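
For a class, building one environment per student could be scripted. The following is a minimal sketch that only generates `az vm create` commands, assuming the Azure CLI and a Linux edition of the DSVM (the `--generate-ssh-keys` flag applies to Linux VMs). The image value is a placeholder that you must replace with the DSVM image identifier from the Azure Marketplace, and the resource group, VM names, and size are hypothetical.

```python
def build_create_commands(resource_group, image, size, student_ids,
                          admin_username="azureuser"):
    """Return one `az vm create` argument list per student."""
    commands = []
    for student_id in student_ids:
        commands.append([
            "az", "vm", "create",
            "--resource-group", resource_group,
            "--name", f"dsvm-{student_id}",   # hypothetical naming scheme
            "--image", image,
            "--size", size,
            "--admin-username", admin_username,
            "--generate-ssh-keys",
        ])
    return commands


if __name__ == "__main__":
    # "<dsvm-image-urn>" is a placeholder, not a real image identifier.
    for command in build_create_commands(
        "training-rg", "<dsvm-image-urn>", "Standard_DS3_v2", ["s01", "s02"]
    ):
        print(" ".join(command))
```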

On-demand elastic capacity for large-scale projects

Data science hackathons and competitions, or large-scale data modeling and exploration, require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly and on demand, on scaled-out servers that can run experiments requiring high-powered computing resources.

Custom compute power for Azure Notebooks

Azure Notebooks is a free hosted service to develop, run, and share Jupyter notebooks in the cloud with no installation. The free service tier, however, is limited to 4 GB of memory and 1 GB of data. To remove these limits and expand the available compute power, you can attach a Notebooks project to a Data Science VM or any other VM running a Jupyter server. If you sign in to Azure Notebooks with an account that uses Azure Active Directory (such as a corporate account), Notebooks automatically shows Data Science VMs in any subscriptions associated with that account.

Short-term experimentation and evaluation

The Data Science VM can be used to evaluate or learn tools such as Microsoft ML Server, SQL Server, Visual Studio tools, Jupyter, deep learning and ML toolkits, and new tools popular in the community, all with minimal setup effort. Because the Data Science VM can be set up quickly, it also suits other short-term usage scenarios, such as replicating published experiments, executing demos, and following walkthroughs in online sessions and conference tutorials.

Deep learning

The Data Science VM can be used to train models with deep learning algorithms on GPU (graphics processing unit) based hardware. Using the VM scaling capabilities of the Azure cloud, the DSVM lets you use GPU-based hardware in the cloud as needed. You can switch to a GPU-based VM when you train large models or need high-speed computations, while keeping the same OS disk. The Windows Server 2016 edition of the DSVM comes pre-installed with GPU drivers, frameworks, and GPU versions of deep learning frameworks. On the Linux edition, deep learning on GPU is enabled on both the CentOS and Ubuntu DSVMs. You can also deploy the Ubuntu, CentOS, or Windows 2016 edition of the Data Science VM to an Azure virtual machine that is not GPU-based. In this case, all the deep learning frameworks fall back to CPU mode. Learn more about available deep learning and AI frameworks.
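
To see which mode a given VM is likely to run in, a rough check is to ask the NVIDIA driver utility whether it can list any GPUs. The sketch below is a heuristic under that assumption, using `nvidia-smi -L` through the Python standard library. It is not how the deep learning frameworks themselves decide; each framework performs its own device detection.

```python
import shutil
import subprocess


def gpu_available(which=shutil.which, run=subprocess.run):
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    if which("nvidia-smi") is None:
        return False  # Driver utility not installed or not on PATH.
    try:
        result = run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # A non-zero exit code means the driver could not reach a GPU.
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    mode = "GPU" if gpu_available() else "CPU"
    print(f"Deep learning frameworks will likely run in {mode} mode")
```

Checking the command's result, rather than only its presence, matters because the GPU drivers can be pre-installed on an image that is deployed to a VM without a GPU.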

What's included on DSVM?

The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products, such as Microsoft ML Server (R, Python) for building predictive models or SQL Server 2017 for large-scale data set exploration. The Data Science VM includes a host of other tools from the open-source community and from Microsoft, as well as sample code and notebooks.

Tools and platforms:

The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.

Tool | Windows Edition | Linux Edition
Microsoft R Open with popular packages pre-installed | Y | Y
Microsoft ML Server (R, Python) Developer Edition, which includes: | |
    * RevoScaleR/revoscalepy parallel and distributed high-performance framework (R & Python) | |
    * MicrosoftML - New state-of-the-art ML algorithms from Microsoft | |
    * R and Python Operationalization | |
Microsoft Office Pro-Plus with shared activation - Excel, Word, and PowerPoint | Y | N
Anaconda Python 2.7, 3.5 with popular packages pre-installed | Y | Y
JuliaPro with popular packages for Julia language pre-installed | Y | Y
Relational Databases | SQL Server 2017 Developer Edition | PostgreSQL (CentOS), SQL Server 2017 Developer Edition (Ubuntu)
Database tools | SQL Server Management Studio; SQL Server Integration Services; bcp, sqlcmd; ODBC/JDBC drivers | SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers
Scalable in-database analytics with SQL Server ML services (R, Python) | Y | N
Jupyter Notebook Server with the following kernels: | Y | Y
    * R | Y | Y
    * Python | Y | Y
    * Julia | Y | Y
    * PySpark | Y | Y
    * Sparkmagic | N | Y (Ubuntu only)
    * SparkR | N | Y
JupyterHub (Multi-user notebook server) | N | Y
JupyterLab (Multi-user notebook server) | N | Y (Ubuntu only)
Development tools, IDEs, and Code editors | |
    * Visual Studio 2019 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data tools, Node.js, Python, and R Tools for Visual Studio (RTVS) | Y | N
    * Visual Studio Code | Y | Y
    * RStudio Desktop | Y | Y
    * RStudio Server | N | Y
    * PyCharm Community Edition | N | Y
    * Atom | N | Y
    * Juno (Julia IDE) | Y | Y
    * Vim and Emacs | Y | Y
    * Git and GitBash | Y | Y
    * OpenJDK | Y | Y
    * .NET Framework | Y | N
Power BI Desktop | Y | N
SDKs to access Azure and Cortana Intelligence Suite of services | Y | Y
Data Movement and management Tools | |
    * Azure Storage Explorer | Y | Y
    * Azure CLI | Y | Y
    * Azure PowerShell | Y | N
    * AzCopy | Y | N
    * Blob FUSE driver | N | Y
    * AdlCopy (Azure Data Lake Storage) | Y | N
    * DocDB Data Migration Tool | Y | N
    * Microsoft Data Management Gateway: move data between on-premises and cloud | Y | N
    * Unix/Linux Command-Line Utilities | Y | Y
Apache Drill for Data exploration | Y | Y
Machine Learning Tools | |
    * Integration with Azure Machine Learning (R, Python) | Y | Y
    * XGBoost | Y | Y
    * Vowpal Wabbit | Y | Y
    * Weka | Y | Y
    * Rattle | Y | Y
    * LightGBM | N | Y (Ubuntu only)
    * CatBoost | N | Y (Ubuntu only)
    * H2O, Sparkling Water | N | Y (Ubuntu only)
Deep Learning Tools (all tools work on a GPU or CPU) | |
    * Microsoft Cognitive Toolkit (CNTK) (Windows 2016) | Y | Y
    * TensorFlow | Y (Windows 2016) | Y
    * Horovod | N | Y (Ubuntu)
    * MXNet | Y (Windows 2016) | Y
    * Caffe & Caffe2 | N | Y
    * Chainer | N | Y
    * Torch | N | Y
    * Theano | N | Y
    * Keras | N | Y
    * PyTorch | N | Y
    * NVIDIA DIGITS | N | Y
    * MXNet Model Server | N | Y
    * TensorFlow Serving | N | Y
    * TensorRT | N | Y
    * CUDA, cuDNN, NVIDIA Driver | Y | Y
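
When choosing an edition, it can help to treat the comparison as data. The following sketch encodes a handful of rows copied from the table above (not the full list) and filters them by edition; the function names are illustrative only.

```python
# (tool, on_windows, on_linux): a small sample of rows from the table.
TOOLS = [
    ("Visual Studio Code", True, True),
    ("RStudio Server", False, True),
    ("Power BI Desktop", True, False),
    ("PyCharm Community Edition", False, True),
    ("Azure CLI", True, True),
    ("Blob FUSE driver", False, True),
]

_COLUMNS = {"windows": 1, "linux": 2}


def tools_for(edition):
    """List the sampled tools available on the given edition."""
    column = _COLUMNS[edition]  # Raises KeyError for an unknown edition.
    return [row[0] for row in TOOLS if row[column]]


def exclusive_to(edition):
    """List the sampled tools available on this edition but not the other."""
    other = "linux" if edition == "windows" else "windows"
    return [tool for tool in tools_for(edition) if tool not in tools_for(other)]


if __name__ == "__main__":
    print("Linux only:", exclusive_to("linux"))
    print("Windows only:", exclusive_to("windows"))
```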

Next steps

Learn more with these articles: