Introduction to Azure Data Science Virtual Machine for Linux and Windows

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server and Linux. We offer Windows editions of the DSVM on Windows Server 2016 and Windows Server 2012, and Linux editions on Ubuntu 16.04 LTS and CentOS 7.4.

This topic discusses what you can do with the Data Science VM, outlines some of the key scenarios for using the VM, itemizes the key features available on the Windows and Linux versions, and provides instructions on how to get started using them.

What can I do with the Data Science Virtual Machine?

The goal of the Data Science Virtual Machine (DSVM) is to provide data professionals of all skill levels and across industries with a friction-free, pre-configured, and fully integrated data science environment. Instead of rolling out a comparable workspace on your own, you can provision a DSVM, which saves days or even weeks of installation, configuration, and package management. After your DSVM has been allocated, you can immediately begin working on your data science project.

The Data Science VM is designed and configured for a broad range of usage scenarios. You can scale your environment up or down as your project needs change, use your preferred language to program data science tasks, and install other tools to customize the system for your exact needs.

Key Scenarios

This section suggests some key scenarios for which the Data Science VM can be deployed.

Preconfigured analytics desktop in the cloud

The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden and saving on the time needed to evaluate, install, and maintain the various software packages needed to do advanced analytics.

Data science training and education

Enterprise trainers and educators who teach data science classes usually provide a virtual machine image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases the support and incompatibility challenges. Cases where these environments need to be built frequently, especially for shorter training classes, benefit substantially.

On-demand elastic capacity for large-scale projects

Data science hackathons and competitions, as well as large-scale data modeling and exploration, require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly and on demand, on scaled-out servers that can run experiments requiring high-powered computing resources.

Short-term experimentation and evaluation

The Data Science VM can be used to evaluate or learn tools such as Microsoft ML Server, SQL Server, Visual Studio tools, Jupyter, deep learning and ML toolkits, and new tools popular in the community, with minimal setup effort. Because the Data Science VM can be set up quickly, it also suits other short-term usage scenarios such as replicating published experiments, executing demos, and following walkthroughs in online sessions and conference tutorials.

Deep learning

The Data Science VM can be used for training models using deep learning algorithms on GPU-based (graphics processing unit) hardware. Using the VM scaling capabilities of the Azure cloud, the DSVM lets you use GPU-based hardware in the cloud as needed. You can switch to a GPU-based VM when you train large models or need high-speed computation, while keeping the same OS disk. The Windows Server 2016 edition of the DSVM comes pre-installed with GPU drivers and GPU versions of deep learning frameworks. On the Linux edition, deep learning on GPU is enabled on both the CentOS and Ubuntu DSVMs. You can also deploy the Ubuntu, CentOS, or Windows 2016 edition of the Data Science VM to an Azure virtual machine that is not GPU-based, in which case all the deep learning frameworks fall back to CPU mode.
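As a rough illustration of the GPU or CPU distinction described above, the following Python sketch reports which mode a given VM is likely to run in. It is not how the deep learning frameworks themselves choose a device; each framework does its own detection. The sketch assumes that the NVIDIA driver utility `nvidia-smi` is on the PATH of a GPU-based DSVM, and the function name `detect_compute_mode` is hypothetical.

```python
import shutil
import subprocess


def detect_compute_mode(probe="nvidia-smi"):
    """Return "gpu" if the NVIDIA utility lists at least one GPU, else "cpu".

    Assumption: on a GPU-based VM the driver utility named by `probe`
    is installed and on the PATH. This is an illustrative check only.
    """
    path = shutil.which(probe)
    if path is None:
        # No driver utility found, so treat the machine as CPU-only.
        return "cpu"
    try:
        # "nvidia-smi -L" prints one line per GPU, for example "GPU 0: ...".
        result = subprocess.run(
            [path, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    gpu_lines = [ln for ln in result.stdout.splitlines() if ln.startswith("GPU ")]
    return "gpu" if result.returncode == 0 and gpu_lines else "cpu"


if __name__ == "__main__":
    print("Expected framework mode:", detect_compute_mode())
```

Running the script before and after resizing a VM to a GPU-based size is one way to confirm that the driver sees the new hardware.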

What's included in the Data Science VM?

The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products, such as Microsoft ML Server (R, Python) for building predictive models or SQL Server 2017 for large-scale data set exploration. A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks. The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.

Tool | Windows Edition | Linux Edition
Microsoft R Open with popular packages pre-installed | Y | Y
Microsoft ML Server (R, Python) Developer Edition, which includes:
    * RevoScaleR/revoscalepy parallel and distributed high-performance framework (R & Python)
    * MicrosoftML - New state-of-the-art ML algorithms from Microsoft
    * R and Python Operationalization
Microsoft Office Pro-Plus with shared activation - Excel, Word and PowerPoint | Y | N
Anaconda Python 2.7, 3.5 with popular packages pre-installed | Y | Y
JuliaPro with popular packages for Julia language pre-installed | Y | Y
Relational databases | SQL Server 2017 Developer Edition | PostgreSQL (CentOS); SQL Server 2017 Developer Edition (Ubuntu)
Database tools | SQL Server Management Studio; SQL Server Integration Services; bcp, sqlcmd; ODBC/JDBC drivers | SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers
Scalable in-database analytics with SQL Server ML services (R, Python) | Y | N
Jupyter Notebook Server with the following kernels: | Y | Y
    * R | Y | Y
    * Python | Y | Y
    * Julia | Y | Y
    * PySpark | Y | Y
    * Sparkmagic | N | Y (Ubuntu only)
    * SparkR | N | Y
JupyterHub (multi-user notebook server) | N | Y
JupyterLab (next-generation notebook interface) | N | Y (Ubuntu only)
Development tools, IDEs, and code editors
    * Visual Studio 2017 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data tools, Node.js, Python, and R Tools for Visual Studio (RTVS) | Y | N
    * Visual Studio Code | Y | Y
    * RStudio Desktop | Y | Y
    * RStudio Server | N | Y
    * PyCharm Community Edition | N | Y
    * Atom | N | Y
    * Juno (Julia IDE) | Y | Y
    * Vim and Emacs | Y | Y
    * Git and GitBash | Y | Y
    * OpenJDK | Y | Y
    * .NET Framework | Y | N
Power BI Desktop | Y | N
SDKs to access Azure and Cortana Intelligence Suite of services | Y | Y
Data movement and management tools
    * Azure Storage Explorer | Y | Y
    * Azure CLI | Y | Y
    * Azure PowerShell | Y | N
    * AzCopy | Y | N
    * Blob FUSE driver | N | Y
    * AdlCopy (Azure Data Lake Storage) | Y | N
    * DocDB Data Migration Tool | Y | N
    * Microsoft Data Management Gateway: move data between on-premises and cloud | Y | N
    * Unix/Linux command-line utilities | Y | Y
Apache Drill for data exploration | Y | Y
Machine learning tools
    * Integration with Azure Machine Learning (R, Python) | Y | Y
    * XGBoost | Y | Y
    * Vowpal Wabbit | Y | Y
    * Weka | Y | Y
    * Rattle | Y | Y
    * LightGBM | N | Y (Ubuntu only)
    * CatBoost | N | Y (Ubuntu only)
    * H2O, Sparkling Water | N | Y (Ubuntu only)
Deep learning tools (all tools work on a GPU or CPU)
    * Microsoft Cognitive Toolkit (CNTK) | Y (Windows 2016) | Y
    * TensorFlow | Y (Windows 2016) | Y
    * Horovod | N | Y (Ubuntu only)
    * MXNet | Y (Windows 2016) | Y
    * Caffe & Caffe2 | N | Y
    * Chainer | N | Y
    * Torch | N | Y
    * Theano | N | Y
    * Keras | N | Y
    * PyTorch | N | Y
    * NVIDIA DIGITS | N | Y
    * MXNet Model Server | N | Y
    * TensorFlow Serving | N | Y
    * TensorRT | N | Y
    * CUDA, cuDNN, NVIDIA Driver | Y | Y
Big data platform (dev/test only)
    * Local Spark Standalone | Y | Y
    * Local Hadoop (HDFS, YARN) | N | Y
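
One way to check which of the command-line tools listed above are present on a particular VM is a short inventory script. The following Python sketch is an illustration only: the mapping from tool names to executable names (`CANDIDATE_COMMANDS`) is an assumption, not something taken from the DSVM image, and the helper names are hypothetical. Tools that are installed but not on the PATH are reported as not found.

```python
import shutil

# Assumed executable names for a few tools from the table above.
# These are illustrative guesses; adjust them for the edition you deployed.
CANDIDATE_COMMANDS = {
    "Git": "git",
    "Jupyter": "jupyter",
    "Azure CLI": "az",
    "sqlcmd": "sqlcmd",
    "bcp": "bcp",
    "Vim": "vim",
    "Emacs": "emacs",
    "Java (OpenJDK)": "java",
}


def find_tools(commands=CANDIDATE_COMMANDS):
    """Map each tool name to its resolved path, or None if not on the PATH."""
    return {name: shutil.which(cmd) for name, cmd in commands.items()}


def format_report(found):
    """Render one line per tool, sorted by tool name."""
    lines = []
    for name in sorted(found):
        path = found[name]
        lines.append("{:<16} {}".format(name, path if path else "not found on PATH"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(find_tools()))
```

The report covers only command-line tools; desktop applications such as the IDEs in the table are easier to confirm from the desktop or start menu.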

Get started

Windows Data Science VM

Linux Data Science VM

Next steps

R developer's guide to Azure