Introduction to the cloud-based Data Science Virtual Machine for Linux and Windows

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server 2012 and on Linux. The Linux edition of the DSVM is offered on either Ubuntu 16.04 LTS or the OpenLogic 7.2 CentOS-based Linux distribution.

This topic discusses what you can do with the Data Science VM, outlines some of the key scenarios for using the VM, itemizes the key features available on the Windows and Linux versions, and provides instructions on how to get started using them.

What can I do with the Data Science Virtual Machine?

The goal of the Data Science Virtual Machine is to provide data professionals at all skill levels and roles with a friction-free data science environment. This VM saves you the considerable time you would otherwise spend rolling out a comparable environment on your own. Instead, you can start your data science project immediately in a newly created VM instance.

The Data Science VM is designed and configured for a broad range of usage scenarios. You can scale your environment up or down as your project needs change. You can use your preferred language to program data science tasks. You can also install other tools and customize the system for your exact needs.

Key Scenarios

This section suggests some key scenarios for which the Data Science VM can be deployed.

Preconfigured analytics desktop in the cloud

The Data Science VM provides a baseline configuration for data science teams looking to replace their local desktops with a managed cloud desktop. This baseline ensures that all the data scientists on a team have a consistent setup with which to verify experiments and promote collaboration. It also lowers costs by reducing the sysadmin burden and saving on the time needed to evaluate, install, and maintain the various software packages needed to do advanced analytics.

Data science training and education

Enterprise trainers and educators who teach data science classes usually provide a virtual machine image to ensure that their students have a consistent setup and that the samples work predictably. The Data Science VM creates an on-demand environment with a consistent setup that eases the support and incompatibility challenges. The benefit is substantial where these environments need to be built frequently, especially for shorter training classes.

On-demand elastic capacity for large-scale projects

Data science hackathons, competitions, and large-scale data modeling and exploration require scaled-out hardware capacity, typically for a short duration. The Data Science VM can help replicate the data science environment quickly on demand, on scaled-out servers that allow experiments requiring high-powered computing resources to be run.

Short-term experimentation and evaluation

The Data Science VM can be used to evaluate or learn tools such as Microsoft R Server, SQL Server, Visual Studio tools, Jupyter, deep learning / ML toolkits, and new tools popular in the community with minimal setup effort. Since the Data Science VM can be set up quickly, it can be applied in other short-term usage scenarios such as replicating published experiments, executing demos, and following walkthroughs in online sessions or conference tutorials.

Deep learning

The Data Science VM can be used for training models using deep learning algorithms on GPU (graphics processing unit) based hardware. The DSVM lets you use GPU-based hardware in the cloud only when needed, for example when you have to train large models or need high-speed computations that take advantage of the power of a GPU. On Windows, the Deep Learning Toolkit for DSVM is currently provided as a separate add-on on top of the DSVM. This add-on automatically installs the GPU drivers, frameworks, and GPU versions of the deep learning algorithms while creating your VM instance. On Linux, deep learning on GPU is enabled only on the Data Science Virtual Machine for Linux (Ubuntu) edition. You can deploy the Ubuntu edition of the Data Science VM to a non-GPU-based Azure virtual machine, in which case all the deep learning frameworks fall back to CPU mode. The CentOS-based Linux edition of the DSVM contains only the CPU builds of some of the deep learning tools (CNTK, TensorFlow, MXNet) and does not come preinstalled with the GPU drivers and frameworks.
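The CPU fallback described above can be illustrated with a minimal Python sketch. It is not how the DSVM or any deep learning framework actually selects a device; it only assumes that an installed NVIDIA driver exposes the `nvidia-smi` utility on the PATH, and the function name `detect_compute_mode` is hypothetical.

```python
import shutil


def detect_compute_mode(which=shutil.which):
    """Return "gpu" if the NVIDIA driver utility is found on PATH, else "cpu".

    Assumption: a VM with GPU drivers installed has `nvidia-smi` on PATH.
    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    return "gpu" if which("nvidia-smi") else "cpu"


# Report which mode a framework would likely end up using on this machine.
print("Likely compute mode:", detect_compute_mode())
```

On a VM without GPU drivers, such as the CentOS-based edition or an Ubuntu edition deployed to a non-GPU VM size, this check would be expected to report CPU mode.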

What's included in the Data Science VM?

The Data Science Virtual Machine has many popular data science and deep learning tools already installed and configured. It also includes tools that make it easy to work with various Azure data and analytics products. You can explore and build predictive models on large-scale data sets using the Microsoft R Server or using SQL Server 2016. A host of other tools from the open source community and from Microsoft are also included, as well as sample code and notebooks. The following table itemizes and compares the main components included in the Windows and Linux editions of the Data Science Virtual Machine.

Tool Windows Edition Linux Edition
Microsoft R Open with popular packages pre-installed Y Y
Microsoft R Server Developer Edition, which includes the following: Y Y (MicrosoftML not yet available)
    * ScaleR parallel and distributed high performance R framework
    * MicrosoftML - New state-of-the-art ML algorithms from Microsoft
    * R Operationalization
Anaconda Python 2.7, 3.5 with popular packages pre-installed Y Y
JuliaPro with popular packages for Julia language pre-installed Y Y
Relational databases: SQL Server 2016 SP1 Developer Edition (Windows); PostgreSQL (Linux)
Database tools (Windows): SQL Server Management Studio; SQL Server Integration Services; bcp, sqlcmd; ODBC/JDBC drivers
Database tools (Linux): SQuirreL SQL (querying tool); bcp, sqlcmd; ODBC/JDBC drivers
Scalable in-database analytics with SQL Server R services Y N
Jupyter Notebook Server with the following kernels: Y Y
    * R Y Y
    * Python 2.7 & 3.5 Y Y
    * Julia Y Y
    * PySpark N Y
    * Sparkmagic N Y (Ubuntu Only)
    * SparkR N Y
JupyterHub (Multi-user notebooks server) N Y
Development tools, IDEs and Code editors
    * Visual Studio 2015 (Community Edition) with Git Plugin, Azure HDInsight (Hadoop), Data Lake, SQL Server Data tools, Node.js, Python, and R Tools for Visual Studio (RTVS) Y N
    * Visual Studio Code Y Y
    * RStudio Desktop Y Y
    * RStudio Server N Y
    * PyCharm N Y
    * Atom N Y
    * Juno (Julia IDE) Y Y
    * Vim and Emacs Y Y
    * Git and GitBash Y Y
    * OpenJDK Y Y
    * .Net Framework Y N
Power BI Desktop Y N
SDKs to access Azure and Cortana Intelligence Suite of services Y Y
Data Movement and management Tools
    * Azure Storage Explorer Y Y
    * Azure CLI Y Y
    * Azure PowerShell Y N
    * AzCopy Y N
    * AdlCopy (Azure Data Lake Storage) Y N
    * DocDB Data Migration Tool Y N
    * Microsoft Data Management Gateway : Move data between OnPrem and Cloud Y N
    * Unix/Linux Command Line Utilities Y Y
Apache Drill for Data exploration Y Y
Machine Learning Tools
    * Integration with Azure Machine Learning (R, Python) Y Y
    * Xgboost Y Y
    * Vowpal Wabbit Y Y
    * Weka Y Y
    * Rattle Y Y
    * LightGBM N Y (Ubuntu Only)
    * H2O N Y (Ubuntu only)
GPU-based deep learning tools (Windows: use the Deep Learning Toolkit for DSVM; Linux: Ubuntu edition only)
    * Microsoft Cognitive Toolkit (CNTK) Y Y
    * Tensorflow Y Y
    * MXNet Y Y
    * Caffe & Caffe2 N Y
    * Torch N Y
    * Theano N Y
    * Keras N Y
    * NVIDIA DIGITS N Y
    * CUDA, cuDNN, NVIDIA driver Y Y
Big Data Platform (Devtest only)
    * Local Spark Standalone N Y
    * Local Hadoop (HDFS, YARN) N Y
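
To see which command-line tools from the table are actually reachable on a given VM, one option is a small PATH check. This is only a sketch: the mapping `TOOL_COMMANDS` and the executable names in it (`git`, `jupyter`, `julia`, `az`, `vim`) are assumptions and may differ from what a particular image installs.

```python
import shutil

# Hypothetical mapping from tool names in the table to executable names.
# The executable names are assumptions, not taken from the DSVM documentation.
TOOL_COMMANDS = {
    "Git": "git",
    "Jupyter Notebook Server": "jupyter",
    "Julia": "julia",
    "Azure CLI": "az",
    "Vim": "vim",
}


def find_tools(tool_commands, which=shutil.which):
    """Map each tool name to its resolved executable path, or None if absent."""
    return {tool: which(command) for tool, command in tool_commands.items()}


def missing_tools(tool_commands, which=shutil.which):
    """Return the sorted names of tools whose executable is not on PATH."""
    found = find_tools(tool_commands, which)
    return sorted(tool for tool, path in found.items() if path is None)


# Print one line per tool with its location, or a note that it was not found.
for tool, path in find_tools(TOOL_COMMANDS).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

Graphical tools and services in the table (for example RStudio Desktop or JupyterHub) are not covered by a PATH lookup like this and would need a different check.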

Get started with the Windows Data Science VM

  • Create an instance of the VM on Windows by navigating to this page and selecting the green Create Virtual Machine button.
  • Sign in to the VM from your remote desktop using the credentials you specified when you created the VM.
  • To discover and launch the tools available, click the Start menu.

Get started with the Linux Data Science VM

  • Create an instance of the VM on Linux
    • For the OpenLogic CentOS-based edition navigate to this page and select the Get it now button.
    • For the Ubuntu edition navigate to this page and select the Get it now button.
  • Sign in to the VM from an SSH client, such as PuTTY or the ssh command, using the credentials you specified when you created the VM.
  • At the shell prompt, enter dsvm-more-info.
  • For a graphical desktop, download the X2Go client for your client platform here and follow the instructions in the Linux Data Science VM document Provision the Linux Data Science Virtual Machine.
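
The SSH sign-in step can be sketched in Python as a helper that assembles the arguments for the OpenSSH `ssh` client. The helper name `build_ssh_command`, the user name `dsvmuser`, and the host `my-dsvm.example.com` are hypothetical placeholders. Running `dsvm-more-info` as a remote command, rather than typing it at the interactive prompt, is also an assumption.

```python
import shlex


def build_ssh_command(user, host, remote_command=None, port=22):
    """Build an argument list for the OpenSSH `ssh` client.

    If `remote_command` is given, it is appended so that ssh runs it on the
    VM instead of opening an interactive shell.
    """
    args = ["ssh", "-p", str(port), f"{user}@{host}"]
    if remote_command:
        args.append(remote_command)
    return args


# Hypothetical values; substitute the user name and address of your own VM.
command = build_ssh_command("dsvmuser", "my-dsvm.example.com", "dsvm-more-info")

# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on a machine that has an `ssh` client installed. Password or key prompts are handled by `ssh` itself.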

Next steps

For the Windows Data Science VM

For the Linux Data Science VM