Exploring Azure Data with Apache Drill, Now Pre-Installed on the Microsoft Data Science Virtual Machine

This post is authored by Gopi Kumar, Principal Program Manager in Microsoft's Data Group.

We recently came across Apache Drill, a very interesting data analytics tool. The introduction page to Drill describes it well:

"Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis on the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language."

Drill supports a range of data sources, including flat files, RDBMS, NoSQL databases and Hadoop/Hive, whether they are stored on a local server or desktop or on cloud platforms like Azure and AWS. It can query formats such as CSV/TSV, JSON and relational tables, all from the familiar ANSI SQL language (SQL remains one of the most popular languages used in data science and analytics). The best part of querying data with Drill is that the data stays in its original source, and you can join data across multiple sources. Drill is designed for low latency and high throughput, and can scale from a single machine to thousands of nodes.
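To make the idea of joining data across sources concrete, here is a minimal Python sketch that builds a Drill SQL join between a CSV file and a JSON file and packages it for Drill's REST endpoint. The file paths, column positions and field names (`name`, `id`) are hypothetical. We also assume a Drill instance listening on its default web port (8047) with the `dfs` storage plugin enabled and a header-less CSV, so treat this as an illustration, not the tutorial's exact steps.

```python
import json
import urllib.request

# Assumption: Drill is running locally on its default web port.
DRILL_URL = "http://localhost:8047/query.json"


def build_join_query(csv_path, json_path):
    """Build a join between a header-less CSV file and a JSON file.

    Drill exposes header-less CSV fields as columns[0], columns[1], ...
    The JSON field names (name, id) are hypothetical.
    """
    return (
        "SELECT o.columns[0] AS order_id, c.name AS customer_name\n"
        f"FROM dfs.`{csv_path}` AS o\n"
        f"JOIN dfs.`{json_path}` AS c\n"
        "  ON CAST(o.columns[1] AS BIGINT) = c.id"
    )


def build_request(sql, url=DRILL_URL):
    """Wrap a SQL string in the JSON body that Drill's REST API expects."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Submit the query and return the result rows (a list of dicts)."""
    with urllib.request.urlopen(build_request(sql, url)) as response:
        return json.load(response)["rows"]
```

Calling `run_query(build_join_query("/data/orders.csv", "/data/customers.json"))` against a running Drill instance would return the joined rows. Neither file is copied or loaded anywhere first.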

We are excited to announce that Apache Drill is now pre-installed on the Data Science Virtual Machine (DSVM). The DSVM is Microsoft's custom virtual machine image on Azure, pre-installed and configured with a host of popular tools that are commonly used in data science, machine learning and AI. Think of the DSVM as an analytics desktop in the cloud, serving beginners as well as advanced data scientists, analysts and engineers.

Azure already provides several data services to store and process analytical data, ranging from blobs, files, relational databases and NoSQL databases to Big Data technologies, supporting varied types of data, scaling and performance needs, and price points. We wanted to demonstrate how easy it is to set up Drill to explore data stored on four different Azure data services: Azure Blob Storage, Azure SQL Data Warehouse, Azure DocumentDB (a managed NoSQL database) and Azure HDInsight (i.e. managed Hadoop) Hive tables.

Towards that end, we've published a tutorial on the Cortana Intelligence Gallery that walks you through the installation and how to query data with Drill. The tutorial also guides you through the steps to set up connections from Drill to the different Azure data services.

Drill also provides an ODBC/JDBC interface, allowing you to perform data exploration in your favorite BI tool, such as Excel, Power BI or Tableau, using SQL queries. You can also query data from any programming language that has ODBC/JDBC interfaces, such as R or Python.
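As a rough illustration of the programmatic route, the Python sketch below queries Drill over ODBC using the third-party `pyodbc` package. The DSN name `DrillDSN` is hypothetical, and we assume a Drill ODBC driver has already been installed and configured under that name. The sample query reads `employee.json`, a data set that ships with Drill on its classpath (`cp`) storage plugin.

```python
# Sample query against the employee.json data set bundled with Drill.
SAMPLE_SQL = "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5"


def build_connection_string(dsn="DrillDSN"):
    """Return an ODBC connection string for a configured DSN.

    "DrillDSN" is a hypothetical name; use whatever DSN you configured.
    """
    return f"DSN={dsn}"


def fetch_rows(sql, dsn="DrillDSN"):
    """Run a query through ODBC and return the rows as a list of dicts."""
    import pyodbc  # third-party package, imported here so the helpers above work without it

    # Drill does not support transactions, so autocommit is turned on.
    connection = pyodbc.connect(build_connection_string(dsn), autocommit=True)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        column_names = [column[0] for column in cursor.description]
        return [dict(zip(column_names, row)) for row in cursor.fetchall()]
    finally:
        connection.close()
```

With a working DSN, `fetch_rows(SAMPLE_SQL)` would return up to five employee records. The same connection string could be reused from other ODBC-capable tools.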

While on the Data Science Virtual Machine, we encourage you to also take a look at the other useful tools and samples that come pre-built. If you're new to the DSVM (which is available in Windows and Linux editions, plus a deep learning extension to run on Nvidia GPUs), we invite you to give the DSVM a try through an Azure free trial. We also have a timed test drive, available for the Linux DSVM now, that does not require an Azure account. You will find more resources to get you started with the DSVM below.

In summary, Apache Drill can be a powerful tool in your arsenal, and can help you be nimbler with your data science projects and gain faster business insights on your big data. Data scientists and analysts can now start exploring data in its native store without having to wait for ETL pipelines to be built, and without having to do extensive data prep or client side coding to bring together data from multiple sources. This can be a huge boost to your teams' agility and productivity.



Windows Edition:

Linux Edition:


https://channel9.msdn.com/blogs/Cloud-and-Enterprise-Premium/Inside-the-Data-Science-Virtual-Machine (Duration: 1 Hour)