Configuration of Azure Blob Storage (aka WASB) as a Drill Data Source
NOTE This post is part of a series on a deployment of Apache Drill on the Azure cloud.
Azure Storage Blobs - aka WASB, after the Hadoop file system driver used to access them - provide a low-cost means to store files in Azure. As Drill is commonly used to query data residing in file systems, it makes sense to configure a Drill cluster deployed in Azure to read from WASB.
The Apache Drill web site identifies Microsoft Azure Storage as a data source but unfortunately provides no documentation about configuring Drill to use WASB. After a lot of trial and error, and after reaching out to a consultant at Veracity Group who I saw had posed a question on this topic in a Drill forum, I've been able to get my Drill cluster to use WASB. Here are the high-level steps I performed to achieve this:
- Download the WASB JAR files to each of my Drill VMs
- Edit the core-site.xml file on each VM to hold the key to my storage account
- Configure a Storage Plugin to use WASB
- Test Drill access to WASB
Before getting started, I've already set up an Azure Storage Account, created a container within it, and uploaded the file(s) I want to query into it. If you are not familiar with how to do this, check out this document to get oriented to the basic concepts. Also, consider using a third-party tool such as ClumsyLeaf's CloudXplorer to make working with the storage easier.
Download the WASB JAR Files
To get started, I SSH into each Drill VM, navigate to the /drill/current/jars/3rdparty directory and download the required JARs here:
sudo wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.1/hadoop-azure-2.7.1.jar
sudo wget http://central.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar
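Both download URLs follow the usual Maven repository layout (group path, artifact, version, then the JAR name). As a minimal sketch, the Python helper below rebuilds them from their coordinates, which can be handy if you need a different version. The function name jar_url and the JARS list are hypothetical names introduced for illustration; the base URL and versions are the ones used in the commands above.

```python
# Base URL taken from the wget commands in this post.
MAVEN_BASE = "http://central.maven.org/maven2"

def jar_url(group, artifact, version, base=MAVEN_BASE):
    """Build a JAR download URL from Maven coordinates (hypothetical helper)."""
    group_path = group.replace(".", "/")
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"

# The two JARs this post downloads into /drill/current/jars/3rdparty.
JARS = [
    ("org.apache.hadoop", "hadoop-azure", "2.7.1"),
    ("com.microsoft.azure", "azure-storage", "2.0.0"),
]

for group, artifact, version in JARS:
    print(jar_url(group, artifact, version))
```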
Edit the Core-Site.XML File
While SSH'ed into each Drill VM, I modify the core-site.xml file in /drill/current/conf to hold the key for my storage account:
sudo vi core-site.xml
With the config file open in vi, I add the following property with the appropriate substitutions for the storage account name (identified here as mydatafiles) and the storage account key (provided here using the fake key value of 8qifG/AzE4avR6Z60NX4nusnW8Imt1ki9ZMPQ5pJpfRhsZ5rWOS374gRg== ):
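As a sketch of what that property looks like, the Python snippet below prints a property element to paste inside the configuration element of core-site.xml. It assumes the hadoop-azure naming convention for account keys, fs.azure.account.key.ACCOUNT.blob.core.windows.net; the helper name account_key_property is hypothetical, and the key is the fake value from this post.

```python
from xml.sax.saxutils import escape

def account_key_property(account, key):
    """Render the core-site.xml <property> for a storage account key.

    Assumes the hadoop-azure property naming convention; this is a
    hypothetical helper, not part of Drill or Hadoop.
    """
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(key)}</value>\n"
        "</property>"
    )

# Substitute your own storage account name and key.
print(account_key_property(
    "mydatafiles",
    "8qifG/AzE4avR6Z60NX4nusnW8Imt1ki9ZMPQ5pJpfRhsZ5rWOS374gRg==",
))
```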
If you are not familiar with how to access the key to your storage account, check out this document.
Configure a Storage Plug-In
In the previous post in this series, I recommended you open TCP port 8047 on one of your Drill VMs in order to allow you to access the Drill Web Console from your local PC. In this step, I will return to the Web Console to configure a Storage Plug-In for my WASB account.
On my local PC, I open a browser to http://dr004.westus.cloudapp.azure.com:8047 (as dr004 is the VM on which I opened this port) and navigate to the Storage page.
Here I can see two default plug-ins, i.e. cp and dfs, already enabled. I click on the Update button associated with dfs and copy the JSON definition for this plug-in. I will use this as a template for my WASB plug-in. Once copied, simply click the Back button to return to the Storage page.
At the bottom of the Storage page, under the New Storage Plugin header, is a textbox next to a Create button. I enter the name of my storage account, e.g. mydatafiles, or something else that's reasonably user-friendly, into the textbox and click the Create button. In the resulting screen, I delete the null value and paste the JSON definition copied earlier. I change the "file:///" value assigned to the connection key to "wasb://mycontainer@mydatafiles.blob.core.windows.net/", assuming that the mydatafiles storage account has a container named mycontainer which holds the files I want to access.
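The connection value follows the pattern wasb://CONTAINER@ACCOUNT.blob.core.windows.net/. As a minimal sketch, the Python snippet below builds that value and swaps it into a copy of the dfs definition. The dictionary here is a trimmed stand-in for the JSON copied from the Web Console, and both function names are hypothetical.

```python
import copy

def wasb_connection(container, account):
    """Build the WASB connection string for a container in a storage account."""
    return f"wasb://{container}@{account}.blob.core.windows.net/"

def to_wasb_plugin(dfs_plugin, container, account):
    """Return a copy of the dfs plug-in definition pointed at WASB."""
    plugin = copy.deepcopy(dfs_plugin)  # leave the copied template untouched
    plugin["connection"] = wasb_connection(container, account)
    return plugin

# Trimmed stand-in for the JSON definition copied from the dfs plug-in.
dfs_template = {"type": "file", "enabled": True, "connection": "file:///"}

wasb_plugin = to_wasb_plugin(dfs_template, "mycontainer", "mydatafiles")
print(wasb_plugin["connection"])
# prints wasb://mycontainer@mydatafiles.blob.core.windows.net/
```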
Before clicking the Enable button to save this definition and enable the new Storage Plug-In, review the file formats in the JSON definition. In my environment, tab-delimited text files often employ the TXT file-extension. To make it easier to leverage these files, I add this format into the JSON definition as follows:
If you want to read more about adding formats to Storage Plugins, you can check out the Drill documentation here. Looking closely at the TSV definition in the default JSON, you might recognize that the TXT format specified here is just a copy-paste of the TSV definition with "tsv" changed to "txt".
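The copy-paste described above can be sketched in Python as follows. The tsv entry shown is what I would expect in the default dfs definition (a text format with a tab delimiter), so treat it as an assumption and compare it against the JSON in your own Web Console; the function name add_txt_format is hypothetical.

```python
import copy
import json

def add_txt_format(plugin):
    """Return a copy of the plug-in with a txt format cloned from tsv."""
    updated = copy.deepcopy(plugin)
    txt = copy.deepcopy(updated["formats"]["tsv"])
    txt["extensions"] = ["txt"]  # the only change: "tsv" becomes "txt"
    updated["formats"]["txt"] = txt
    return updated

# Trimmed plug-in definition; the tsv entry is assumed to match the default.
plugin = {
    "type": "file",
    "enabled": True,
    "connection": "wasb://mycontainer@mydatafiles.blob.core.windows.net/",
    "formats": {
        "tsv": {"type": "text", "extensions": ["tsv"], "delimiter": "\t"},
    },
}

# Print the formats section to paste back into the plug-in definition.
print(json.dumps(add_txt_format(plugin)["formats"], indent=2))
```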
Test Drill Access to WASB
The Drill documentation notes that once I click the Enable button, the newly created Storage Plugin is immediately available across the cluster. While I can see that the Storage Plugin is registered in ZooKeeper - something I looked at while troubleshooting this step but otherwise not documented in these posts - I found that I needed to restart my Drill VMs in order for the plugin to be seen properly by Drill. I would recommend you do the same before attempting to access your WASB account. Otherwise, you may receive an error message indicating the WASB file system is unknown when you attempt to leverage your plugin.
To test access to my WASB storage, I use the Drill Web Console as before, but this time navigate to the Query page. Previously, I loaded a file named test.txt (download here) to the mycontainer container in the mydatafiles storage account. As this is the container registered with my Storage Plugin named mydatafiles, I can query this folder as follows:
SELECT columns[0] as Letter, columns[1] as Number FROM mydatafiles.`test.txt`;
When I submit the query, I am returned a result that looks like this, indicating that Drill successfully accessed my storage account, located the file and processed it per the formatting instructions associated with the txt file extension:
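The same test can also be run without the browser. The sketch below assumes the Drill REST API is reachable on the same port 8047 and accepts a POST to /query.json with a queryType and query in the body; the function names are hypothetical, and the host is the dr004 VM from this post. The trailing semicolon is left off the SQL as a precaution.

```python
import json
import urllib.request

# Host and port of the Drill Web Console used in this post.
DRILL_URL = "http://dr004.westus.cloudapp.azure.com:8047"

def build_query_request(sql, base_url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_query(sql, base_url=DRILL_URL):
    """Send the query and return the rows from the JSON response."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.load(resp)["rows"]

# Same query as above, indexing into the columns array of the text file.
TEST_SQL = (
    "SELECT columns[0] AS Letter, columns[1] AS Number "
    "FROM mydatafiles.`test.txt`"
)
```

Calling run_query(TEST_SQL) against a running cluster should return one dictionary per row of test.txt, keyed by Letter and Number.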