Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark
This tutorial shows you how to connect your Azure Databricks cluster to data stored in an Azure storage account that has Azure Data Lake Storage Gen2 enabled. This connection enables you to natively run queries and analytics from your cluster on your data.
In this tutorial, you will:
- Create a Databricks cluster
- Ingest unstructured data into a storage account
- Run analytics on your data in Blob storage
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Create an Azure Data Lake Storage Gen2 account.
See Create a storage account to use with Azure Data Lake Storage Gen2.
Make sure that your user account has the Storage Blob Data Contributor role assigned to it.
Install AzCopy v10. See Transfer data with AzCopy v10
Create a service principal. See How to: Use the portal to create an Azure AD application and service principal that can access resources.
There's a couple of specific things that you'll have to do as you perform the steps in that article.
✔️ When performing the steps in the Assign the application to a role section of the article, make sure to assign the Storage Blob Data Contributor role to the service principal.
Important
Make sure to assign the role in the scope of the Data Lake Storage Gen2 storage account. You can assign a role to the parent resource group or subscription, but you'll receive permissions-related errors until those role assignments propagate to the storage account.
✔️ When performing the steps in the Get values for signing in section of the article, paste the tenant ID, app ID, and client secret values into a text file. You'll need those soon.
Download the flight data
This tutorial uses flight data from the Bureau of Transportation Statistics to demonstrate how to perform an ETL operation. You must download this data to complete the tutorial.
Go to Research and Innovative Technology Administration, Bureau of Transportation Statistics.
Select the Prezipped File check box to select all data fields.
Select the Download button and save the results to your computer.
Unzip the contents of the zipped file and make a note of the file name and the path of the file. You need this information in a later step.
Create an Azure Databricks service
In this section, you create an Azure Databricks service by using the Azure portal.
In the Azure portal, select Create a resource > Analytics > Azure Databricks.

Under Azure Databricks Service, provide the following values to create a Databricks service:
Property Description Workspace name Provide a name for your Databricks workspace. Subscription From the drop-down, select your Azure subscription. Resource group Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview. Location Select West US 2. For other available regions, see Azure services available by region. Pricing Tier Select Standard. 
The account creation takes a few minutes. To monitor the operation status, view the progress bar at the top.
Select Pin to dashboard and then select Create.
Create a Spark cluster in Azure Databricks
In the Azure portal, go to the Databricks service that you created, and select Launch Workspace.
You're redirected to the Azure Databricks portal. From the portal, select Cluster.

In the New cluster page, provide the values to create a cluster.

Fill in values for the following fields, and accept the default values for the other fields:
Enter a name for the cluster.
Make sure you select the Terminate after 120 minutes of inactivity checkbox. Provide a duration (in minutes) to terminate the cluster, if the cluster is not being used.
Select Create cluster. After the cluster is running, you can attach notebooks to the cluster and run Spark jobs.
Ingest data
Copy source data into the storage account
Use AzCopy to copy data from your .csv file into your Data Lake Storage Gen2 account.
Open a command prompt window, and enter the following command to log into your storage account.
azcopy loginFollow the instructions that appear in the command prompt window to authenticate your user account.
To copy data from the .csv account, enter the following command.
azcopy cp "<csv-folder-path>" https://<storage-account-name>.dfs.core.windows.net/<container-name>/folder1/On_Time.csvReplace the
<csv-folder-path>placeholder value with the path to the .csv file.Replace the
<storage-account-name>placeholder value with the name of your storage account.Replace the
<container-name>placeholder with the name of a container in your storage account.
Create a container and mount it
In this section, you'll create a container and a folder in your storage account.
In the Azure portal, go to the Azure Databricks service that you created, and select Launch Workspace.
On the left, select Workspace. From the Workspace drop-down, select Create > Notebook.

In the Create Notebook dialog box, enter a name for the notebook. Select Python as the language, and then select the Spark cluster that you created earlier.
Select Create.
Copy and paste the following code block into the first cell, but don't run this code yet.
configs = {"fs.azure.account.auth.type": "OAuth", "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", "fs.azure.account.oauth2.client.id": "<appId>", "fs.azure.account.oauth2.client.secret": "<clientSecret>", "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token", "fs.azure.createRemoteFileSystemDuringInitialization": "true"} dbutils.fs.mount( source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1", mount_point = "/mnt/flightdata", extra_configs = configs)In this code block, replace the
appId,clientSecret,tenant, andstorage-account-nameplaceholder values in this code block with the values that you collected while completing the prerequisites of this tutorial. Replace thecontainer-nameplaceholder value with the name of the container.Press the SHIFT + ENTER keys to run the code in this block.
Keep this notebook open as you will add commands to it later.
Use Databricks Notebook to convert CSV to Parquet
In the notebook that you previously created, add a new cell, and paste the following code into that cell.
# Use the previously established DBFS mount point to read the data.
# create a data frame to read data.
flightDF = spark.read.format('csv').options(
header='true', inferschema='true').load("/mnt/flightdata/*.csv")
# read the airline csv file and write the output to parquet format for easy query.
flightDF.write.mode("append").parquet("/mnt/flightdata/parquet/flights")
print("Done")
Explore data
In a new cell, paste the following code to get a list of CSV files uploaded via AzCopy.
import os.path
import IPython
from pyspark.sql import SQLContext
display(dbutils.fs.ls("/mnt/flightdata"))
To create a new file and list files in the parquet/flights folder, run this script:
dbutils.fs.put("/mnt/flightdata/1.txt", "Hello, World!", True)
dbutils.fs.ls("/mnt/flightdata/parquet/flights")
With these code samples, you have explored the hierarchical nature of HDFS using data stored in a storage account with Data Lake Storage Gen2 enabled.
Query the data
Next, you can begin to query the data you uploaded into your storage account. Enter each of the following code blocks into Cmd 1 and press Cmd + Enter to run the Python script.
To create data frames for your data sources, run the following script:
- Replace the
<csv-folder-path>placeholder value with the path to the .csv file.
# Copy this into a Cmd cell in your notebook.
acDF = spark.read.format('csv').options(
header='true', inferschema='true').load("/mnt/flightdata/On_Time.csv")
acDF.write.parquet('/mnt/flightdata/parquet/airlinecodes')
# read the existing parquet file for the flights database that was created earlier
flightDF = spark.read.format('parquet').options(
header='true', inferschema='true').load("/mnt/flightdata/parquet/flights")
# print the schema of the dataframes
acDF.printSchema()
flightDF.printSchema()
# print the flight database size
print("Number of flights in the database: ", flightDF.count())
# show the first 20 rows (20 is the default)
# to show the first n rows, run: df.show(n)
acDF.show(100, False)
flightDF.show(20, False)
# Display to run visualizations
# preferably run this in a separate cmd cell
display(flightDF)
Enter this script to run some basic analysis queries against the data.
# Run each of these queries, preferably in a separate cmd cell for separate analysis
# create a temporary sql view for querying flight information
FlightTable = spark.read.parquet('/mnt/flightdata/parquet/flights')
FlightTable.createOrReplaceTempView('FlightTable')
# create a temporary sql view for querying airline code information
AirlineCodes = spark.read.parquet('/mnt/flightdata/parquet/airlinecodes')
AirlineCodes.createOrReplaceTempView('AirlineCodes')
# using spark sql, query the parquet file to return total flights in January and February 2016
out1 = spark.sql("SELECT * FROM FlightTable WHERE Month=1 and Year= 2016")
NumJan2016Flights = out1.count()
out2 = spark.sql("SELECT * FROM FlightTable WHERE Month=2 and Year= 2016")
NumFeb2016Flights = out2.count()
print("Jan 2016: ", NumJan2016Flights, " Feb 2016: ", NumFeb2016Flights)
Total = NumJan2016Flights+NumFeb2016Flights
print("Total flights combined: ", Total)
# List out all the airports in Texas
out = spark.sql(
"SELECT distinct(OriginCityName) FROM FlightTable where OriginStateName = 'Texas'")
print('Airports in Texas: ', out.show(100))
# find all airlines that fly from Texas
out1 = spark.sql(
"SELECT distinct(Reporting_Airline) FROM FlightTable WHERE OriginStateName='Texas'")
print('Airlines that fly to/from Texas: ', out1.show(100, False))
Clean up resources
When they're no longer needed, delete the resource group and all related resources. To do so, select the resource group for the storage account and select Delete.