Quickstart: Analyze data in Azure Data Lake Storage Gen2 by using Azure Databricks
This quickstart shows you how to run an Apache Spark job using Azure Databricks to perform analytics on data stored in a storage account that has Azure Data Lake Storage Gen2 enabled.
As part of the Spark job, you'll analyze a radio channel subscription data to gain insights into free/paid usage based on demographics.
If you don't have an Azure subscription, create a free account before you begin.
Create a Data Lake Gen2 storage account. See Quickstart: Create an Azure Data Lake Storage Gen2 storage account
Paste the name of the storage account into a text file. You'll need it soon.
Create a service principal. See How to: Use the portal to create an Azure AD application and service principal that can access resources.
There's a couple of specific things that you'll have to do as you perform the steps in that article.
✔️ When performing the steps in the Assign the application to a role section of the article, make sure to assign the Storage Blob Data Contributor role to the service principal.
Make sure to assign the role in the scope of the Data Lake Storage Gen2 storage account. You can assign a role to the parent resource group or subscription, but you'll receive permissions-related errors until those role assignments propagate to the storage account.
✔️ When performing the steps in the Get values for signing in section of the article, paste the tenant ID, app ID, and password values into a text file. You'll need those soon.
Create an Azure Databricks workspace
In this section, you create an Azure Databricks workspace using the Azure portal.
In the Azure portal, select Create a resource > Analytics > Azure Databricks.
Under Azure Databricks Service, provide the values to create a Databricks workspace.
Provide the following values:
Property Description Workspace name Provide a name for your Databricks workspace Subscription From the drop-down, select your Azure subscription. Resource group Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview. Location Select West US 2. Feel free to select another public region if you prefer. Pricing Tier Choose between Standard or Premium. For more information on these tiers, see Databricks pricing page.
The account creation takes a few minutes. To monitor the operation status, view the progress bar at the top.
Select Pin to dashboard and then select Create.
Create a Spark cluster in Databricks
In the Azure portal, go to the Databricks workspace that you created, and then select Launch Workspace.
You are redirected to the Azure Databricks portal. From the portal, select New > Cluster.
In the New cluster page, provide the values to create a cluster.
Fill in values for the following fields, and accept the default values for the other fields:
Enter a name for the cluster.
Make sure you select the Terminate after 120 minutes of inactivity checkbox. Provide a duration (in minutes) to terminate the cluster, if the cluster is not being used.
Select Create cluster. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.
For more information on creating clusters, see Create a Spark cluster in Azure Databricks.
Create storage account container
In this section, you create a notebook in Azure Databricks workspace and then run code snippets to configure the storage account.
In the Azure portal, go to the Azure Databricks workspace you created, and then select Launch Workspace.
In the left pane, select Workspace. From the Workspace drop-down, select Create > Notebook.
In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and then select the Spark cluster that you created earlier.
Copy and paste the following code block into the first cell, but don't run this code yet.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth") spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<appID>") spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<password>") spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token") spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true") dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/") spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
This code block directly accesses the Data Lake Gen2 endpoint by using OAuth, but there are other ways to connect the Databricks workspace to your Data Lake Storage Gen2 account. For example, you could mount the container by using OAuth or use a direct access with Shared Key.
To see examples of these approaches, see the Azure Data Lake Storage Gen2 article on the Azure Databricks Website.
In this code block, replace the
tenant-idplaceholder values in this code block with the values that you collected when you created the service principal. Set the
container-nameplaceholder value to whatever name you want to give the container.
In a production setting, consider storing your authentication key in Azure Databricks. Then, add a look up key to your code block instead of the authentication key. After you've completed this quickstart, see the Azure Data Lake Storage Gen2 article on the Azure Databricks Website to see examples of this approach.
Press the SHIFT + ENTER keys to run the code in this block.
Ingest sample data
Before you begin with this section, you must complete the following prerequisites:
Enter the following code into a notebook cell:
%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json
In the cell, press SHIFT + ENTER to run the code.
Now in a new cell below this one, enter the following code, and replace the values that appear in brackets with the same values you used earlier:
In the cell, press SHIFT + ENTER to run the code.
Run a Spark SQL Job
Perform the following tasks to run a Spark SQL job on the data.
Run a SQL statement to create a temporary table using data from the sample JSON data file, small_radio_json.json. In the following snippet, replace the placeholder values with your container name and storage account name. Using the notebook you created earlier, paste the snippet in a new code cell in the notebook, and then press SHIFT + ENTER.
%sql DROP TABLE IF EXISTS radio_sample_data; CREATE TABLE radio_sample_data USING json OPTIONS ( path "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json" )
Once the command successfully completes, you have all the data from the JSON file as a table in Databricks cluster.
%sqllanguage magic command enables you to run a SQL code from the notebook, even if the notebook is of another type. For more information, see Mixing languages in a notebook.
Let's look at a snapshot of the sample JSON data to better understand the query that you run. Paste the following snippet in the code cell and press SHIFT + ENTER.
%sql SELECT * from radio_sample_data
You see a tabular output like shown in the following screenshot (only some columns are shown):
Among other details, the sample data captures the gender of the audience of a radio channel (column name, gender) and whether their subscription is free or paid (column name, level).
You now create a visual representation of this data to show for each gender, how many users have free accounts and how many are paid subscribers. From the bottom of the tabular output, click the Bar chart icon, and then click Plot Options.
In Customize Plot, drag-and-drop values as shown in the screenshot.
- Set Keys to gender.
- Set Series groupings to level.
- Set Values to level.
- Set Aggregation to COUNT.
The output shows the visual representation as depicted in the following screenshot:
Clean up resources
Once you're finished with this article, you can terminate the cluster. From the Azure Databricks workspace, select Clusters and locate the cluster you want to terminate. Hover your mouse cursor over the ellipsis under Actions column, and select the Terminate icon.
If you do not manually terminate the cluster it automatically stops, provided you selected the Terminate after __ minutes of inactivity checkbox while creating the cluster. If you set this option the cluster stops after it has been inactive for the designated amount of time.
In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data in a storage account with Data Lake Storage Gen2 enabled. You can also look at Spark data sources to learn how to import data from other data sources into Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load data) using Azure Databricks.