Quickstart: Run a Spark job on Azure Databricks using the Azure portal
This quickstart shows how to create an Azure Databricks workspace and an Apache Spark cluster within that workspace. Finally, you learn how to run a Spark job on the Databricks cluster. For more information on Azure Databricks, see What is Azure Databricks?
In this quickstart, as part of the Spark job, you analyze Boston safety data to gain insights into different reporting methods.
If you don't have an Azure subscription, create a free account before you begin.
Sign in to the Azure portal
Sign in to the Azure portal.
Create an Azure Databricks workspace
In this section, you create an Azure Databricks workspace using the Azure portal.
In the Azure portal, select Create a resource > Analytics > Azure Databricks.
Under Azure Databricks Service, provide the values to create a Databricks workspace.
Provide the following values:
Property Description Workspace name Provide a name for your Databricks workspace Subscription From the drop-down, select your Azure subscription. Resource group Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview. Location Select West US 2. For other available regions, see Azure services available by region. Pricing Tier Choose between Standard, Premium, or Trial. For more information on these tiers, see Databricks pricing page.
Select Pin to dashboard and then click Create.
The workspace creation takes a few minutes. During workspace creation, you can view the deployment status in Notifications.
Create a Spark cluster in Databricks
To use a free account to create the Azure Databricks cluster, before creating the cluster, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account.
In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.
You are redirected to the Azure Databricks portal. From the portal, click New Cluster.
In the New cluster page, provide the values to create a cluster.
Accept all other default values other than the following:
Enter a name for the cluster.
For this article, create a cluster with 5.2 runtime.
Make sure you select the Terminate after __ minutes of inactivity checkbox. Provide a duration (in minutes) to terminate the cluster, if the cluster is not being used.
Select Create cluster. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.
For more information on creating clusters, see Create a Spark cluster in Azure Databricks.
Run a Spark SQL job
Perform the following tasks to create a notebook in Databricks, configure the notebook to read data from an Azure Open Datasets, and then run a Spark SQL job on the data.
In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.
In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster that you created earlier.
In this step, create a Spark DataFrame with Boston Safety Data from Azure Open Datasets, and use SQL to query the data.
The following command sets the Azure storage access information. Paste this PySpark code into the first cell and use Shift+Enter to run the code.
blob_account_name = "azureopendatastorage" blob_container_name = "citydatacontainer" blob_relative_path = "Safety/Release/city=Boston" blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"
The following command allows Spark to read from Blob storage remotely. Paste this PySpark code into the next cell and use Shift+Enter to run the code.
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path) spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token) print('Remote blob path: ' + wasbs_path)
The following command creates a DataFrame. Paste this PySpark code into the next cell and use Shift+Enter to run the code.
df = spark.read.parquet(wasbs_path) print('Register the DataFrame as a SQL temporary view: source') df.createOrReplaceTempView('source')
Run a SQL statement return the top 10 rows of data from the temporary view called source. Paste this PySpark code into the next cell and use Shift+Enter to run the code.
print('Displaying top 10 rows: ') display(spark.sql('SELECT * FROM source LIMIT 10'))
You see a tabular output like shown in the following screenshot (only some columns are shown):
You now create a visual representation of this data to show how many safety events are reported using the Citizens Connect App and City Worker App instead of other sources. From the bottom of the tabular output, select the Bar chart icon, and then click Plot Options.
In Customize Plot, drag-and-drop values as shown in the screenshot.
Set Keys to source.
Set Values to <\id>.
Set Aggregation to COUNT.
Set Display type to Pie chart.
Clean up resources
After you have finished the article, you can terminate the cluster. To do so, from the Azure Databricks workspace, from the left pane, select Clusters. For the cluster you want to terminate, move the cursor over the ellipsis under Actions column, and select the Terminate icon.
If you do not manually terminate the cluster it will automatically stop, provided you selected the Terminate after __ minutes of inactivity checkbox while creating the cluster. In such a case, the cluster automatically stops, if it has been inactive for the specified time.
In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data from Azure Open Datasets. You can also look at Spark data sources to learn how to import data from other data sources into Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load data) using Azure Databricks.