Run a Databricks notebook with the Databricks Notebook Activity in Azure Data Factory

In this tutorial, you use the Azure portal to create an Azure Data Factory pipeline that executes a Databricks notebook against a Databricks jobs cluster. It also passes Azure Data Factory parameters to the Databricks notebook during execution.

You perform the following steps in this tutorial:

  • Create a data factory.

  • Create a pipeline that uses a Databricks Notebook activity.

  • Trigger a pipeline run.

  • Monitor the pipeline run.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

  • Azure Databricks workspace. Create a Databricks workspace or use an existing one. You create a Python notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it using Azure Data Factory.

Create a data factory

  1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in Microsoft Edge and Google Chrome web browsers.

  2. Select New on the left menu, select Data + Analytics, and then select Data Factory.

    Create a new data factory

  3. In the New data factory pane, enter ADFTutorialDataFactory under Name.

    The name of the Azure data factory must be globally unique. If you see the following error, change the name of the data factory. (For example, use <yourname>ADFTutorialDataFactory). For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.

    Provide a name for the new data factory

  4. For Subscription, select your Azure subscription in which you want to create the data factory.

  5. For Resource Group, take one of the following steps:

    • Select Use existing and select an existing resource group from the drop-down list.

    • Select Create new and enter the name of a resource group.

    Some of the steps in this tutorial assume that you use the name ADFTutorialResourceGroup for the resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.

  6. For Version, select V2 (Preview).

  7. For Location, select the location for the data factory.

    Currently, Data Factory V2 allows you to create data factories only in the East US, East US 2, and West Europe regions. The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data Factory uses can be in other regions.

  8. Select Pin to dashboard.

  9. Select Create.

  10. On the dashboard, you see the tile for the data factory with the status Deploying Data Factory.

  11. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start the Data Factory UI application on a separate tab.

    Launch the data factory UI application
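
The portal steps above are all you need for this tutorial. If you would rather script this step, the following is a minimal sketch that uses the azure-mgmt-datafactory and azure-identity Python packages; these packages, the service principal credentials, and the placeholder names are assumptions, not part of the tutorial:

  # Minimal sketch: create the same data factory programmatically.
  # The subscription, tenant, client, and secret values are placeholders.
  from azure.identity import ClientSecretCredential
  from azure.mgmt.datafactory import DataFactoryManagementClient
  from azure.mgmt.datafactory.models import Factory

  subscription_id = "<your-subscription-id>"
  resource_group = "ADFTutorialResourceGroup"          # resource group from step 5
  factory_name = "<yourname>ADFTutorialDataFactory"    # globally unique name from step 3

  # Authenticate with a service principal
  credential = ClientSecretCredential(
      tenant_id="<tenant-id>",
      client_id="<client-id>",
      client_secret="<client-secret>",
  )
  adf_client = DataFactoryManagementClient(credential, subscription_id)

  # Create (or update) the data factory in one of the supported regions
  factory = adf_client.factories.create_or_update(
      resource_group, factory_name, Factory(location="eastus")
  )
  print(factory.provisioning_state)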

Create linked services

In this section, you author a Databricks linked service. This linked service contains the connection information to the Databricks cluster. An illustrative sketch of the underlying linked service definition follows the steps in this section.

Create an Azure Databricks linked service

  1. On the Let's get started page, switch to the Edit tab in the left panel.

    Edit the new linked service

  2. Select Connections at the bottom of the window, and then select + New.

    Create a new connection

  3. In the New Linked Service window, select Compute > Azure Databricks, and then select Continue.

    Specify a Databricks linked service

  4. In the New Linked Service window, complete the following steps:

    1. For Name, enter AzureDatabricks_LinkedService.

    2. For Cluster, select New Cluster.

    3. For Domain/Region, select the region where your Azure Databricks workspace is located.

    4. For Cluster node type, select Standard_D3_v2 for this tutorial.

    5. For Access Token, generate it from the Azure Databricks workspace. You can find the steps here.

    6. For Cluster version, select 4.0 Beta (latest version).

    7. For Number of worker nodes, enter 2.

    8. Select Finish.

      Finish creating the linked service
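
For reference, the linked service that the UI authors for you corresponds to a JSON definition. The following Python dict is an illustrative approximation of that definition; the workspace domain, access token, and cluster version values are placeholders, and in practice the token should be stored securely (for example, in Azure Key Vault) rather than inlined:

  # Illustrative approximation of the JSON behind AzureDatabricks_LinkedService.
  # Values in angle brackets are placeholders.
  databricks_linked_service = {
      "name": "AzureDatabricks_LinkedService",
      "properties": {
          "type": "AzureDatabricks",
          "typeProperties": {
              "domain": "https://<region>.azuredatabricks.net",   # Domain/Region from step 3
              "accessToken": {
                  "type": "SecureString",
                  "value": "<databricks-access-token>",           # token generated in step 5
              },
              "newClusterVersion": "<cluster-runtime-version>",   # cluster version from step 6
              "newClusterNumOfWorker": "2",                       # worker count from step 7
              "newClusterNodeType": "Standard_D3_v2",             # node type from step 4
          },
      },
  }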

Create a pipeline

  1. Select the + (plus) button, and then select Pipeline on the menu.

    Buttons for creating a new pipeline

  2. Create a parameter to be used in the pipeline. Later, you pass this parameter to the Databricks Notebook activity. In the empty pipeline, select the Parameters tab, select + New, and name the parameter 'name'.

    Create a new parameter

    Create the name parameter

  3. In the Activities toolbox, expand Databricks. Drag the Notebook activity from the Activities toolbox to the pipeline designer surface.

    Drag the notebook to the designer surface

  4. In the properties for the Databricks Notebook activity window at the bottom, complete the following steps:

    a. Switch to the Settings tab.

    b. Select AzureDatabricks_LinkedService (the linked service that you created in the previous procedure).

    c. Select a Databricks Notebook path. Let’s create a notebook and specify the path here. You get the Notebook Path by following the next few steps.

    1. Launch your Azure Databricks workspace.

    2. Create a new folder in the workspace and name it adftutorial.

      Create a new folder

    3. Create a new notebook (Python) named mynotebook under the adftutorial folder, and then click Create.

      Create a new notebook

      Set the properties of the new notebook

    4. In the newly created notebook mynotebook, add the following code:

      # Create a text widget named "input" to receive the parameter passed in from
      # Azure Data Factory, then read its value and print it.
      dbutils.widgets.text("input", "", "")
      y = dbutils.widgets.get("input")
      print("Param -'input':")
      print(y)
      

      Create widgets for parameters

    5. The Notebook Path in this case is /adftutorial/mynotebook.

  5. Switch back to the Data Factory UI authoring tool. Navigate to the Settings tab under the Notebook1 activity.

    a. Add a parameter to the Notebook activity. You use the same parameter that you added earlier to the pipeline.

    Add a parameter

    b. Name the parameter input and provide the expression @pipeline().parameters.name as the value. A sketch of the resulting pipeline definition appears after step 7.

  6. To validate the pipeline, select the Validate button on the toolbar. To close the validation window, select the >> (right arrow) button.

    Validate the pipeline

  7. Select Publish All. The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data Factory service.

    Publish the new data factory entities
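
For reference, the pipeline you just authored and published corresponds roughly to the JSON definition sketched below as a Python dict. This is illustrative only; the pipeline name "pipeline1" is the default assigned by the UI and may differ in your workspace:

  # Illustrative approximation of the published pipeline: a "name" parameter and
  # a Databricks Notebook activity that forwards it to the notebook's "input"
  # widget through baseParameters.
  pipeline_definition = {
      "name": "pipeline1",
      "properties": {
          "parameters": {
              "name": {"type": "String"}
          },
          "activities": [
              {
                  "name": "Notebook1",
                  "type": "DatabricksNotebook",
                  "linkedServiceName": {
                      "referenceName": "AzureDatabricks_LinkedService",
                      "type": "LinkedServiceReference",
                  },
                  "typeProperties": {
                      "notebookPath": "/adftutorial/mynotebook",
                      "baseParameters": {
                          "input": "@pipeline().parameters.name"
                      },
                  },
              }
          ],
      },
  }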

Trigger a pipeline run

Select Trigger on the toolbar, and then select Trigger Now.

Select the Trigger Now command

The Pipeline Run dialog box asks for a value for the name parameter. Use /path/filename as the value here. Select Finish.

Provide a value for the name parameter
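
If you script runs instead of using the UI, the following sketch shows the equivalent call with the azure-mgmt-datafactory client from the earlier sketch (the adf_client, resource_group, and factory_name variables are reused from there, and "pipeline1" is an assumed pipeline name):

  # Trigger the pipeline and pass the "name" parameter, mirroring the
  # Trigger Now dialog above.
  run_response = adf_client.pipelines.create_run(
      resource_group,
      factory_name,
      "pipeline1",
      parameters={"name": "/path/filename"},
  )
  print(run_response.run_id)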

Monitor the pipeline run

  1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to create a Databricks job cluster, where the notebook is executed.

    Monitor the pipeline

  2. Select Refresh periodically to check the status of the pipeline run.

  3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.

    View the activity runs

You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
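
Continuing the scripted sketch, you can also poll the run status with the run ID returned by create_run (again illustrative, not required for this tutorial):

  import time

  # Poll the pipeline run created above until it finishes. The status moves
  # through values such as Queued, InProgress, Succeeded, or Failed.
  while True:
      run = adf_client.pipeline_runs.get(resource_group, factory_name, run_response.run_id)
      print("Pipeline run status: " + run.status)
      if run.status not in ("Queued", "InProgress"):
          break
      time.sleep(30)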

Verify the output

You can log on to the Azure Databricks workspace, go to Clusters, and see the job status: pending execution, running, or terminated.

View the job cluster and the job

You can click the job name to see further details. On a successful run, you can validate the parameters that were passed and the output of the Python notebook.

View the run details and output

Next steps

The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned how to:

  • Create a data factory.

  • Create a pipeline that uses a Databricks Notebook activity.

  • Trigger a pipeline run.

  • Monitor the pipeline run.