Managing dependencies in data pipelines
Developing and deploying a data processing pipeline often requires managing complex dependencies between tasks. For example, a pipeline might read data from a source, clean the data, transform the cleaned data, and write the transformed data to a target. You also need to test, schedule, and troubleshoot data pipelines when you operationalize them.
Workflow systems address these challenges by allowing you to define dependencies between tasks, schedule when pipelines run, and monitor workflows. Databricks recommends using jobs with multiple tasks to manage your workflows without relying on an external system. Azure Databricks jobs provide task orchestration with standard authentication and access control methods. You can create and manage complex workflows using a familiar, user-friendly interface. You can define a job containing multiple tasks, where each task runs code such as a notebook or JAR, and control the execution order of tasks by specifying dependencies between them. You can configure a job's tasks to run in sequence or in parallel.
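For instance, a multi-task job of the kind described above can be expressed as a Jobs API 2.1 payload in which each task names the tasks it depends on. The sketch below is illustrative only: the task keys, notebook paths, and cluster ID are hypothetical placeholders, not values from this article.

```python
# Sketch of a Jobs API 2.1 job specification with task dependencies.
# Task keys, notebook paths, and the cluster ID below are hypothetical.
job_spec = {
    "name": "example-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Pipelines/ingest"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
        {
            "task_key": "clean",
            # depends_on forces this task to wait for ingest to succeed
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Pipelines/clean"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "clean"}],
            "notebook_task": {"notebook_path": "/Pipelines/transform"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
    ],
}

# Tasks with no dependency edge between them may run in parallel;
# depends_on entries force sequential execution.
upstream = {t["task_key"]: [d["task_key"] for d in t.get("depends_on", [])]
            for t in job_spec["tasks"]}
print(upstream)
```
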
Azure Databricks also supports workflow management with Azure Data Factory or Apache Airflow.
Azure Data Factory
Azure Data Factory is a cloud data integration service that lets you compose data storage, movement, and processing services into automated data pipelines. You can operationalize Databricks notebooks in Azure Data Factory data pipelines. See Run a Databricks notebook with the Databricks notebook activity in Azure Data Factory for instructions on how to create an Azure Data Factory pipeline that runs a Databricks notebook in an Azure Databricks cluster, followed by Transform data by running a Databricks notebook.
Apache Airflow
Apache Airflow is an open source solution for managing and scheduling data pipelines. Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations. You define a workflow in a Python file and Airflow manages the scheduling and execution.
The Airflow Azure Databricks integration provides tight coupling between Airflow and Azure Databricks, letting you take advantage of the optimized Spark engine offered by Azure Databricks together with the scheduling features of Airflow.
Requirements
- The integration between Airflow and Azure Databricks is available in Airflow version 1.9.0 and later. The examples in this article are tested with Airflow version 2.1.0.
- Airflow requires Python 3.6, 3.7, or 3.8. The examples in this article are tested with Python 3.8.
Install the Airflow Azure Databricks integration
To install the Airflow Azure Databricks integration, open a terminal and run the following commands:
mkdir airflow
cd airflow
pipenv --python 3.8
pipenv shell
export AIRFLOW_HOME=$(pwd)
pipenv install apache-airflow==2.1.0
pipenv install apache-airflow-providers-databricks
mkdir dags
airflow db init
airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email your@email.com
These commands:
- Create a directory named `airflow` and change into that directory.
- Use `pipenv` to create and spawn a Python virtual environment. Databricks recommends using a Python virtual environment to isolate package versions and code dependencies to that environment. This isolation helps reduce unexpected package version mismatches and code dependency collisions.
- Initialize an environment variable named `AIRFLOW_HOME` set to the path of the `airflow` directory.
- Install Airflow and the Airflow Databricks provider packages.
- Create an `airflow/dags` directory. Airflow uses the `dags` directory to store DAG definitions.
- Initialize a SQLite database that Airflow uses to track metadata. In a production Airflow deployment, you would configure Airflow with a standard database. The SQLite database and default configuration for your Airflow deployment are initialized in the `airflow` directory.
- Create an admin user for Airflow.
To install extras, for example celery and password, run:
pip install "apache-airflow[databricks,celery,password]"
Start the Airflow web server and scheduler
The Airflow web server is required to view the Airflow UI. To start the web server, open a terminal and run the following command:
airflow webserver
The scheduler is the Airflow component that schedules DAGs. To run it, open a new terminal and run the following command:
pipenv shell
export AIRFLOW_HOME=$(pwd)
airflow scheduler
Test the Airflow installation
To verify the Airflow installation, you can run one of the example DAGs included with Airflow:
- In a browser window, open http://localhost:8080/home. The Airflow DAGs screen appears.
- Click the Pause/Unpause DAG toggle to unpause one of the example DAGs, for example, `example_python_operator`.
- Trigger the example DAG by clicking the Start button.
- Click the DAG name to view details, including the run status of the DAG.
Run an Azure Databricks job from Airflow
The Airflow Azure Databricks integration provides two different operators for triggering jobs:
- The DatabricksRunNowOperator requires an existing Azure Databricks job and uses the Trigger a new job run (`POST /jobs/run-now`) API request to trigger a run. Databricks recommends using `DatabricksRunNowOperator` because it reduces duplication of job definitions, and job runs triggered with this operator are easy to find in the jobs UI.
- The DatabricksSubmitRunOperator does not require a job to exist in Azure Databricks and uses the Create and trigger a one-time run (`POST /jobs/runs/submit`) API request to submit the job specification and trigger a run.
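The difference between the two operators mirrors the request bodies of the two API endpoints they wrap. As a rough sketch (the job ID, cluster settings, and notebook path below are hypothetical placeholders):

```python
# DatabricksRunNowOperator wraps POST /jobs/run-now: it only needs the ID
# of a job that already exists in the workspace.
run_now_payload = {
    "job_id": 42,  # hypothetical existing job ID
    "notebook_params": {"greeting": "Airflow user"},
}

# DatabricksSubmitRunOperator wraps POST /jobs/runs/submit: the full run
# specification (cluster and task) travels with the request instead.
submit_run_payload = {
    "run_name": "one-time-greeting-run",
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",  # hypothetical runtime version
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
    "notebook_task": {"notebook_path": "/Users/someone@example.com/Hello Airflow"},
}
```

With the run-now form, the job definition lives in Azure Databricks and is shared by every run; with the submit form, each caller carries its own one-time run specification.
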
The Databricks Airflow operator writes the job run page URL to the Airflow logs every `polling_period_seconds` (the default is 30 seconds). For more information, see the apache-airflow-providers-databricks package page on the Airflow website.
Example
The following example demonstrates how to create a simple Airflow deployment that runs on your local machine and deploys an example DAG to trigger runs in Azure Databricks. For this example, you:
- Create a new notebook and add code to print a greeting based on a configured parameter.
- Create an Azure Databricks job with a single task that runs the notebook.
- Configure an Airflow connection to your Azure Databricks workspace.
- Create an Airflow DAG to trigger the notebook job. You define the DAG in a Python script using `DatabricksRunNowOperator`.
- Use the Airflow UI to trigger the DAG and view the run status.
Create a notebook
This example uses a notebook containing two cells:
- The first cell contains a Databricks Utilities text widget defining a variable named `greeting` set to the default value `world`.
- The second cell prints the value of the `greeting` variable prefixed by `hello`.
To create the notebook:
1. Go to your Azure Databricks landing page and select Create Blank Notebook, or click Create in the sidebar and select Notebook from the menu. The Create Notebook dialog appears.
2. In the Create Notebook dialog, give your notebook a name, such as Hello Airflow. Set Default Language to Python. Leave Cluster set to the default value. You will configure the cluster when you create a task that uses this notebook.
3. Click Create.
4. Copy the following Python code and paste it into the first cell of the notebook:

   dbutils.widgets.text("greeting", "world", "Greeting")
   greeting = dbutils.widgets.get("greeting")

5. Add a new cell below the first cell and copy and paste the following Python code into the new cell:
print("hello {}".format(greeting))
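The notebook's behavior can be summarized in plain Python: the widget supplies a default of `world`, and any value passed in at run time replaces it. (`render_greeting` is an illustrative helper for this sketch, not part of the notebook.)

```python
def render_greeting(greeting="world"):
    # Mirrors the notebook: the widget default is "world", and a job
    # parameter named "greeting" overrides it.
    return "hello {}".format(greeting)

print(render_greeting())                # prints "hello world"
print(render_greeting("Airflow user"))  # prints "hello Airflow user"
```

The job created in the next section passes `Airflow user` as the value of the `greeting` parameter, producing the second output.
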
Create a job
1. Click Workflows in the sidebar.
2. Click Create Job. The Tasks tab displays with the create task dialog.
3. Replace Add a name for your job… with your job name.
4. In the Task name field, enter a name for the task, for example, greeting-task.
5. In the Type drop-down, select Notebook.
6. Use the file browser to find the notebook you created, click the notebook name, and click Confirm.
7. Click Add under Parameters. In the Key field, enter `greeting`. In the Value field, enter `Airflow user`.
8. Click Create task.
Run the job
To run the job immediately, click Run Now in the upper right corner. You can also run the job by clicking the Runs tab and clicking Run Now in the Active Runs table.
View run details
Click the Runs tab and click View Details in the Active Runs table or the Completed Runs (past 60 days) table.
Copy the Job ID value. This value is required to trigger the job from Airflow.
Create an Azure Databricks personal access token
Note
As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. For more information, see Service principals for Azure Databricks automation.
Airflow connects to Databricks using an Azure Databricks personal access token (PAT). See personal access token for instructions on creating a PAT.
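Under the hood, a PAT is sent as an HTTP bearer token on every REST call the provider makes to the workspace. A minimal sketch of that header, using a placeholder token value:

```python
def auth_headers(token):
    # Databricks REST APIs authenticate a personal access token
    # as an HTTP bearer token.
    return {"Authorization": "Bearer {}".format(token)}

# Placeholder value; substitute a real PAT generated in your workspace.
headers = auth_headers("dapiXXXXXXXXXXXXXXXX")
print(headers["Authorization"])
```
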
Configure an Azure Databricks connection
Your Airflow installation contains a default connection for Azure Databricks. To update the connection to connect to your workspace using the personal access token you created above:
In a browser window, open http://localhost:8080/connection/list/.
Under Conn ID, locate databricks_default and click the Edit record button.
Replace the value in the Host field with the workspace instance name of your Azure Databricks deployment.
In the Extra field, enter the following value:
{"token": "PERSONAL_ACCESS_TOKEN"}

Replace `PERSONAL_ACCESS_TOKEN` with your Azure Databricks personal access token.
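The Extra field must contain valid JSON, or the connection will fail to authenticate. One quick way to sanity-check the value before pasting it (the token string here is the same placeholder as above):

```python
import json

# Build the Extra field value programmatically; substitute your real PAT
# for the placeholder before using it.
extra = json.dumps({"token": "PERSONAL_ACCESS_TOKEN"})
print(extra)

# Round-tripping confirms the string is well-formed JSON.
parsed = json.loads(extra)
```
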
Create a new DAG
You define an Airflow DAG in a Python file. To create a DAG to trigger the example notebook job:
In a text editor or IDE, create a new file named `databricks_dag.py` with the following contents:

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.dates import days_ago

default_args = {
  'owner': 'airflow'
}

with DAG('databricks_dag',
  start_date = days_ago(2),
  schedule_interval = None,
  default_args = default_args
  ) as dag:
  opr_run_now = DatabricksRunNowOperator(
    task_id = 'run_now',
    databricks_conn_id = 'databricks_default',
    job_id = JOB_ID
  )

Replace `JOB_ID` with the value of the job ID saved earlier.

Save the file in the `airflow/dags` directory. Airflow automatically reads and installs DAG files stored in `airflow/dags/`.
Install and verify the DAG in Airflow
To trigger and verify the DAG in the Airflow UI:
- In a browser window, open http://localhost:8080/home. The Airflow DAGs screen appears.
- Locate `databricks_dag` and click the Pause/Unpause DAG toggle to unpause the DAG.
- Trigger the DAG by clicking the Start button.
- Click a run in the Runs column to view the status and details of the run.