Modularize or link code in notebooks
This article describes how to use Databricks notebooks to code complex workflows that use modular code, linked or embedded notebooks, and if-then-else logic.
Ways to modularize or link notebooks
The %run command allows you to include another notebook within a notebook. You can use %run to modularize your code, for example by putting supporting functions in a separate notebook. You can also use it to concatenate notebooks that implement the steps in an analysis. When you use %run, the called notebook is immediately executed and the functions and variables defined in it become available in the calling notebook.
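For example, a minimal sketch, assuming a helper notebook named shared-functions stored alongside the calling notebook (the name and path are hypothetical); %run must be in a cell by itself:

%run ./shared-functions

After this cell runs, the functions and variables defined in shared-functions can be used directly in the calling notebook.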
Notebook workflows are a complement to %run because they let you pass parameters to and return values from a notebook. This allows you to build complex workflows and pipelines with dependencies. For example, you can get a list of files in a directory and pass the names to another notebook, which is not possible with %run. You can also create if-then-else workflows based on return values or call other notebooks using relative paths.
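As a minimal sketch of that file-list scenario (the directory path and the ProcessFilesNotebook name are hypothetical), the caller can serialize the file names into a single string argument, since parameters must be strings:

import json

# List the files in a directory and pass their names to another notebook
# as one JSON-encoded widget argument.
file_names = [f.name for f in dbutils.fs.ls("/tmp/input")]
dbutils.notebook.run("ProcessFilesNotebook", 60, {"files": json.dumps(file_names)})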
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
These methods, like all of the dbutils APIs, are available only in Python and Scala. However, you can use dbutils.notebook.run() to invoke an R notebook.
Warning
Jobs based on notebook workflows must complete in 30 days or less. Longer-running jobs based on modularized or linked notebook tasks aren’t supported.
API
The methods available in the dbutils.notebook API to build notebook workflows are: run and exit. Both parameters and return values must be strings.
run(path: String, timeout_seconds: int, arguments: Map): String
Run a notebook and return its exit value. The method starts an ephemeral job that runs immediately.
The timeout_seconds parameter controls the timeout of the run (0 means no timeout): the call to run throws an exception if it doesn’t finish within the specified time. If Azure Databricks is down for more than 10 minutes, the notebook run fails regardless of timeout_seconds.
The arguments parameter sets widget values of the target notebook. Specifically, if the notebook you are running has a widget named A, and you pass a key-value pair ("A": "B") as part of the arguments parameter to the run() call, then retrieving the value of widget A will return "B". You can find the instructions for creating and working with widgets in the Databricks widgets article.
Warning
The arguments parameter accepts only Latin characters (ASCII character set). Using non-ASCII characters returns an error. Examples of invalid, non-ASCII characters are Chinese characters, Japanese kanji, and emojis.
run Usage
Python
dbutils.notebook.run("notebook-name", 60, {"argument": "data", "argument2": "data2", ...})
Scala
dbutils.notebook.run("notebook-name", 60, Map("argument" -> "data", "argument2" -> "data2", ...))
run Example
Suppose you have a notebook named workflows with a widget named foo that prints the widget’s value:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
print(dbutils.widgets.get("foo"))
Running dbutils.notebook.run("workflows", 60, {"foo": "bar"}) produces the following result:
The widget had the value you passed in through the workflow, "bar", rather than the default.
exit(value: String): void
Exit a notebook with a value. If you call a notebook using the run method, this is the value returned.
dbutils.notebook.exit("returnValue")
Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the job to fail, throw an exception.
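For example, a minimal sketch of failing a run on purpose (the message text is illustrative):

# Any uncaught exception marks the notebook run, and therefore the job task, as failed.
raise Exception("Input validation failed; failing the job")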
Example
In the following example, you pass arguments to DataImportNotebook and run different notebooks (DataCleaningNotebook or ErrorHandlingNotebook) based on the result from DataImportNotebook.
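A minimal Python sketch of this pattern, assuming the three notebooks sit next to the caller and that DataImportNotebook exits with a status string (the argument names and values are illustrative):

# Run the import step and branch on its exit value.
status = dbutils.notebook.run("DataImportNotebook", 3600, {"source": "/tmp/raw-data"})

if status == "OK":
  dbutils.notebook.run("DataCleaningNotebook", 3600, {"table": "my_data"})
else:
  dbutils.notebook.run("ErrorHandlingNotebook", 3600, {"status": status})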
When the notebook workflow runs, a link to the running notebook appears. Click the notebook link Notebook job #xxxx to view the details of the run.
Pass structured data
This section illustrates how to pass structured data between notebooks.
Python
# Example 1 - returning data through temporary views.
# You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can
# return a name referencing data stored in a temporary view.
## In callee notebook
spark.range(5).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")
## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
# Example 2 - returning data through DBFS.
# For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.
## In callee notebook
dbutils.fs.rm("/tmp/results/my_data", recurse=True)
spark.range(5).toDF("value").write.format("parquet").save("dbfs:/tmp/results/my_data")
dbutils.notebook.exit("dbfs:/tmp/results/my_data")
## In caller notebook
returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
display(spark.read.format("parquet").load(returned_table))
# Example 3 - returning JSON data.
# To return multiple values, you can use standard JSON libraries to serialize and deserialize results.
## In callee notebook
import json
dbutils.notebook.exit(json.dumps({
  "status": "OK",
  "table": "my_data"
}))
## In caller notebook
result = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
print(json.loads(result))
Scala
// Example 1 - returning data through temporary views.
// You can only return one string using dbutils.notebook.exit(), but since called notebooks reside in the same JVM, you can
// return a name referencing data stored in a temporary view.
/** In callee notebook */
sc.parallelize(1 to 5).toDF().createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")
/** In caller notebook */
val returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
val global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
// Example 2 - returning data through DBFS.
// For larger datasets, you can write the results to DBFS and then return the DBFS path of the stored data.
/** In callee notebook */
dbutils.fs.rm("/tmp/results/my_data", recurse=true)
sc.parallelize(1 to 5).toDF().write.format("parquet").save("dbfs:/tmp/results/my_data")
dbutils.notebook.exit("dbfs:/tmp/results/my_data")
/** In caller notebook */
val returned_table = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
display(sqlContext.read.format("parquet").load(returned_table))
// Example 3 - returning JSON data.
// To return multiple values, you can use standard JSON libraries to serialize and deserialize results.
/** In callee notebook */
// Import jackson json libraries
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
// Create a json serializer
val jsonMapper = new ObjectMapper with ScalaObjectMapper
jsonMapper.registerModule(DefaultScalaModule)
// Exit with json
dbutils.notebook.exit(jsonMapper.writeValueAsString(Map("status" -> "OK", "table" -> "my_data")))
/** In caller notebook */
val result = dbutils.notebook.run("LOCATION_OF_CALLEE_NOTEBOOK", 60)
println(jsonMapper.readValue[Map[String, String]](result))
Handle errors
This section illustrates how to handle errors in notebook workflows.
Python
# Errors in workflows throw a WorkflowException.
def run_with_retry(notebook, timeout, args = {}, max_retries = 3):
  num_retries = 0
  while True:
    try:
      return dbutils.notebook.run(notebook, timeout, args)
    except Exception as e:
      if num_retries > max_retries:
        raise e
      else:
        print("Retrying error", e)
        num_retries += 1

run_with_retry("LOCATION_OF_CALLEE_NOTEBOOK", 60, max_retries = 5)
Scala
// Errors in workflows throw a WorkflowException.
import com.databricks.WorkflowException
// Since dbutils.notebook.run() is just a function call, you can retry failures using standard Scala try-catch
// control flow. Here we show an example of retrying a notebook a number of times.
def runRetry(notebook: String, timeout: Int, args: Map[String, String] = Map.empty, maxTries: Int = 3): String = {
  var numTries = 0
  while (true) {
    try {
      return dbutils.notebook.run(notebook, timeout, args)
    } catch {
      case e: WorkflowException if numTries < maxTries =>
        println("Error, retrying: " + e)
    }
    numTries += 1
  }
  "" // not reached
}
runRetry("LOCATION_OF_CALLEE_NOTEBOOK", timeout = 60, maxTries = 5)
Run multiple notebooks concurrently
You can run multiple notebooks at the same time by using standard Scala and Python constructs such as Threads (Scala, Python) and Futures (Scala, Python). The advanced notebook workflow notebooks demonstrate how to use these constructs. The notebooks are in Scala, but you could easily write the equivalent in Python; a minimal Python sketch appears after the steps below. To run the example:
- Download the notebook archive.
- Import the archive into a workspace.
- Run the Concurrent Notebooks notebook.
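For reference, a minimal Python sketch of the same idea using a thread pool (the notebook names and arguments are illustrative); each dbutils.notebook.run call starts its own ephemeral notebook job:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical notebooks to run in parallel, each with its own arguments.
notebooks = [
  ("NotebookA", {"param": "1"}),
  ("NotebookB", {"param": "2"}),
]

with ThreadPoolExecutor(max_workers=len(notebooks)) as executor:
  # Submit one run per notebook and collect the exit values.
  futures = [executor.submit(dbutils.notebook.run, path, 3600, args) for path, args in notebooks]
  results = [f.result() for f in futures]

print(results)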