Jobs

Note

This article describes how to create, run, and manage single-task jobs using the generally-available jobs interface. For information about Public Preview features that support orchestration of jobs with multiple tasks, see Jobs with multiple tasks.

A job is a non-interactive way to run an application in an Azure Databricks cluster, for example, an ETL job or data analysis task you want to run immediately or on a scheduled basis. You can also run jobs interactively in the notebook UI.

You can create and run a job using the UI, the CLI, or by invoking the Jobs API. You can monitor job run results in the UI, with the CLI, by querying the API, and through email alerts. This article focuses on performing job tasks using the UI. For the other methods, see Jobs CLI and Jobs API.

Important

  • You can create jobs only in a Data Science & Engineering workspace.
  • A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests response is returned when you request a run that cannot be started immediately.
  • The number of jobs a workspace can create in an hour is limited to 5000 (includes “run now” and “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.

View jobs

Click Jobs Icon Jobs in the sidebar. The Jobs list displays, showing all defined jobs, the cluster definition, the schedule (if any), and the result of the last run.

You can filter jobs in the Jobs list:

  • Using keywords.
  • Selecting only the jobs you own.
  • Selecting all jobs you have permissions to access. Access to this filter requires that Jobs access control is enabled.

You can also click any column header to sort the list of jobs (either descending or ascending) by that column. The default sorting is by job name in ascending order.

Azure workspace jobs list

Create a job

  1. Do one of the following:

    • From the Jobs list, click + Create Job.
    • In the sidebar, click the Create button and select Job from the menu.

    The job detail page displays.

    Job detail

  2. Enter a name for the job in the text field with the placeholder text Job name.

  3. Use Run Type to select whether to run your job manually or automatically on a schedule. Select Manual / Paused to run your job only when manually triggered, or Scheduled to define a schedule for running the job. See Schedule a job.

  4. Specify the type of task to run. In the Type drop-down, select Notebook, JAR, or Spark Submit.

    • Notebook

      1. Click Select Notebook.
      2. Use the file browser to find the notebook and click the notebook name to highlight it.
      3. Click Confirm.
    • JAR

      1. Specify the Main class. Use the full class name of the class containing the main method.
      2. Specify one or more Dependent Libraries. One of these libraries must contain the main class.

      To learn more about JAR tasks, see JAR job tips.

    • Spark Submit

      1. In the Parameters text box, specify the main class, the path to the library JAR, and all arguments, formatted as a JSON array of strings.

      Important

      There are several limitations for spark-submit tasks:

      • You can run spark-submit tasks only on new clusters.
      • Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.
      • Spark-submit does not support Databricks Utilities. To use Databricks Utilities, use JAR tasks instead.
      • For more information on which parameters may be passed to a spark-submit task, see SparkSubmitTask.
  5. Configure the cluster to run the task. In the Cluster drop-down, select either New Job Cluster or Existing All-Purpose Cluster.

    • New Job Cluster: Click Edit in the Cluster drop-down and complete the cluster configuration.
    • Existing All-Purpose Cluster: Select an existing cluster in the Cluster drop-down. To open the cluster in a new page, click the External Link icon to the right of the cluster name and description.

    To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.

  6. You can pass Parameters to all task types, but how you format and pass the parameters depends on the task type.

    • Notebook tasks: Click Add and specify the key and value of each parameter to pass to the task. You can override or add additional parameters when manually running a task with Run Now with Additional Parameters. Parameters set the value of the notebook widget specified by the key of the parameter. Use Task parameter variables to pass a limited set of dynamic values as part of a parameter value.
    • JAR tasks: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments to the main method of the main class. See Configure JAR job parameters.
    • Spark Submit tasks: Specify parameters as a JSON-formatted array of strings. Following the Apache Spark spark-submit convention, parameters after the JAR path are passed to the main method of the main class. For an example of this format, see the sketch after these steps.
  7. To optionally allow multiple concurrent runs of the same job, enter a new value for Maximum Concurrent Runs. See Maximum concurrent runs.

  8. Optionally specify email addresses to receive Email Alerts on job events. See Alerts.

  9. To access additional options including Dependent Libraries, Retry Policy, and Timeouts, click the Jobs Vertical Ellipsis icon in the upper-right corner of the task card. See Task configuration options.

  10. When you’re done configuring the job, click the Create button to create the new job.
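
As a sketch of the JSON-formatted array of strings used for JAR and spark-submit parameters, the Parameters text box for a spark-submit task might contain the following, where the class name, JAR path, and arguments are hypothetical placeholders:

["--class", "org.example.etl.Main", "dbfs:/FileStore/jars/etl-app.jar", "2021-01-01", "full"]

Everything after the JAR path ("2021-01-01" and "full" in this example) is passed to the main method of the main class. For a JAR task, the array contains only the arguments for the main method, because the main class and JAR are configured separately.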

Run a job

Select a job and click the Runs tab. You can run a job immediately or schedule the job to run later.

Run a job immediately

To run the job immediately, click the Run Now button next to the Job Name.

Run now

Tip

You can perform a test run of a job with a notebook task by clicking Run Now. If you need to make changes to the notebook, clicking Run Now again after editing the notebook will automatically run the new version of the notebook.

Run a job with different parameters

You can use Run Now with Different Parameters to re-run a job with different parameters or different values for existing parameters.

  1. Click Blue Down Caret next to Run Now and select Run Now with Different Parameters or, in the Active runs table, click Run Now with Different Parameters. Enter the new parameters depending on the type of task.

    • Notebook: You can enter parameters as key-value pairs or a JSON object. You can use this dialog to set the values of widgets (see the example after these steps):

      Run notebook with parameters

    • JAR and spark-submit: You can enter a list of parameters or a JSON document. The provided parameters are merged with the default parameters for the triggered run. If you delete keys, the default parameters are used. You can also add Task parameter variables for the run.

      Set spark-submit parameters

  2. Click Run.
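
For example, a notebook task whose notebook defines widgets named date and table (hypothetical names used here for illustration) could take its parameters as the following JSON object:

{
  "date": "2021-01-01",
  "table": "events"
}

Inside the notebook, each value is available from the widget with the matching key, for example dbutils.widgets.get("date").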

Schedule a job

To define a schedule for the job:

  1. Click the Configuration tab. Set the Run Type to Scheduled.

    Edit schedule

  2. Specify the period, starting time, and time zone. Optionally select the Show Cron Syntax checkbox to display and edit the schedule in Quartz Cron Syntax (see the example after these steps).

    Note

    • Azure Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the schedule of a job regardless of the seconds configuration in the cron expression.
    • You can choose a time zone that observes daylight saving time or a UTC time. If you select a zone that observes daylight saving time, an hourly job will be skipped or may appear to not fire for an hour or two when daylight saving time begins or ends. To run at every hour (absolute time), choose a UTC time.
    • The job scheduler, like the Spark batch interface, is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will run immediately upon service availability.
  3. Click Save.
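
For example, assuming you want the job to run at 06:00 every day in the selected time zone, the Quartz cron expression would be the following (the fields are seconds, minutes, hours, day of month, month, and day of week):

0 0 6 * * ?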

Pause and resume a job schedule

To pause a job schedule, click the Configuration tab and set the Run Type to Manual / Paused.

Job scheduled

To resume a paused job schedule, set the Run Type to Scheduled.

View job runs

On the Jobs page, click a job name in the Name column. The Runs tab shows active runs and completed runs.

Job details

You can view the standard error, standard output and log4j output for a job run by clicking the Logs link in the Spark column.

Azure Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, Databricks recommends that you export results before they expire. For more information, see Export job run results.

View job run details

The job run details page contains job output and links to logs. You can access job run details from the Jobs page or the Compute page.

Job run details

To view job run details from the Jobs page, click Jobs Icon Jobs. Click the link for the run in the Run column of the Completed in past 60 days table.

Job run from Jobs

To view job run details from the Compute page, click compute icon Compute. Click the Job Run link for the selected job in the Job Clusters table.

Job run from Clusters

Export job run results

You can export notebook run results and job run logs for all job types.

Export notebook run results

You can persist job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported into your Azure Databricks workspace.

  1. In the job detail page, click a job run name in the Run column.

    Job run

  2. Click Export to HTML.

    Export run result

Export job run logs

You can also export the logs for your job run. To automate this process, you can set up your job to automatically deliver logs to DBFS through the Job API. For more information, see the NewCluster and ClusterLogConf fields in the Job Create API call.
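
As a sketch, the new_cluster specification in a Jobs API Create request might include a cluster_log_conf block like the following to deliver driver and executor logs to DBFS; the destination path and other field values are hypothetical examples:

"new_cluster": {
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}

Logs are written under the destination path in a folder named after the cluster ID.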

Edit a job

To change a job’s configuration, click the job name link in the Jobs list, then click the Configuration tab.

Clone a job

You can quickly create a new job by cloning the configuration of an existing job. Cloning a job creates an identical copy of the job, except for the job ID. On the job’s page, click More … next to the job’s name and select Clone from the dropdown menu.

Delete a job

On the job’s page, click More … next to the job’s name and select Delete from the dropdown menu.

Job configuration options

Maximum concurrent runs

The maximum number of runs that can be run in parallel. On starting a new run, Azure Databricks skips the run if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs that differ by their input parameters.

Alerts

Email alerts are sent in case of job failure, success, or timeout. You can set up alerts for job start, job success, and job failure (including skipped jobs), providing multiple comma-separated email addresses for each alert type. You can also opt out of alerts for skipped job runs.

Configure email alerts

You can integrate these email alerts with your favorite notification tools.

Control access to jobs

Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. With job access control, job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage runs of their job (Run Now and Cancel Run permissions).

See Jobs access control for details.

Task configuration options

Dependent libraries

Dependent libraries are installed on the cluster before the task runs. You must specify all task library dependencies to ensure they are installed before the run starts.

To add a dependent library, click the Jobs Vertical Ellipsis icon in the upper-right corner of the task card. Click Add Dependent Libraries to open the Add Dependent Library chooser. Follow the recommendations in Library dependencies for specifying dependencies.

Important

If you have configured a library to automatically install on all clusters or you select an existing terminated cluster that has libraries installed, the job execution does not wait for library installation to complete. If a job requires a specific library, you should attach the library to the job in the Dependent Libraries field.

Task parameter variables

You can pass templated variables into a job task as part of the task’s parameters. These variables are replaced with the appropriate values when the job task runs. You can use task parameter variables to pass context about a job run, such as the run ID or the job’s start time.

When a job runs, a task parameter variable surrounded by double curly braces is replaced with its value; you can combine the variable with literal text as part of the parameter value. For example, to pass a parameter named MyJobId with a value of my-job-6 for any run of job ID 6, add the following task parameter:

{
  "MyJobID": "my-job-{{job_id}}"
}

The contents of the double curly braces are not evaluated as expressions, so you cannot perform operations or call functions inside them. Whitespace is not stripped inside the curly braces, so {{ job_id }} will not be evaluated.

The following task parameter variables are supported:

  • {{job_id}}: The unique identifier assigned to a job.
  • {{run_id}}: The unique identifier assigned to a job run.
  • {{start_date}}: The date a job run started. The format is yyyy-MM-dd in UTC timezone.
  • {{start_time}}: The timestamp of the run’s start of execution after the cluster is created and ready. The format is milliseconds since UNIX epoch in UTC timezone, as returned by System.currentTimeMillis().
  • {{task_retry_count}}: The number of retries that have been attempted to run a task if the first attempt fails. The value is 0 for the first attempt and increments with each retry.

You can set these variables with any task when you Create a job, Edit a job, or Run a job with different parameters.
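
For example, if a notebook task defines the MyJobId parameter shown above, the notebook can read the resolved value through a widget. The following is a minimal Scala sketch of a notebook cell:

// Define a widget with a default so the notebook also runs interactively.
// When the job runs, the MyJobId task parameter overrides the default and
// {{job_id}} has already been resolved to the actual job ID.
dbutils.widgets.text("MyJobId", "")
val myJobId = dbutils.widgets.get("MyJobId")   // e.g. "my-job-6" for job ID 6
println(s"Running as $myJobId")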

Timeout

The maximum completion time for a job. If the job does not complete in this time, Azure Databricks sets its status to “Timed Out”.

Retries

A policy that determines when and how many times failed runs are retried.

Retry policy

Note

If you configure both Timeout and Retries, the timeout applies to each retry.

Cluster configuration tips

Cluster configuration is an important part of moving a job to production. The following provides general guidance on choosing and configuring job clusters, followed by recommendations for specific job types.

Choose the correct cluster type for your job

  • New Job Clusters are dedicated clusters that are created and started when you run a job and terminated immediately after the job completes. They are ideal for production-level jobs or jobs that are important to complete, because they provide a fully isolated environment.
  • When you run a job on a new job cluster, the job is treated as a data engineering (job) workload subject to the job workload pricing. When you run a job on an existing all-purpose cluster, the job is treated as a data analytics (all-purpose) workload subject to all-purpose workload pricing.
  • If you select a terminated existing cluster and the job owner has Can Restart permission, Azure Databricks starts the cluster when the job is scheduled to run.
  • Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.

Use a pool to reduce cluster start times

To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.

Notebook job tips

Total notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit. Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if the output of an individual cell is larger than 8MB, the run is canceled and marked as failed. If you need help finding cells that are near or beyond the limit, run the notebook against an all-purpose cluster and use this notebook autosave technique.

JAR job tips

When running a JAR job, keep in mind the following:

Output size limits

Note

Available in Databricks Runtime 6.3 and above.

Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run is canceled and marked as failed.

To avoid encountering this limit, you can prevent stdout from being returned from the driver to Azure Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true. By default the flag value is false. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster’s log files. Setting this flag is recommended only for job clusters for JAR jobs, because it will disable notebook results.
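
For example, assuming the JAR job runs on a new job cluster, you could set the flag in the cluster’s Spark configuration, either in the Spark Config field of the cluster configuration UI or, as sketched below, in the spark_conf field of the new_cluster definition when using the Jobs API:

"spark_conf": {
  "spark.databricks.driver.disableScalaOutput": "true"
}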

Use the shared SparkContext

Because Azure Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Azure Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail. To get the SparkContext, use only the shared SparkContext created by Azure Databricks:

val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()

There are also several methods you should avoid when using the shared SparkContext.

  • Do not call SparkContext.stop().
  • Do not call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Use try-finally blocks for job clean up

Consider a JAR that consists of two parts:

  • jobBody(), which contains the main part of the job.
  • jobCleanup(), which has to be executed after jobBody(), whether that function succeeded or threw an exception.

As an example, jobBody() may create tables, and you can use jobCleanup() to drop these tables.

The safe way to ensure that the cleanup method is called is to put a try-finally block in the code:

try {
  jobBody()
} finally {
  jobCleanup()
}

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Due to the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run reliably.

Configure JAR job parameters

You pass parameters to JAR jobs with a JSON string array. For more information, see SparkJarTask. To access these parameters, inspect the String array passed into your main function.
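
For example, given the parameters ["--env", "prod", "2021-01-01"] (hypothetical values), a minimal main method that reads them might look like the following sketch:

object EtlJob {
  def main(args: Array[String]): Unit = {
    // args holds exactly the strings from the job's JSON parameter array,
    // e.g. Array("--env", "prod", "2021-01-01") for the parameters above.
    val env  = args(1)   // "prod"
    val date = args(2)   // "2021-01-01"
    println(s"Running ETL for date $date in environment $env")
  }
}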

Library dependencies

The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.

To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).

%sh
ls /databricks/jars

Manage library dependencies

A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies as shown in the following example.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and Hadoop as provided dependencies as shown in the following example.

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.hadoop" %% "hadoop-core" % "1.2.1" % "provided"

Tip

Specify the correct Scala version for your dependencies based on the version you are running.