Migrate Production Workloads to Azure Databricks

This guide explains how to move your production jobs from Apache Spark on other platforms to Apache Spark on Azure Databricks.

Concepts

Databricks job

A single unit of code that you can bundle and submit to Azure Databricks. An Azure Databricks job is equivalent to a Spark application with a single SparkContext. The entry point can be in a library (for example, JAR, egg, wheel) or a notebook. You can run Azure Databricks jobs on a schedule with sophisticated retries and alerting mechanisms. The primary interfaces for running jobs are the Jobs API and UI.

Pool

A set of instances in your account that are managed by Azure Databricks but incur no Azure Databricks charges when they are idle. Submitting multiple jobs on a pool ensures your jobs start quickly. You can set guardrails (instance types, instance limits, and so on) and autoscaling policies for the pool of instances. A pool is equivalent to an autoscaling cluster on other Spark platforms.

Migration steps

This section provides the steps for moving your production jobs to Azure Databricks.

Step 1: Create a pool

Create an autoscaling pool. This is equivalent to creating an autoscaling cluster in other Spark platforms. On other platforms, if instances in the autoscaling cluster are idle for a few minutes or hours, you pay for them. Azure Databricks manages the instance pool for you for free. That is, you don’t pay Azure Databricks if these machines are not in use; you pay only the cloud provider. Azure Databricks charges only when jobs are run on the instances.

Key configurations:

  • Min Idle: Number of standby instances, not in use by jobs, that the pool maintains. You can set this to 0.
  • Max Capacity: This is an optional field. If you already have cloud provider instance limits set, you can leave this field empty. If you want to set additional max limits, set a high value so that a large number of jobs can share the pool.
  • Idle Instance Auto Termination: Instances over Min Idle are released back to the cloud provider if they are idle for the specified period. The higher the value, the longer instances stay warm in the pool, and the faster your jobs start.
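
These settings map directly to fields in the Instance Pools API. The following is a minimal sketch using the Databricks CLI; the pool name, node type, and values are illustrative, not recommendations:

    databricks instance-pools create --json '{
      "instance_pool_name": "jobs-pool",
      "node_type_id": "Standard_DS3_v2",
      "min_idle_instances": 0,
      "max_capacity": 100,
      "idle_instance_autotermination_minutes": 60
    }'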

Step 2: Run a job on a pool

You can run a job on a pool using the Jobs API or the UI. You must provide a cluster spec for each job. When a job is about to start, Azure Databricks automatically creates a new cluster from the pool. The cluster is automatically terminated when the job finishes. You are charged only for the time your job runs. This is the most cost-effective way to run jobs on Azure Databricks. Each new cluster has:

  • One associated SparkContext, which is equivalent to a Spark application on other Spark platforms.
  • A driver node and a specified number of workers. For a single job, you can specify a worker range, as shown in the example after this list. Azure Databricks autoscales a single Spark job based on the resources needed for that job. Azure Databricks benchmarks show that this can save you up to 30% on cloud costs, depending on the nature of your job.
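
For example, instead of a fixed num_workers, the new_cluster spec can request a worker range with the autoscale option. This is a minimal sketch; the pool ID placeholder and the bounds are illustrative:

    "new_cluster": {
      "spark_version": "5.0.x-scala2.11",
      "instance_pool_id": "<pool-id>",
      "autoscale": {
        "min_workers": 2,
        "max_workers": 10
      }
    }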

API / CLI

  1. Download and configure the Databricks CLI.

  2. Run the following command to submit your code one time. The API returns a URL that you can use to track the progress of the job run.

    databricks runs submit --json '{
      "run_name": "my spark job",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "timeout_seconds": 3600,
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }'
    
  3. To schedule a job, use the following example. Jobs created through this mechanism are displayed in the jobs list page. The return value is a job_id that you can use to look at the status of all the runs.

    databricks jobs create --json '{
      "name": "Nightly model training",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        ...
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "email_notifications": {
        "on_start": ["john@foo.com"],
        "on_success": ["sally@foo.com"],
        "on_failure": ["bob@foo.com"]
      },
      "timeout_seconds": 3600,
      "max_retries": 2,
      "schedule": {
        "quartz_cron_expression": "0 15 22 ? * *",
        "timezone_id": "America/Los_Angeles"
      },
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }'
    

If you use spark-submit to submit Spark jobs, the following list shows how spark-submit parameters map to arguments in the Jobs Create API.

  • --class: Use the Spark JAR task to provide the main class name and the parameters.
  • --jars: Use the libraries argument to provide the list of dependencies.
  • --py-files: For Python jobs, use the Spark Python task. You can use the libraries argument to provide egg or wheel dependencies.
  • --master: In the cloud, you don’t need to manage a long-running master node. All instances and jobs are managed by Azure Databricks services. Ignore this parameter.
  • --deploy-mode: Ignore this parameter on Azure Databricks.
  • --conf: In the NewCluster spec, use the spark_conf argument.
  • --num-executors: In the NewCluster spec, use the num_workers argument. You can also use the autoscale option to provide a range (recommended).
  • --driver-memory, --driver-cores: Choose an instance type with the driver memory and cores you need. You provide the instance type for the driver during pool creation, so ignore this parameter during job submission.
  • --executor-memory, --executor-cores: Choose an instance type with the executor memory and cores you need. You provide the instance type for the workers during pool creation, so ignore this parameter during job submission.
  • --driver-class-path: Set spark.driver.extraClassPath to the appropriate value in the spark_conf argument.
  • --driver-java-options: Set spark.driver.extraJavaOptions to the appropriate value in the spark_conf argument.
  • --files: Set spark.files to the appropriate value in the spark_conf argument.
  • --name: In the Runs Submit request, use the run_name argument. In the Create Job request, use the name argument.
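
As a rough illustration of this mapping, a spark-submit invocation along the lines of the first command below could be expressed as the Runs Submit request that follows, assuming the JARs have been uploaded to DBFS. The class name, paths, and configuration values are hypothetical:

    spark-submit --class com.example.ComputeModels \
      --conf spark.driver.extraJavaOptions=-Dconfig.file=prod.conf \
      --num-executors 8 \
      --jars /libs/my-dependency.jar \
      /apps/my-app.jar

    databricks runs submit --json '{
      "run_name": "migrated spark-submit job",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        "instance_pool_id": "<pool-id>",
        "autoscale": { "min_workers": 2, "max_workers": 8 },
        "spark_conf": {
          "spark.driver.extraJavaOptions": "-Dconfig.file=prod.conf"
        }
      },
      "libraries": [
        { "jar": "dbfs:/libs/my-dependency.jar" },
        { "jar": "dbfs:/apps/my-app.jar" }
      ],
      "spark_jar_task": {
        "main_class_name": "com.example.ComputeModels"
      }
    }'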

Airflow

Azure Databricks offers an Airflow operator if you want to use Airflow to submit jobs. The Databricks Airflow operator calls the Jobs Run API to submit jobs to Azure Databricks. See Apache Airflow.

UI

Azure Databricks provides a simple and intuitive UI for submitting and scheduling jobs. To create and submit jobs from the UI, follow the step-by-step guide.

Step 3: Troubleshoot jobs

Azure Databricks provides several tools to help you troubleshoot your jobs.

Access logs and Spark UI

Azure Databricks maintains a fully managed Spark history server to allow you to access all the Spark logs and Spark UI for each job run. They can be accessed from the job details page as well as the job run page:

Forward logs

You can also forward cluster logs to your cloud storage location. To send logs to your location of choice, use the cluster_log_conf parameter in the NewCluster spec.
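
For example, the new_cluster spec in a Runs Submit or Jobs Create request might include a cluster_log_conf block like the following; the DBFS destination is illustrative:

    "new_cluster": {
      "spark_version": "5.0.x-scala2.11",
      "instance_pool_id": "<pool-id>",
      "num_workers": 10,
      "cluster_log_conf": {
        "dbfs": {
          "destination": "dbfs:/cluster-logs"
        }
      }
    }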

View metrics

While the job is running, you can go to the cluster page and look at the live Ganglia metrics in the Metrics tab. Azure Databricks also snapshots these metrics every 15 minutes and stores them, so you can look at these metrics even after your job is completed. To send metrics to your metrics server, you can install custom agents in the cluster. See Monitor performance.
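
One common way to install such an agent is a cluster-scoped init script referenced from the new_cluster spec. This is a sketch only; the script path is hypothetical, and the script itself must install and start your agent:

    "new_cluster": {
      ...
      "init_scripts": [
        {
          "dbfs": {
            "destination": "dbfs:/init-scripts/install-metrics-agent.sh"
          }
        }
      ]
    }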

Set alerts

Use email_notifications in the Job Create API to get alerts on job failures. You can also forward these email alerts to PagerDuty, Slack, and other monitoring systems.

Frequently asked questions (FAQs)

Can I run jobs without a pool?

Yes. Pools are optional. You can run jobs directly on a new cluster; in that case, Azure Databricks creates the cluster by requesting the required instances from the cloud provider. With pools, cluster startup time is around 30 seconds when idle instances are available in the pool.
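
When you run a job without a pool, the cluster spec names an instance type directly instead of a pool. A minimal sketch, with an illustrative Azure node type:

    "new_cluster": {
      "spark_version": "5.0.x-scala2.11",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 10
    }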

What is a notebook job?

Azure Databricks has different job types: JAR, Python, and notebook. A notebook job runs the code in the specified notebook. See Notebook job tips.

When should I use a notebook job instead of a JAR job?

A JAR job is equivalent to a spark-submit job. It executes the JAR, and you can then look at the logs and Spark UI for troubleshooting. A notebook job executes the specified notebook. You can import libraries in a notebook and call those libraries from the notebook as well. The advantage of using a notebook job as the main entry point is that you can easily debug your production jobs’ intermediate results in the notebook output area. See JAR job tips.
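
For reference, only the task portion of the job spec changes between the two types: a JAR job uses spark_jar_task (as in the earlier examples), while a notebook job uses notebook_task. The notebook path and parameters below are hypothetical:

    "notebook_task": {
      "notebook_path": "/Users/someone@example.com/nightly-model-training",
      "base_parameters": {
        "run_date": "2019-03-15"
      }
    }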

Can I connect to my own Hive metastore?

Yes, Azure Databricks supports external Hive metastores. See External Apache Hive Metastore.
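
As a rough sketch, connecting to an external Hive metastore is typically a matter of setting the Hive JDBC options in the spark_conf of your cluster spec. The connection values below are placeholders, and the exact keys and supported metastore versions are covered in External Apache Hive Metastore:

    "spark_conf": {
      "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:sqlserver://<host>:1433;database=<metastore-db>",
      "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "spark.hadoop.javax.jdo.option.ConnectionUserName": "<user>",
      "spark.hadoop.javax.jdo.option.ConnectionPassword": "<password>",
      "spark.sql.hive.metastore.version": "1.2.1",
      "spark.sql.hive.metastore.jars": "builtin"
    }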