DatabricksStep class

Definition

Creates an Azure ML Pipeline step to add a Databricks notebook, Python script, or JAR as a node.

For an example of using DatabricksStep, see the notebook https://aka.ms/pl-databricks.

DatabricksStep(name, inputs=None, outputs=None, existing_cluster_id=None, spark_version=None, node_type=None, instance_pool_id=None, num_workers=None, min_workers=None, max_workers=None, spark_env_variables=None, spark_conf=None, init_scripts=None, cluster_log_dbfs_path=None, notebook_path=None, notebook_params=None, python_script_path=None, python_script_params=None, main_class_name=None, jar_params=None, python_script_name=None, source_directory=None, hash_paths=None, run_name=None, timeout_seconds=None, runconfig=None, maven_libraries=None, pypi_libraries=None, egg_libraries=None, jar_libraries=None, rcran_libraries=None, compute_target=None, allow_reuse=True, version=None)
Inheritance
azureml.pipeline.core._databricks_step_base._DatabricksStepBase
DatabricksStep

Parameters

name
str

[Required] The name of the step.

inputs
list[InputPortBinding or DataReference or PortDataReference or PipelineData or PipelineOutputAbstractDataset or Dataset or DatasetDefinition or DatasetConsumptionConfig or PipelineDataset]

A list of input connections for data consumed by this step. Fetch an input inside the notebook using dbutils.widgets.get("input_name"). Can be DataReference or PipelineData. DataReference represents an existing piece of data on a datastore; essentially this is a path on a datastore. DatabricksStep supports datastores that encapsulate DBFS, Azure Blob, or ADLS v1. PipelineData represents intermediate data produced by another step in a pipeline.

outputs
list[OutputPortBinding or PipelineOutputAbstractDataset or PipelineData]

A list of output port definitions for outputs produced by this step. Fetch this inside the notebook using dbutils.widgets.get("output_name"). Should be PipelineData.

existing_cluster_id
str

A cluster ID of an existing interactive cluster on the Databricks workspace. If you are passing this parameter, you cannot pass any of the following parameters which are used to create a new cluster:

  • spark_version
  • node_type
  • instance_pool_id
  • num_workers
  • min_workers
  • max_workers
  • spark_env_variables
  • spark_conf

Note: For creating a new job cluster, you will need to pass the above parameters. You can pass these parameters directly or you can pass them as part of the RunConfiguration object using the runconfig parameter. Passing these parameters directly and through RunConfiguration results in an error.
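As an illustration, the two mutually exclusive configurations might look like the following sketch. The cluster ID, notebook path, and databricks_compute target are placeholders; databricks_compute is assumed to be a DatabricksCompute target attached earlier.

```python
from azureml.pipeline.steps import DatabricksStep

# Option A: run on an existing interactive cluster (hypothetical cluster ID).
# None of the new-cluster parameters may be passed alongside existing_cluster_id.
step_existing = DatabricksStep(
    name="on-existing-cluster",
    existing_cluster_id="0000-000000-abcd123",
    notebook_path="/Users/me/my-notebook",
    compute_target=databricks_compute,
)

# Option B: create a new job cluster by supplying the cluster parameters,
# either directly as below or via a RunConfiguration -- not both.
step_new = DatabricksStep(
    name="on-new-cluster",
    spark_version="4.0.x-scala2.11",
    node_type="Standard_D3_v2",
    num_workers=1,
    notebook_path="/Users/me/my-notebook",
    compute_target=databricks_compute,
)
```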

spark_version
str

The version of Spark for the Databricks run cluster, for example: "4.0.x-scala2.11". For more information, see the description for the existing_cluster_id parameter.

node_type
str

[Required] The Azure VM node type for the Databricks run cluster, for example: "Standard_D3_v2". Specify either node_type or instance_pool_id. For more information, see the description for the existing_cluster_id parameter.

instance_pool_id
str

[Required] The instance pool ID to which the cluster needs to be attached. Specify either node_type or instance_pool_id. For more information, see the description for the existing_cluster_id parameter.

num_workers
int

[Required] The static number of workers for the Databricks run cluster. You must specify either num_workers or both min_workers and max_workers. For more information, see the description for the existing_cluster_id parameter.

min_workers
int

[Required] The min number of workers to use for auto-scaling the Databricks run cluster. You must specify either num_workers or both min_workers and max_workers. For more information, see the description for the existing_cluster_id parameter.

max_workers
int

[Required] The max number of workers to use for auto-scaling the Databricks run cluster. You must specify either num_workers or both min_workers and max_workers. For more information, see the description for the existing_cluster_id parameter.

spark_env_variables
dict

The Spark environment variables for the Databricks run cluster. For more information, see the description for the existing_cluster_id parameter.

spark_conf
dict

The Spark configuration for the Databricks run cluster. For more information, see the description for the existing_cluster_id parameter.

init_scripts
[str]

The DBFS paths to init scripts for the new cluster.

cluster_log_dbfs_path
str

The DBFS path where cluster logs are to be delivered.

notebook_path
str

[Required] The path to the notebook in the Databricks instance. This class allows four ways of specifying the code to execute on the Databricks cluster.

  1. To execute a notebook that is present in the Databricks workspace, use: notebook_path=notebook_path, notebook_params={'myparam': 'testparam'}

  2. To execute a Python script that is present in DBFS, use: python_script_path=python_script_dbfs_path, python_script_params=['arg1', 'arg2']

  3. To execute a JAR that is present in DBFS, use: main_class_name=main_jar_class_name, jar_params=['arg1', 'arg2'], jar_libraries=[JarLibrary(jar_library_dbfs_path)]

  4. To execute a Python script that is present on your local machine, use: python_script_name=python_script_name, source_directory=source_directory

Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.
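For instance, option 3 (a JAR in DBFS) might be sketched as follows. The class name and DBFS path are hypothetical, and databricks_compute is assumed to be a DatabricksCompute target attached earlier.

```python
from azureml.pipeline.steps import DatabricksStep
from azureml.core.databricks import JarLibrary

# Run the main class of a JAR stored in DBFS on a new one-worker job cluster.
jar_step = DatabricksStep(
    name="DBJarStep",
    main_class_name="com.example.Main",          # hypothetical entry point
    jar_params=["arg1", "arg2"],
    jar_libraries=[JarLibrary("dbfs:/mylibs/my-job.jar")],
    num_workers=1,
    compute_target=databricks_compute,
)
```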

notebook_params
dict[str, str or PipelineParameter]

A dictionary of parameters to pass to the notebook. notebook_params are available as widgets. You can fetch the values from these widgets inside your notebook using dbutils.widgets.get("myparam").
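Inside the notebook, each entry of notebook_params surfaces as a widget. A minimal sketch of reading the value (dbutils is provided by the Databricks runtime and is only available there):

```python
# Runs inside the Databricks notebook; dbutils is injected by the runtime.
# Returns the string value supplied for "myparam" in notebook_params.
myparam = dbutils.widgets.get("myparam")
print(myparam)
```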

python_script_path
str

[Required] The path to the Python script in DBFS or S3. Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.

python_script_params
list[str or PipelineParameter]

Parameters for the Python script.

main_class_name
str

[Required] The name of the entry point in a JAR module. Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.

jar_params
list[str or PipelineParameter]

Parameters for the JAR module.

python_script_name
str

[Required] The name of a Python script relative to source_directory. If the script takes inputs and outputs, those will be passed to the script as parameters. If python_script_name is specified then source_directory must be too.

Specify exactly one of notebook_path, python_script_path, python_script_name, or main_class_name.

If you specify a DataReference object as input with data_reference_name=input1 and a PipelineData object as output with name=output1, then the inputs and outputs are passed to the script as parameters. They will look like the following, and you will need to parse the arguments in your script to access the path of each input and output: "-input1","wasbs://test@storagename.blob.core.windows.net/test","-output1", "wasbs://test@storagename.blob.core.windows.net/b3e26de1-87a4-494d-a20f-1988d22b81a2/output1"

In addition, the following parameters will be available within the script:

  • AZUREML_RUN_TOKEN: The AML token for authenticating with Azure Machine Learning.
  • AZUREML_RUN_TOKEN_EXPIRY: The AML token expiry time.
  • AZUREML_RUN_ID: Azure Machine Learning Run ID for this run.
  • AZUREML_ARM_SUBSCRIPTION: Azure subscription for your AML workspace.
  • AZUREML_ARM_RESOURCEGROUP: Azure resource group for your Azure Machine Learning workspace.
  • AZUREML_ARM_WORKSPACE_NAME: Name of your Azure Machine Learning workspace.
  • AZUREML_ARM_PROJECT_NAME: Name of your Azure Machine Learning experiment.
  • AZUREML_SCRIPT_DIRECTORY_NAME: Directory path structure in DBFS where source_directory has been copied.
  • AZUREML_SERVICE_ENDPOINT: The endpoint URL for AML services.

When you execute a Python script from your local machine on Databricks using DatabricksStep, your source_directory is copied over to DBFS, and the directory structure path on DBFS is passed as a parameter to your script when it begins execution. This parameter is labeled --AZUREML_SCRIPT_DIRECTORY_NAME. You need to prefix it with the string "dbfs:/" or "/dbfs/" to access the directory in DBFS.
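A minimal sketch of parsing these arguments in your script, assuming an input named input1 and an output named output1 as in the example above (the actual argument names depend on the names you chose for your inputs and outputs):

```python
import argparse

def parse_step_args(argv):
    """Parse the input/output paths and the DBFS script-directory argument
    that DatabricksStep passes to a local Python script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-input1")   # path of the DataReference input
    parser.add_argument("-output1")  # path of the PipelineData output
    parser.add_argument("--AZUREML_SCRIPT_DIRECTORY_NAME")
    # parse_known_args tolerates the other AZUREML_* parameters also passed in
    args, _ = parser.parse_known_args(argv)
    # Prefix with "/dbfs/" (local file APIs) or "dbfs:/" (Spark APIs)
    script_dir = None
    if args.AZUREML_SCRIPT_DIRECTORY_NAME:
        script_dir = "/dbfs/" + args.AZUREML_SCRIPT_DIRECTORY_NAME
    return args.input1, args.output1, script_dir
```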

source_directory
str

The folder that contains the script and other files. If python_script_name is specified then source_directory must be too.

hash_paths
[str]

DEPRECATED: no longer needed.

A list of paths to hash when checking for changes to the step contents. If no changes are detected, the pipeline will reuse the step contents from a previous run. By default, the contents of source_directory are hashed except for files listed in .amlignore or .gitignore.

run_name
str

The name in Databricks for this run.

timeout_seconds
int

The timeout for the Databricks run.

runconfig
RunConfiguration

The runconfig to use.

Note: You can pass as many libraries as you like as dependencies for your job using the maven_libraries, pypi_libraries, egg_libraries, jar_libraries, and rcran_libraries parameters. Pass these libraries either directly with the corresponding parameters or as part of the RunConfiguration object using the runconfig parameter, but not both.
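A sketch of attaching library dependencies directly on the step (the package names and notebook path are illustrative, and databricks_compute is assumed to be a DatabricksCompute target attached earlier):

```python
from azureml.pipeline.steps import DatabricksStep
from azureml.core.databricks import PyPiLibrary, MavenLibrary

# Attach PyPI and Maven dependencies directly; do not also set libraries on a
# RunConfiguration passed via the runconfig parameter.
step = DatabricksStep(
    name="with-libraries",
    notebook_path="/Users/me/my-notebook",
    num_workers=1,
    compute_target=databricks_compute,
    pypi_libraries=[PyPiLibrary(package="scikit-learn==0.22")],
    maven_libraries=[MavenLibrary(coordinates="com.databricks:spark-csv_2.11:1.5.0")],
)
```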

maven_libraries
list[MavenLibrary]

Maven libraries to use for the Databricks run.

pypi_libraries
list[PyPiLibrary]

PyPi libraries to use for the Databricks run.

egg_libraries
list[EggLibrary]

Egg libraries to use for the Databricks run.

jar_libraries
list[JarLibrary]

Jar libraries to use for the Databricks run.

rcran_libraries
list[RCranLibrary]

RCran libraries to use for the Databricks run.

compute_target
str, DatabricksCompute

[Required] An Azure Databricks compute. Before you can use DatabricksStep to execute your scripts or notebooks on an Azure Databricks workspace, you need to add the Azure Databricks workspace as a compute target to your Azure Machine Learning workspace.

allow_reuse
bool

Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If the step contents (scripts/dependencies) as well as inputs and parameters remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed.

version
str

An optional version tag to denote a change in functionality for the step.

Remarks

The following example shows how to use DatabricksStep in an Azure Machine Learning Pipeline. In this example, a notebook in the Databricks workspace is run.


   notebook_path=os.getenv("DATABRICKS_NOTEBOOK_PATH", "<my-databricks-notebook-path>") # Databricks notebook path

   dbNbStep = DatabricksStep(
       name="DBNotebookInWS",
       inputs=[step_1_input],
       outputs=[step_1_output],
       num_workers=1,
       notebook_path=notebook_path,
       notebook_params={'myparam': 'testparam'},
       run_name='DB_Notebook_demo',
       compute_target=databricks_compute,
       allow_reuse=True
   )

Full sample is available from https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/azure-databricks/databricks-as-remote-compute-target/aml-pipelines-use-databricks-as-compute-target.ipynb

Methods

create_node(graph, default_datastore, context)

Create a node from the Databricks step and add it to the specified graph.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the parameters required through this method so that step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

graph
Graph

The graph object to add the node to.

default_datastore
AbstractAzureStorageDatastore or AzureDataLakeDatastore

The default datastore.

context
_GraphContext

The graph context.

Returns

The created node.

Return type

Node