ParallelRunStep Class

Creates an Azure Machine Learning Pipeline step to process large amounts of data asynchronously and in parallel.

For an example of using ParallelRunStep, see the notebook https://aka.ms/batch-inference-notebooks.

Inheritance
azureml.pipeline.core._parallel_run_step_base._ParallelRunStepBase
ParallelRunStep

Constructor

ParallelRunStep(name, parallel_run_config, inputs, output=None, side_inputs=None, arguments=None, allow_reuse=True)

Parameters

name
str

Name of the step. Must be unique to the workspace, only consist of lowercase letters, numbers, or dashes, start with a letter, and be between 3 and 32 characters long.

parallel_run_config
ParallelRunConfig

A ParallelRunConfig object used to determine required run properties.

inputs
list[DatasetConsumptionConfig or PipelineOutputFileDataset or PipelineOutputTabularDataset]

List of input datasets. All datasets in the list must be of the same type. Input data will be partitioned for parallel processing.

output
PipelineData or OutputPortBinding or OutputDatasetConfig

Output port binding, which may be consumed by later pipeline steps.

side_inputs
list[InputPortBinding or DataReference or PortDataReference or PipelineData or PipelineOutputFileDataset or PipelineOutputTabularDataset or DatasetConsumptionConfig]

List of side input reference data. Side inputs are not partitioned like input data.

arguments
list[str]

List of command-line arguments to pass to the Python entry_script.

allow_reuse
bool

Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.
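Arguments supplied through arguments are forwarded to the entry script, which ParallelRunStep invokes through its init()/run() contract. The following is a minimal sketch of such a script; the --extra_arg custom argument and its default value are illustrative assumptions, not part of the API:

```python
import argparse

def init():
    """Called once per worker process before any mini-batch is processed."""
    global extra_arg
    parser = argparse.ArgumentParser()
    # --extra_arg is a hypothetical custom argument; ParallelRunStep also
    # injects its own framework arguments, so use parse_known_args().
    parser.add_argument("--extra_arg", type=str, default="default_value")
    args, _ = parser.parse_known_args()
    extra_arg = args.extra_arg

def run(mini_batch):
    """Called once per mini-batch; must return one result per input item."""
    return [f"{item}:{extra_arg}" for item in mini_batch]
```

Because init() runs once per worker, argument parsing happens once; run() is then called repeatedly with successive mini-batches.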

Remarks

ParallelRunStep can be used for processing large amounts of data in parallel. Common use cases are training an ML model or running offline inference to generate predictions on a batch of observations. ParallelRunStep works by breaking up your data into batches that are processed in parallel. The batch size, node count, and other tunable parameters to speed up your parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.
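Conceptually, the partition-and-fan-out behavior can be sketched in plain Python. This is a simplified illustration only: the real service schedules mini-batches across nodes and worker processes, and the helper names below are invented for this sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_mini_batches(items, mini_batch_size):
    # Partition the input into fixed-size mini-batches, analogous to how
    # ParallelRunStep partitions an input dataset (simplified: real
    # partitioning depends on the dataset type and mini_batch_size semantics).
    return [items[i:i + mini_batch_size]
            for i in range(0, len(items), mini_batch_size)]

def run(mini_batch):
    # Stand-in for the entry script's run(): one result per input item.
    return [item * 2 for item in mini_batch]

def process_in_parallel(items, mini_batch_size, workers):
    batches = split_into_mini_batches(items, mini_batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(run, batches)
    # Flatten results in order, loosely mirroring output_action="append_row".
    return [r for batch in results for r in batch]
```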

To use ParallelRunStep:

  • Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, number of nodes per compute target, and a reference to your custom Python script.

  • Create a ParallelRunStep object that uses the ParallelRunConfig object, and define the inputs and outputs for the step.

  • Use the configured ParallelRunStep object in a Pipeline just as you would with other pipeline step types.

The following example shows how to work with the ParallelRunStep and ParallelRunConfig classes for batch inference:


   from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

   parallel_run_config = ParallelRunConfig(
       source_directory=scripts_folder,
       entry_script=script_file,
       mini_batch_size="5",
       error_threshold=10,
       output_action="append_row",
       environment=batch_env,
       compute_target=compute_target,
       node_count=2)

   parallelrun_step = ParallelRunStep(
       name="predict-digits-mnist",
       parallel_run_config=parallel_run_config,
       inputs=[named_mnist_ds],
       output=output_dir,
       arguments=["--extra_arg", "example_value"],
       allow_reuse=True
   )

For more information about this example, see the notebook https://aka.ms/batch-inference-notebooks.

Methods

create_module_def

Create the module definition object that describes the step.

This method is not intended to be used directly.

create_node

Create a node for ParallelRunStep and add it to the specified graph.

This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.

create_module_def

Create the module definition object that describes the step.

This method is not intended to be used directly.

create_module_def(execution_type, input_bindings, output_bindings, param_defs=None, create_sequencing_ports=True, allow_reuse=True, version=None, arguments=None)

Parameters

execution_type
str

The execution type of the module.

input_bindings
list

The step input bindings.

output_bindings
list

The step output bindings.

param_defs
list
default value: None

The step param definitions.

create_sequencing_ports
bool
default value: True

If true, sequencing ports will be created for the module.

allow_reuse
bool
default value: True

If true, the module will be available to be reused in future Pipelines.

version
str
default value: None

The version of the module.

arguments
list
default value: None

Annotated arguments list to use when calling this module.

Returns

The module def object.

Return type

ModuleDef

create_node

Create a node for ParallelRunStep and add it to the specified graph.

This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

graph
Graph

Graph object.

default_datastore
AbstractAzureStorageDatastore or AzureDataLakeDatastore

Default datastore.

context
_GraphContext

Context.

Returns

The created node.

Return type

Node