ParallelRunConfig Class

Defines configuration for a ParallelRunStep object.

For an example of using ParallelRunStep, see the notebook https://aka.ms/batch-inference-notebooks.

Inheritance
azureml.pipeline.core._parallel_run_config_base._ParallelRunConfigBase
ParallelRunConfig

Constructor

ParallelRunConfig(environment, entry_script, error_threshold, output_action, compute_target, node_count, process_count_per_node=None, mini_batch_size=None, source_directory=None, description=None, logging_level='INFO', run_invocation_timeout=60, run_max_try=3, append_row_file_name=None)

Parameters

environment
Environment

The environment definition that configures the Python environment. It can be configured to use an existing Python environment or to set up a temp environment for the experiment. The environment definition is responsible for defining the required application dependencies, such as conda or pip packages.

entry_script
str

User script which will be run in parallel on multiple nodes. This is specified as a local file path. If source_directory is specified, then entry_script is a relative path inside the directory. Otherwise, it can be any path accessible on the machine.

error_threshold
int

The number of record failures for TabularDataset and file failures for FileDataset that should be ignored during processing. If the error count goes above this value, the job will be aborted. The error threshold applies to the entire input, not to individual mini-batches sent to the run() method. The range is [-1, int.max]; -1 indicates that all failures should be ignored during processing.
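The threshold rule can be stated as a small predicate. The helper below is illustrative only, not SDK code; it mirrors the description above (abort once the error count exceeds the threshold, with -1 meaning ignore all failures):

```python
def exceeds_error_threshold(error_count, error_threshold):
    # Illustrative only, not part of the SDK. -1 means ignore all
    # failures; otherwise the job aborts once the total error count
    # for the entire input goes above the threshold.
    if error_threshold == -1:
        return False
    return error_count > error_threshold
```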

output_action
str

How the output should be organized. Current supported values are 'append_row' and 'summary_only'.

  1. 'append_row' – All values output by run() method invocations will be aggregated into one unique file named parallel_run_step.txt which is created in the output location.
  2. 'summary_only' – User script is expected to store the output itself. An output row is still expected for each successful input item processed. The system uses this output only for error threshold calculation (ignoring the actual value of the row).
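The two modes differ only in what happens to the rows run() returns: with 'append_row' they are collected into the aggregate output file, while with 'summary_only' they feed only the error-threshold accounting. A minimal, illustrative entry script (the file names and processing logic are placeholders):

```python
import os

def init():
    # Called once per worker process; a typical place to load a model.
    pass

def run(mini_batch):
    # For a FileDataset input, mini_batch is a list of file paths.
    # Return one output row per successfully processed item; with
    # output_action='append_row', each row is appended to the
    # aggregate output file in the output location.
    results = []
    for file_path in mini_batch:
        results.append(f"{os.path.basename(file_path)}: processed")
    return results
```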
compute_target
AmlCompute or str

Compute target to use for ParallelRunStep execution. This parameter may be specified as a compute target object or the name of a compute target in the workspace.

node_count
int

Number of nodes in the compute target used for running the ParallelRunStep.

process_count_per_node
int

Number of processes executed on each node. (optional, default value is the number of cores on the node.)

mini_batch_size
Union[str, int]

For FileDataset input, this field is the number of files a user script can process in one run() call. For TabularDataset input, this field is the approximate size of data the user script can process in one run() call. Example values are 1024, 1024KB, 10MB, and 1GB. (optional, default value is 10 files for FileDataset and 1MB for TabularDataset.)
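To make the accepted size strings concrete, here is a hypothetical parser (not part of the SDK) that converts them to bytes, assuming the usual 1 KB = 1024 bytes convention:

```python
import re

_UNITS = {"": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

def parse_mini_batch_size(value):
    """Parse values like 1024, '1024KB', '10MB', or '1GB' into bytes.

    Illustrative helper only; the SDK performs its own parsing.
    """
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)(KB|MB|GB)?", value.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized mini_batch_size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or ""]
```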

source_directory
str

Path to folder that contains the entry_script and supporting files used to execute on compute target.

description
str

A description of the batch service, used for display purposes.

logging_level
str

A string of the logging level name, which is defined in 'logging'. Possible values are 'WARNING', 'INFO', and 'DEBUG'. (optional, default value is 'INFO'.)

run_invocation_timeout
int

Timeout in seconds for each invocation of the run() method. (optional, default value is 60.)

run_max_try
int

The maximum number of tries for a failed or timed-out mini-batch. The range is [1, int.max]. The default value is 3. A mini-batch with a dequeue count greater than this value won't be processed again and will be deleted directly.
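The retry rule above can be sketched as a toy scheduler. This is illustrative only, not SDK code: each mini-batch is attempted at most run_max_try times, after which it is dropped rather than re-queued.

```python
def process_with_retries(mini_batches, run_fn, run_max_try=3):
    # Toy scheduler, not SDK code: each mini-batch is attempted at
    # most run_max_try times; once its attempt count reaches
    # run_max_try it is dropped and not processed again.
    done, dropped = [], []
    for batch in mini_batches:
        for attempt in range(1, run_max_try + 1):
            try:
                done.append(run_fn(batch))
                break
            except Exception:
                if attempt == run_max_try:
                    dropped.append(batch)
    return done, dropped
```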

append_row_file_name
str

The name of the output file if the output_action is 'append_row'. (optional, default value is 'parallel_run_step.txt')

Remarks

The ParallelRunConfig class is used to provide configuration for the ParallelRunStep class. ParallelRunConfig and ParallelRunStep can be used together for processing large amounts of data in parallel. Common use cases are training an ML model or running offline inference to generate predictions on a batch of observations.

ParallelRunStep works by breaking up your data into batches that are processed in parallel. The batch size, node count, and other tunable parameters that speed up your parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.

To use ParallelRunStep and ParallelRunConfig:

  • Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, number of nodes per compute target, and a reference to your custom Python script.

  • Create a ParallelRunStep object that uses the ParallelRunConfig object and defines inputs and outputs for the step.

  • Use the configured ParallelRunStep object in a Pipeline just as you would with other pipeline step types.

The following example shows the ParallelRunStep and ParallelRunConfig classes working together for batch inference:


   from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

   parallel_run_config = ParallelRunConfig(
       source_directory=scripts_folder,
       entry_script=script_file,
       mini_batch_size="5",
       error_threshold=10,
       run_max_try=3,
       output_action="append_row",
       environment=batch_env,
       compute_target=compute_target,
       node_count=2)

   parallelrun_step = ParallelRunStep(
       name="predict-digits-mnist",
       parallel_run_config=parallel_run_config,
       inputs=[ named_mnist_ds ],
       output=output_dir,
       arguments=[ "--extra_arg", "example_value" ],
       allow_reuse=True
   )

For more information about this example, see the notebook https://aka.ms/batch-inference-notebooks.

Methods

load_yaml

Load parallel run configuration data from a YAML file.

save_to_yaml

Export parallel run configuration data to a YAML file.

load_yaml

Load parallel run configuration data from a YAML file.

load_yaml(workspace, path)

Parameters

workspace
Workspace

The workspace to read the configuration data from.

path
str

The path to load the configuration from.

save_to_yaml

Export parallel run configuration data to a YAML file.

save_to_yaml(path)

Parameters

path
str

The path to save the file to.