PipelineStep Class

Represents an execution step in an Azure Machine Learning pipeline.

Pipelines are constructed from multiple pipeline steps, which are distinct computational units in the pipeline. Each step can run independently and use isolated compute resources. Each step typically has its own named inputs, outputs, and parameters.

The PipelineStep class is the base class from which other built-in step classes designed for common scenarios inherit, such as PythonScriptStep, DataTransferStep, and HyperDriveStep.

For an overview of how Pipelines and PipelineSteps relate, see What are ML Pipelines.
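
For example, a minimal pipeline can be assembled from a single concrete step such as PythonScriptStep. In the sketch below, the script name, source directory, and compute target name are placeholder values:

   from azureml.core import Workspace
   from azureml.pipeline.core import Pipeline
   from azureml.pipeline.steps import PythonScriptStep

   ws = Workspace.from_config()

   # A concrete step; PipelineStep itself is a base class and is not
   # instantiated directly.
   train_step = PythonScriptStep(
       name="train",
       script_name="train.py",
       source_directory="train_dir",
       compute_target="cpu-cluster")

   pipeline = Pipeline(workspace=ws, steps=[train_step])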

Initialize PipelineStep.

Inheritance
builtins.object
PipelineStep

Constructor

PipelineStep(name, inputs, outputs, arguments=None, fix_port_name_collisions=False, resource_inputs=None)

Parameters

name
str
Required

The name of the pipeline step.

inputs
list
Required

The list of step inputs.

outputs
list
Required

The list of step outputs.

arguments
list
default value: None

An optional list of arguments to pass to a script used in the step.

fix_port_name_collisions
bool
default value: False

Specifies whether to fix name collisions. If True and an input and output have the same name, then the input is prefixed with "INPUT". The default is False.

resource_inputs
list
default value: None

An optional list of inputs to be used as resources. Resources are downloaded to the script folder and provide a way to change the behavior of the script at run time.

Remarks

A PipelineStep is a unit of execution. It typically needs a target of execution (a compute target), a script to execute with optional script arguments and inputs, and it can produce outputs. A step can also take a number of other parameters specific to that step.

Pipeline steps can be configured together to construct a Pipeline, which represents a shareable and reusable Azure Machine Learning workflow. Each step of a pipeline can be configured to allow reuse of its previous run results if the step contents (scripts and dependencies) as well as the inputs and parameters remain unchanged. When a step is reused, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps.
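
As a sketch, concrete steps such as PythonScriptStep expose this behavior through their allow_reuse parameter (the names below are placeholders):

   from azureml.pipeline.steps import PythonScriptStep

   # allow_reuse=True (the default) lets the service reuse the results of a
   # previous run when the script, dependencies, inputs, and parameters are
   # unchanged; set it to False to force the step to run every time.
   prepare_step = PythonScriptStep(
       name="prepare",
       script_name="prepare.py",
       source_directory="prepare_dir",
       compute_target="cpu-cluster",
       allow_reuse=True)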

Azure Machine Learning Pipelines provides built-in steps for common scenarios. For examples, see the steps package and the AutoMLStep class. For an overview on constructing a Pipeline based on pre-built steps, see https://aka.ms/pl-first-pipeline.

Pre-built steps derived from PipelineStep are used within a single pipeline. If your machine learning workflow calls for steps that can be versioned and used across different pipelines, use the Module class instead.

Keep the following in mind when working with pipeline steps, input/output data, and step reuse (a short sketch follows the list).

  • It is recommended that you use separate source_directory locations for separate steps. If all the scripts in your pipeline steps are in a single directory, the hash of that directory changes every time you make a change to one script, forcing all steps to rerun. For an example of using separate directories for different steps, see https://aka.ms/pl-get-started.

  • Maintaining separate folders for the scripts and dependent files of each step helps reduce the size of the snapshot created for each step, because only that specific folder is snapshotted. Because a change to any file in a step's source_directory triggers a re-upload of the snapshot, maintaining separate folders for each step helps the overall reuse of steps in the pipeline: if there are no changes in the source_directory of a step, then the step's previous run is reused.

  • If data used in a step is in a datastore and allow_reuse is True, then changes to the data won't be detected. If the data is instead uploaded as part of the snapshot (under the step's source_directory), though this is not recommended, then the hash will change and trigger a rerun.
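
The sketch below illustrates the first two points; the paths and compute target name are placeholders:

   from azureml.pipeline.steps import PythonScriptStep

   # Each step points at its own source_directory, so editing prepare.py
   # only invalidates the snapshot (and therefore the reuse) of "prepare".
   prepare = PythonScriptStep(
       name="prepare",
       script_name="prepare.py",
       source_directory="scripts/prepare",
       compute_target="cpu-cluster")
   train = PythonScriptStep(
       name="train",
       script_name="train.py",
       source_directory="scripts/train",
       compute_target="cpu-cluster")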

Methods

create_input_output_bindings

Create input and output bindings from the step inputs and outputs.

create_module_def

Create the module definition object that describes the step.

create_node

Create a node for the pipeline graph based on this step.

get_source_directory

Get source directory for the step and check that the script exists.

resolve_input_arguments

Match inputs and outputs to arguments to produce an argument string.

run_after

Run this step after the specified step.

validate_arguments

Validate that the step inputs and outputs provided in arguments are in the inputs and outputs lists.

create_input_output_bindings

Create input and output bindings from the step inputs and outputs.

create_input_output_bindings(inputs, outputs, default_datastore, resource_inputs=None)

Parameters

inputs
list
Required

The list of step inputs.

outputs
list
Required

The list of step outputs.

default_datastore
AbstractAzureStorageDatastore or AzureDataLakeDatastore
Required

The default datastore.

resource_inputs
list
default value: None

The list of inputs to be used as resources. Resources are downloaded to the script folder and provide a way to change the behavior of the script at run time.

Returns

Tuple of the input bindings and output bindings.

Return type

tuple

create_module_def

Create the module definition object that describes the step.

create_module_def(execution_type, input_bindings, output_bindings, param_defs=None, create_sequencing_ports=True, allow_reuse=True, version=None, module_type=None, arguments=None, runconfig=None, cloud_settings=None)

Parameters

execution_type
str
Required

The execution type of the module.

input_bindings
list
Required

The step input bindings.

output_bindings
list
Required

The step output bindings.

param_defs
list
default value: None

The step parameter definitions.

create_sequencing_ports
bool
default value: True

Specifies whether sequencing ports will be created for the module.

allow_reuse
bool
default value: True

Specifies whether the module will be available to be reused in future pipelines.

version
str
default value: None

The version of the module.

module_type
str
default value: None

The module type for the module creation service to create. Currently only two types are supported: 'None' and 'BatchInferencing'. module_type is different from execution_type, which specifies what kind of backend service to use to run this module.

arguments
list
default value: None

The annotated arguments list to use when calling this module.

runconfig
str
default value: None

The run configuration that will be used for a PythonScriptStep.

cloud_settings
azureml.pipeline.core._restclients.aeva.models.CloudSettings
default value: None

The settings that will be used for clouds.

Returns

The module definition object.

Return type

ModuleDef

create_node

Create a node for the pipeline graph based on this step.

abstract create_node(graph, default_datastore, context)

Parameters

graph
Graph
Required

The graph to add the node to.

default_datastore
AbstractAzureStorageDatastore or AzureDataLakeDatastore
Required

The default datastore to use for this step.

context
azureml.pipeline.core._GraphContext
Required

The graph context object.

Returns

The created node.

Return type

Node
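
The sketch below outlines how a custom step might implement create_node; it is illustrative only. The execution type value is a placeholder, and the wiring that turns the module definition and bindings into a graph node (done by internal builder helpers in the built-in steps) is omitted:

   from azureml.pipeline.core import PipelineStep

   class MyCustomStep(PipelineStep):
       def __init__(self, name, inputs, outputs):
           # Keep our own references rather than relying on base-class
           # internals (an implementation choice for this sketch).
           self._my_inputs = list(inputs)
           self._my_outputs = list(outputs)
           super().__init__(name, inputs, outputs)

       def create_node(self, graph, default_datastore, context):
           # Helpers documented on this page:
           input_bindings, output_bindings = self.create_input_output_bindings(
               self._my_inputs, self._my_outputs, default_datastore)
           module_def = self.create_module_def(
               execution_type="escloud",  # placeholder execution type
               input_bindings=input_bindings,
               output_bindings=output_bindings)
           # A real implementation would now add a node to `graph` using
           # module_def and the bindings, and return that node.
           raise NotImplementedError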

get_source_directory

Get source directory for the step and check that the script exists.

get_source_directory(context, source_directory, script_name)

Parameters

context
azureml.pipeline.core._GraphContext
Required

The graph context object.

source_directory
str
Required

The source directory for the step.

script_name
str
Required

The script name for the step.

Returns

The source directory and hash paths.

Return type

tuple

resolve_input_arguments

Match inputs and outputs to arguments to produce an argument string.

static resolve_input_arguments(arguments, inputs, outputs, params)

Parameters

arguments
list
Required

A list of step arguments.

inputs
list
Required

A list of step inputs.

outputs
list
Required

A list of step outputs.

params
list
Required

A list of step parameters.

Returns

Returns a tuple of two items. The first is a flat list of the resolved arguments. The second is a list of structured arguments (_InputArgument, _OutputArgument, _ParameterArgument, and _StringArgument).

Return type

tuple

run_after

Run this step after the specified step.

run_after(step)

Parameters

step
PipelineStep
Required

The pipeline step to run before this step.

Remarks

If you want to run a step, say, step3 after both step1 and step2 are completed, you can use:


   step3.run_after(step1)
   step3.run_after(step2)

validate_arguments

Validate that the step inputs and outputs provided in arguments are in the inputs and outputs lists.

static validate_arguments(arguments, inputs, outputs)

Parameters

arguments
list
Required

The list of step arguments.

inputs
list
Required

The list of step inputs.

outputs
list
Required

The list of step outputs.
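
As an illustrative sketch (the pipeline data name below is hypothetical), the helper passes silently when every input and output referenced in arguments appears in the corresponding list, and raises an error otherwise:

   from azureml.pipeline.core import PipelineData, PipelineStep

   processed = PipelineData("processed")

   # OK: "processed" appears in the outputs list.
   PipelineStep.validate_arguments(
       arguments=["--out", processed],
       inputs=[],
       outputs=[processed])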