PipelineData class

Definition

Represents intermediate data in an Azure Machine Learning pipeline.

Data used in a pipeline can be produced by one step and consumed in another step by providing a PipelineData object as an output of one step and as an input of one or more subsequent steps.

PipelineData(name, datastore=None, output_name=None, output_mode='mount', output_path_on_compute=None, output_overwrite=None, data_type=None, is_directory=None, pipeline_output_name=None, training_output=None)
Inheritance
builtins.object
PipelineData

Parameters

name
str

The name of the PipelineData object, which can contain only letters, digits, and underscores.

datastore
AbstractAzureStorageDatastore or AzureDataLakeDatastore

The Datastore the PipelineData will reside on. If unspecified, the default datastore is used.

output_name
str

The name of the output; if None, the value of name is used. Can contain only letters, digits, and underscores.

output_mode
str

Specifies whether the producing step will use the "upload" or "mount" method to access the data.

output_path_on_compute
str

For output_mode = "upload", this parameter represents the path the step writes the output to.

output_overwrite
bool

For output_mode = "upload", this parameter specifies whether to overwrite existing data.

data_type
str

Optional. Data type can be used to specify the expected type of the output and to detail how consuming steps should use the data. It can be any user-defined string.

is_directory
bool

Specifies whether the data is a directory or single file. This is only used to determine a data type used by Azure ML backend when the data_type parameter is not provided. The default is False.

pipeline_output_name
str

If provided, this output will be available by using PipelineRun.get_pipeline_output(). Pipeline output names must be unique in the pipeline.

training_output
TrainingOutput

Optional. Defines the output for a training result. This is needed only for specific trainings which result in different kinds of outputs, such as metrics and a model. For example, AutoMLStep results in metrics and a model. You can also define a specific training iteration or a metric used to select the best model.

Remarks

PipelineData represents the data output a step will produce when it is run. Use PipelineData when creating steps to describe the files or directories that will be generated by the step. These data outputs will be added to the specified Datastore and can be retrieved and viewed later.

For example, the following pipeline step produces one output, named "model":


   from azureml.pipeline.core import PipelineData
   from azureml.pipeline.steps import PythonScriptStep

   datastore = ws.get_default_datastore()
   step_output = PipelineData("model", datastore=datastore)
   step = PythonScriptStep(script_name="train.py",
                           arguments=["--model", step_output],
                           outputs=[step_output],
                           compute_target=aml_compute,
                           source_directory=source_directory)

In this case, the train.py script will write the model it produces to the location which is provided to the script through the --model argument.
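
A minimal train.py consistent with this wiring might look like the following. This is a hypothetical sketch, not part of the library; only the --model argument name is taken from the example above:


   # train.py -- hypothetical script body for the step above
   import argparse
   import os

   parser = argparse.ArgumentParser()
   parser.add_argument('--model', dest='model', required=True)
   args = parser.parse_args()

   # The PipelineData argument resolves to a path on the compute.
   # Create a directory at that output path and write the model into it.
   os.makedirs(args.model, exist_ok=True)
   with open(os.path.join(args.model, 'model.pkl'), 'wb') as f:
       f.write(b'serialized-model-bytes')  # placeholder for a real serialized model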

PipelineData objects are also used when constructing Pipelines to describe step dependencies. To specify that a step requires the output of another step as input, use a PipelineData object in the constructor of both steps.

For example, the pipeline train step depends on the process_step_output output of the pipeline process step:


   from azureml.pipeline.core import Pipeline, PipelineData
   from azureml.pipeline.steps import PythonScriptStep

   datastore = ws.get_default_datastore()
   process_step_output = PipelineData("processed_data", datastore=datastore)
   process_step = PythonScriptStep(script_name="process.py",
                                   arguments=["--data_for_train", process_step_output],
                                   outputs=[process_step_output],
                                   compute_target=aml_compute,
                                   source_directory=process_directory)
   train_step = PythonScriptStep(script_name="train.py",
                                 arguments=["--data_for_train", process_step_output],
                                 inputs=[process_step_output],
                                 compute_target=aml_compute,
                                 source_directory=train_directory)

   pipeline = Pipeline(workspace=ws, steps=[process_step, train_step])

This will create a Pipeline with two steps. The process step will be executed first, then after it has completed, the train step will be executed. Azure ML will provide the output produced by the process step to the train step.
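
To run the pipeline, submit it as an experiment; a minimal sketch (the experiment name here is arbitrary):


   from azureml.core import Experiment

   # Submit the pipeline; the process step runs first, then the train step.
   pipeline_run = Experiment(ws, 'pipeline_data_demo').submit(pipeline)
   pipeline_run.wait_for_completion()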

See this page for further examples of using PipelineData to construct a Pipeline: https://aka.ms/pl-data-dep

For supported compute types, PipelineData can also be used to specify how the data will be produced and consumed by the run. There are two supported methods:

  • Mount (default): The input or output data is mounted to local storage on the compute node, and an environment variable is set which points to the path of this data ($AZUREML_DATAREFERENCE_name). For convenience, you can pass the PipelineData object in as one of the arguments to your script, for example using the arguments parameter of PythonScriptStep, and the object will resolve to the path to the data. For outputs, your compute script should create a file or directory at this output path. To see the value of the environment variable used when you pass in the PipelineData object as an argument, use the get_env_variable_name() method.

  • Upload: Specify an output_path_on_compute corresponding to a file or directory name that your script will generate. (Environment variables are not used in this case.) A sketch of declaring an upload-mode output follows this list.
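
Such an upload-mode output might be declared as follows (a minimal sketch; the output name and path are illustrative):


   from azureml.pipeline.core import PipelineData

   # With output_mode="upload", the script writes to output_path_on_compute
   # and Azure ML uploads that file or directory to the datastore afterwards.
   upload_output = PipelineData("predictions",
                                datastore=datastore,
                                output_mode="upload",
                                output_path_on_compute="outputs/predictions.csv",
                                output_overwrite=True)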

Methods

as_dataset()

Promote the intermediate output into a Dataset.

The returned dataset will exist after the step has executed.

as_download(input_name=None, path_on_compute=None, overwrite=None)

Consume the PipelineData as download.

as_input(input_name)

Create an InputPortBinding and specify an input name (but use default mode).

as_mount(input_name=None)

Consume the PipelineData as mount.

create_input_binding(input_name=None, mode=None, path_on_compute=None, overwrite=None)

Create input binding.

get_env_variable_name()

Return the name of the environment variable for this PipelineData.

as_dataset()

Promote the intermediate output into a Dataset.

The returned dataset will exist after the step has executed.

as_dataset()

Returns

The intermediate output as a Dataset.

Return type

PipelineOutputFileDataset
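
For example, a minimal sketch of promoting an intermediate output (step_output is the PipelineData from the earlier example):


   # Promote the intermediate output so downstream steps can consume it
   # as a Dataset once the producing step has run.
   step_output_ds = step_output.as_dataset()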

as_download(input_name=None, path_on_compute=None, overwrite=None)

Consume the PipelineData as download.

as_download(input_name=None, path_on_compute=None, overwrite=None)

Parameters

input_name
str

Use to specify a name for this input.

default value: None
path_on_compute
str

The path on the compute to download to.

default value: None
overwrite
bool

Use to indicate whether to overwrite existing data.

default value: None

Returns

The InputPortBinding with this PipelineData as the source.

Return type

InputPortBinding
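
For example, a sketch of consuming the upstream output by download rather than mount, reusing the names from the Remarks example:


   # Download the processed data onto the compute instead of mounting it.
   download_input = process_step_output.as_download(path_on_compute="/tmp/processed_data")
   train_step = PythonScriptStep(script_name="train.py",
                                 arguments=["--data_for_train", download_input],
                                 inputs=[download_input],
                                 compute_target=aml_compute,
                                 source_directory=train_directory)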

as_input(input_name)

Create an InputPortBinding and specify an input name (but use default mode).

as_input(input_name)

Parameters

input_name
str

Use to specify a name for this input.

Returns

The InputPortBinding with this PipelineData as the source.

Return type

InputPortBinding

as_mount(input_name=None)

Consume the PipelineData as mount.

as_mount(input_name=None)

Parameters

input_name
str

Use to specify a name for this input.

default value: None

Returns

The InputPortBinding with this PipelineData as the source.

Return type

InputPortBinding

create_input_binding(input_name=None, mode=None, path_on_compute=None, overwrite=None)

Create input binding.

create_input_binding(input_name=None, mode=None, path_on_compute=None, overwrite=None)

Parameters

input_name
str

The name of the input.

default value: None
mode
str

The mode to access the PipelineData ("mount" or "download").

default value: None
path_on_compute
str

For "download" mode, the path on the compute the data will reside.

default value: None
overwrite
bool

For "download" mode, whether to overwrite existing data.

default value: None

Returns

The InputPortBinding with this PipelineData as the source.

Return type

InputPortBinding
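
For example, a sketch equivalent to the as_download call above, built through the generic binding method:


   # Explicitly choose the input name and access mode for the binding.
   binding = process_step_output.create_input_binding(input_name="training_data",
                                                      mode="download",
                                                      path_on_compute="/tmp/training_data",
                                                      overwrite=True)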

get_env_variable_name()

Return the name of the environment variable for this PipelineData.

get_env_variable_name()

Returns

The environment variable name.

Return type

str
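
For example, a sketch of resolving the mounted path inside the running script via the environment variable (the output name "model" matches the earlier example):


   import os

   # At authoring time, PipelineData("model", ...).get_env_variable_name()
   # returns "AZUREML_DATAREFERENCE_model"; inside the running script, the
   # same variable holds the resolved path of the mounted data.
   model_path = os.environ["AZUREML_DATAREFERENCE_model"]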

Attributes

data_type

Type of data which will be produced.

Returns

The data type name.

Return type

str

datastore

Datastore the PipelineData will reside on.

Returns

The Datastore object.

Return type

AbstractAzureStorageDatastore or AzureDataLakeDatastore

name

Name of the PipelineData object.

Returns

Name.

Return type

str