PythonScriptStep Class

Creates an Azure ML Pipeline step that runs a Python script.

For an example of using PythonScriptStep, see the notebook https://aka.ms/pl-get-started.

Inheritance: azureml.pipeline.core._python_script_step_base._PythonScriptStepBase

PythonScriptStep

Constructor

PythonScriptStep(script_name, name=None, arguments=None, compute_target=None, runconfig=None, runconfig_pipeline_params=None, inputs=None, outputs=None, params=None, source_directory=None, allow_reuse=True, version=None, hash_paths=None)

Parameters

Name	Description
script_name Required	str [Required] The name of a Python script relative to `source_directory`.
name	str The name of the step. If unspecified, `script_name` is used. default value: None
arguments	list Command line arguments for the Python script file. The arguments are passed to the compute target via the `arguments` parameter in RunConfiguration. For details on handling arguments that contain special symbols, see RunConfiguration. default value: None
compute_target	Union[DsvmCompute, AmlCompute, RemoteCompute, HDInsightCompute, str, tuple] [Required] The compute target to use. If unspecified, the target from the runconfig will be used. This parameter may be specified as a compute target object or as the string name of a compute target on the workspace. Optionally, if the compute target is not available at pipeline creation time, you may specify a tuple of ('compute target name', 'compute target type') to avoid fetching the compute target object (the AmlCompute type is 'AmlCompute' and the RemoteCompute type is 'VirtualMachine'). default value: None
runconfig	RunConfiguration The optional RunConfiguration to use. A RunConfiguration can be used to specify additional requirements for the run, such as conda dependencies and a docker image. If unspecified, a default runconfig will be created. default value: None
runconfig_pipeline_params	dict[str, PipelineParameter] Overrides of runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property. Supported values: 'NodeCount', 'MpiProcessCountPerNode', 'TensorflowWorkerCount', 'TensorflowParameterServerCount' default value: None
inputs	list[Union[InputPortBinding, DataReference, PortDataReference, PipelineData, PipelineOutputFileDataset, PipelineOutputTabularDataset, DatasetConsumptionConfig]] A list of input port bindings. default value: None
outputs	list[Union[PipelineData, OutputDatasetConfig, PipelineOutputFileDataset, PipelineOutputTabularDataset, OutputPortBinding]] A list of output port bindings. default value: None
params	dict A dictionary of name-value pairs. Each pair is registered as an environment variable whose name starts with "AML_PARAMETER_". default value: None
source_directory	str A folder that contains the Python script, conda environment, and other resources used in the step. default value: None
allow_reuse	bool Indicates whether the step should reuse previous results when re-run with the same settings. Reuse is enabled by default. If the step contents (scripts/dependencies) as well as inputs and parameters remain unchanged, the output from the previous run of this step is reused. When reusing the step, instead of submitting the job to compute, the results from the previous run are immediately made available to any subsequent steps. If you use Azure Machine Learning datasets as inputs, reuse is determined by whether the dataset's definition has changed, not by whether the underlying data has changed. default value: True
version	str An optional version tag to denote a change in functionality for the step. default value: None
hash_paths	list DEPRECATED: no longer needed. A list of paths to hash when checking for changes to the step contents. If no changes are detected, the pipeline reuses the step contents from a previous run. By default, the contents of `source_directory` are hashed, except for files listed in .amlignore or .gitignore. default value: None
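
The `params` entry above says that each name-value pair is exposed to the script as an environment variable carrying the "AML_PARAMETER_" prefix. A minimal sketch of reading such a value from inside the step script, using only the standard library, might look like this. The helper name `get_step_param` and the parameter name `learning_rate` are hypothetical and not part of the SDK.

```python
import os

# Prefix stated in the params description above.
PARAM_PREFIX = "AML_PARAMETER_"


def get_step_param(name, default=None, environ=None):
    """Return the value registered for `name`, or `default` if it is absent.

    `get_step_param` is a hypothetical helper, not an SDK function.
    """
    env = os.environ if environ is None else environ
    return env.get(PARAM_PREFIX + name, default)


if __name__ == "__main__":
    # Environment variables are strings, so convert explicitly.
    learning_rate = float(get_step_param("learning_rate", default="0.01"))
    print(f"learning_rate={learning_rate}")
```

Passing `environ` explicitly makes the helper easy to exercise without touching the real process environment.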

Remarks

A PythonScriptStep is a basic, built-in step that runs a Python script on a compute target. It takes a script name and other optional parameters, such as arguments for the script, the compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used. You can also use a RunConfiguration to specify requirements for the PythonScriptStep, such as conda dependencies and a Docker image.

The best practice for working with PythonScriptStep is to use a separate folder for scripts and any dependent files associated with the step, and specify that folder with the source_directory parameter. Following this best practice has two benefits. First, it helps reduce the size of the snapshot created for the step because only what is needed for the step is snapshotted. Second, the step's output from a previous run can be reused if there are no changes to the source_directory that would trigger a re-upload of the snapshot.
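
To see why a dedicated folder helps with reuse, consider a content fingerprint of the folder: any file change yields a different fingerprint, so unrelated files in a shared folder would invalidate earlier results. The sketch below is an illustration only. It is not the algorithm Azure ML uses, `fingerprint_directory` is a hypothetical helper, and it does not handle .amlignore or .gitignore.

```python
import hashlib
from pathlib import Path


def fingerprint_directory(directory):
    """Return a SHA-256 hex digest over relative file paths and contents."""
    root = Path(directory)
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Two calls on an unchanged folder return the same digest, while editing, adding, or renaming any file changes it.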

The following code example shows how to use a PythonScriptStep in a machine learning training scenario. For more details on this example, see https://aka.ms/pl-first-pipeline.


   from azureml.pipeline.steps import PythonScriptStep

   # blob_input_data, output_data1, compute_target, and project_folder are
   # assumed to be defined earlier in the pipeline script.
   train_step = PythonScriptStep(
       script_name="train.py",
       arguments=["--input", blob_input_data, "--output", output_data1],
       inputs=[blob_input_data],
       outputs=[output_data1],
       compute_target=compute_target,
       source_directory=project_folder
   )
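
On the script side, the values in `arguments` arrive as ordinary command-line arguments. A minimal sketch of how train.py might parse them follows. It assumes that the "--input" and "--output" values resolve to paths on the compute target at run time and that the output path is a directory. The training logic is omitted, and `parse_args` and `main` are hypothetical names.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the --input and --output arguments passed by the step."""
    parser = argparse.ArgumentParser(description="Training script for the step")
    parser.add_argument("--input", required=True, help="Path to the input data")
    parser.add_argument("--output", required=True, help="Path for the step output")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Create the output location before writing to it.
    os.makedirs(args.output, exist_ok=True)
    print(f"Reading from {args.input}, writing to {args.output}")
    return args


if __name__ == "__main__":
    main()
```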

PythonScriptStep supports several input and output types. These include DatasetConsumptionConfig for inputs, and OutputDatasetConfig, PipelineOutputAbstractDataset, and PipelineData for both inputs and outputs.

Below is an example of using Dataset as a step input and output:


   from azureml.core import Dataset
   from azureml.pipeline.steps import PythonScriptStep
   from azureml.pipeline.core import PipelineData

   # workspace, datastore, compute_target, and project_folder are assumed
   # to be defined earlier in the pipeline script.

   # get input dataset
   input_ds = Dataset.get_by_name(workspace, 'weather_ds')

   # register pipeline output as dataset
   output_ds = PipelineData('prepared_weather_ds', datastore=datastore).as_dataset()
   output_ds = output_ds.register(name='prepared_weather_ds', create_new_version=True)

   # configure pipeline step to use dataset as the input and output
   prep_step = PythonScriptStep(script_name="prepare.py",
                                inputs=[input_ds.as_named_input('weather_ds')],
                                outputs=[output_ds],
                                compute_target=compute_target,
                                source_directory=project_folder)

See the corresponding documentation pages for examples of using other input and output types.

Methods

create_node

Create a node for PythonScriptStep and add it to the specified graph.

This method is not intended to be used directly. When a pipeline is instantiated with this step, Azure ML automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.

create_node(graph, default_datastore, context)

Parameters

Name	Description
graph Required	Graph The graph object to add the node to.
default_datastore Required	Union[AbstractAzureStorageDatastore, AzureDataLakeDatastore] The default datastore.
context Required	azureml.pipeline.core._GraphContext The graph context.

Returns

Type	Description
Node	The created node.
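
In practice, a step reaches create_node indirectly when it is handed to a Pipeline. A minimal sketch follows, assuming the azureml-core and azureml-pipeline packages are installed and that `workspace` and `step` already exist. The function name `submit_single_step_pipeline` and the default experiment name are hypothetical.

```python
def submit_single_step_pipeline(workspace, step, experiment_name="python-script-step-demo"):
    """Build a one-step pipeline and submit it as an experiment run."""
    # Imported here so the sketch can be loaded without the SDK installed.
    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline

    # Instantiating the Pipeline with the step is what adds it to the graph.
    pipeline = Pipeline(workspace=workspace, steps=[step])
    return Experiment(workspace, experiment_name).submit(pipeline)
```

The caller never passes a graph, datastore, or context; those are supplied by the pipeline when it builds its graph.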
