OutputFileDatasetConfig Class

Represents how to copy the output of a run and promote it to a FileDataset.

The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.

An example of not passing any arguments:


   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   output = OutputFileDatasetConfig()

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

An example of creating an output, promoting it to a tabular dataset, and registering it under the name foo:


   from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   datastore = Datastore(workspace, 'example_adls_gen2_datastore')

   # For more information on the parameters and methods, see the corresponding documentation below.
   # The datastore is used as the destination here; the path within it is an illustrative choice.
   output = OutputFileDatasetConfig(destination=(datastore, 'outputs/{run-id}')).read_delimited_files().register_on_complete('foo')

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

Inheritance
OutputFileDatasetConfig

Constructor

OutputFileDatasetConfig(name=None, destination=None, source=None, partition_format=None)

Parameters

name
str
Optional

The name of the output specific to this run. This is generally used for lineage purposes. If set to None, we will automatically generate a name. The name also becomes an environment variable that contains the local path where you can write your output files and folders, which will then be uploaded to the destination.

destination
tuple
Optional

The destination to copy the output to. If set to None, we will copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}, where run-id is the Run's ID and output-name is the output name from the name parameter above. The destination is a tuple where the first item is the datastore and the second item is the path within the datastore to copy the data to.

The path within the datastore can be a template path, that is, a regular path containing placeholders that are resolved at the appropriate time. The syntax for placeholders is {placeholder}, for example, /path/with/{placeholder}. Currently only two placeholders are supported, {run-id} and {output-name}. See the sketch after this parameter list for an example.

source
str
Optional

The path within the compute target to copy the data from. If set to None, we will set this to a directory we create inside the compute target's OS temporary directory.

partition_format
str
Optional

Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
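
As an illustration of the destination and partition_format parameters above, here is a minimal sketch; the datastore name, output name, and paths are assumptions rather than values from this documentation:

   from azureml.core import Datastore, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   datastore = Datastore(workspace, 'example_datastore')  # assumed datastore name

   output = OutputFileDatasetConfig(
       name='model_stats',  # assumed output name
       # Template path: {run-id} and {output-name} are resolved when the run executes.
       destination=(datastore, 'outputs/{run-id}/{output-name}'),
       # If the run writes files such as <Department>/<yyyy>/<MM>/<dd>/data.parquet,
       # the partition information can be extracted into columns when the output is
       # later promoted to a tabular dataset.
       partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet',
   )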


Remarks

You can pass the OutputFileDatasetConfig as an argument to your run, and it will be automatically translated into a local path on the compute. The source argument will be used if one is specified; otherwise we will automatically generate a directory in the OS's temp folder. The files and folders inside the source directory will then be copied to the destination based on the output configuration.

By default, the mode by which the output is copied to the destination storage is mount. For more information about mount mode, please see the documentation for as_mount.
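
For illustration, here is a minimal sketch of what the training script ('train.py' in the examples above) might do with the translated path; the argument handling shown here is an assumption, not prescribed by the SDK:

   # train.py (sketch): inside the run, the OutputFileDatasetConfig argument is
   # replaced with a local path on the compute target.
   import os
   import sys

   output_dir = sys.argv[1]  # assumed: the output is passed as the first script argument
   os.makedirs(output_dir, exist_ok=True)

   # Everything written under output_dir is copied to the configured destination.
   # The same path is also available via an environment variable named after the
   # output (see the name parameter).
   with open(os.path.join(output_dir, 'result.txt'), 'w') as f:
       f.write('hello from the run\n')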

Methods

as_input

Specify how to consume the output as an input in subsequent pipeline steps.

as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.

as_input

Specify how to consume the output as an input in subsequent pipeline steps.

as_input(name=None)

Parameters

name
str
Optional

The name of the input specific to the run.

Returns

A DatasetConsumptionConfig instance describing how to deliver the input data.

Return type

DatasetConsumptionConfig
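
A minimal sketch of consuming one step's output as the next step's input; the compute target name, script names, and pipeline setup are assumptions:

   from azureml.core import Experiment, Workspace
   from azureml.data import OutputFileDatasetConfig
   from azureml.pipeline.core import Pipeline
   from azureml.pipeline.steps import PythonScriptStep

   workspace = Workspace.from_config()

   # Output of the first step...
   prepared_data = OutputFileDatasetConfig(name='prepared_data')

   prepare_step = PythonScriptStep(
       name='prepare',
       script_name='prepare.py',        # assumed script
       source_directory='.',
       arguments=[prepared_data],
       compute_target='cpu-cluster',    # assumed compute target name
   )

   # ...consumed as a downloaded input by the second step.
   train_step = PythonScriptStep(
       name='train',
       script_name='train.py',          # assumed script
       source_directory='.',
       arguments=[prepared_data.as_input(name='prepared_data').as_download()],
       compute_target='cpu-cluster',    # assumed compute target name
   )

   pipeline = Pipeline(workspace, steps=[prepare_step, train_step])
   run = Experiment(workspace, 'output_example').submit(pipeline)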
as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

as_mount(disable_metadata_cache=False)

Parameters

disable_metadata_cache
bool
Optional

Whether to cache metadata on the local node. If disabled, a node will not be able to see files generated by other nodes while the job is running.

Returns

An OutputFileDatasetConfig instance with mode set to mount.

Return type

OutputFileDatasetConfig
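
A minimal sketch of explicitly setting mount mode; the output name is an assumption:

   from azureml.data import OutputFileDatasetConfig

   # Mount mode: the output directory is FUSE-mounted, and each file is uploaded
   # when it is closed.
   output = OutputFileDatasetConfig(name='mounted_output').as_mount()
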
as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.

as_upload(overwrite=False, source_globs=None)

Parameters

overwrite
bool
Optional

Whether to overwrite files that already exist in the destination.

source_globs
list[str]
Optional

Glob patterns used to filter files that will be uploaded.

Returns

An OutputFileDatasetConfig instance with mode set to upload.

Return type

OutputFileDatasetConfig
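
A minimal sketch of upload mode with the optional filters; the output name and glob pattern are assumptions:

   from azureml.data import OutputFileDatasetConfig

   # Upload mode: files matching the globs are uploaded at the end of the job,
   # overwriting any files that already exist at the destination.
   output = OutputFileDatasetConfig(name='uploaded_output').as_upload(
       overwrite=True,
       source_globs=['*.csv'],
   )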