OutputFileDatasetConfig Class

Represents how to copy the output of a run and promote it to a FileDataset.

The OutputFileDatasetConfig allows you to specify how you want a particular local path on the compute target to be uploaded to the specified destination. If no arguments are passed to the constructor, we will automatically generate a name, a destination, and a local path.

An example of not passing any arguments:


   from azureml.core import Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   output = OutputFileDatasetConfig()

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

An example of creating an output, promoting it to a tabular dataset, and registering it under the name foo:


   from azureml.core import Datastore, Experiment, ScriptRunConfig, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   experiment = Experiment(workspace, 'output_example')

   datastore = Datastore(workspace, 'example_adls_gen2_datastore')

   # For more information on the parameters and methods, see the corresponding documentation below.
   # The datastore is used as the destination here; the path within it is an illustrative choice.
   output = OutputFileDatasetConfig(destination=(datastore, 'outputs/{run-id}')).read_delimited_files().register_on_complete('foo')

   script_run_config = ScriptRunConfig('.', 'train.py', arguments=[output])

   run = experiment.submit(script_run_config)
   print(run)

Inheritance
OutputFileDatasetConfig

Constructor

OutputFileDatasetConfig(name=None, destination=None, source=None, partition_format=None)

Parameters

name
str
Optional

The name of the output specific to this run. This is generally used for lineage purposes. If set to None, we will automatically generate a name. The name also becomes an environment variable that contains the local path where you can write your output files and folders, which will then be uploaded to the destination.

destination
tuple
Optional

The destination to copy the output to. If set to None, we will copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}, where run-id is the Run's ID and output-name is the output name from the name parameter above. The destination is a tuple where the first item is the datastore and the second item is the path within the datastore to copy the data to.

The path within the datastore can be a template path, that is, a regular path containing placeholders that are resolved at the appropriate time. The syntax for placeholders is {placeholder}, for example, /path/with/{placeholder}. Currently only two placeholders are supported, {run-id} and {output-name}. See the sketch after this parameter list for an example.

source
str
Optional

The path within the compute target to copy the data from. If set to None, we will set this to a directory we create inside the compute target's OS temporary directory.

partition_format
str
Optional

Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
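
As an illustration of the destination and partition_format parameters above, here is a minimal sketch; the datastore name, output name, and paths are assumptions rather than values from this documentation:

   from azureml.core import Datastore, Workspace
   from azureml.data import OutputFileDatasetConfig

   workspace = Workspace.from_config()
   datastore = Datastore(workspace, 'example_datastore')  # assumed datastore name

   output = OutputFileDatasetConfig(
       name='model_stats',  # assumed output name
       # Template path: {run-id} and {output-name} are resolved when the run executes.
       destination=(datastore, 'outputs/{run-id}/{output-name}'),
       # If the run writes files such as <Department>/<yyyy>/<MM>/<dd>/data.parquet,
       # the partition information can be extracted into columns when the output is
       # later promoted to a tabular dataset.
       partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet',
   )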


Remarks

You can pass the OutputFileDatasetConfig as an argument to your run, and it will be automatically translated into a local path on the compute. The source argument will be used if one is specified; otherwise we will automatically generate a directory in the OS's temp folder. The files and folders inside the source directory will then be copied to the destination based on the output configuration.

By default, the mode by which the output is copied to the destination storage is mount. For more information about mount mode, please see the documentation for as_mount.
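
For illustration, here is a minimal sketch of what the training script ('train.py' in the examples above) might do with the translated path; the argument handling shown here is an assumption, not prescribed by the SDK:

   # train.py (sketch): inside the run, the OutputFileDatasetConfig argument is
   # replaced with a local path on the compute target.
   import os
   import sys

   output_dir = sys.argv[1]  # assumed: the output is passed as the first script argument
   os.makedirs(output_dir, exist_ok=True)

   # Everything written under output_dir is copied to the configured destination.
   # The same path is also available via an environment variable named after the
   # output (see the name parameter).
   with open(os.path.join(output_dir, 'result.txt'), 'w') as f:
       f.write('hello from the run\n')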

Methods

as_input

Specify how to consume the output as an input in subsequent pipeline steps.

as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.

as_input

Specify how to consume the output as an input in subsequent pipeline steps.

as_input(name=None)

Parameters

name
str
Optional

The name of the input specific to the run.

Returns

A DatasetConsumptionConfig instance describing how to deliver the input data.

Return type

DatasetConsumptionConfig
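
A minimal sketch of consuming one step's output as the next step's input; the compute target name, script names, and pipeline setup are assumptions:

   from azureml.core import Experiment, Workspace
   from azureml.data import OutputFileDatasetConfig
   from azureml.pipeline.core import Pipeline
   from azureml.pipeline.steps import PythonScriptStep

   workspace = Workspace.from_config()

   # Output of the first step...
   prepared_data = OutputFileDatasetConfig(name='prepared_data')

   prepare_step = PythonScriptStep(
       name='prepare',
       script_name='prepare.py',        # assumed script
       source_directory='.',
       arguments=[prepared_data],
       compute_target='cpu-cluster',    # assumed compute target name
   )

   # ...consumed as a downloaded input by the second step.
   train_step = PythonScriptStep(
       name='train',
       script_name='train.py',          # assumed script
       source_directory='.',
       arguments=[prepared_data.as_input(name='prepared_data').as_download()],
       compute_target='cpu-cluster',    # assumed compute target name
   )

   pipeline = Pipeline(workspace, steps=[prepare_step, train_step])
   run = Experiment(workspace, 'output_example').submit(pipeline)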
as_mount

Set the mode of the output to mount.

For mount mode, the output directory will be a FUSE mounted directory. Files written to the mounted directory will be uploaded when the file is closed.

as_mount(disable_metadata_cache=False)

Parameters

disable_metadata_cache
bool
Optional

Whether to cache metadata on the local node. If disabled, a node will not be able to see files generated by other nodes while the job is running.

Returns

An OutputFileDatasetConfig instance with mode set to mount.

Return type

OutputFileDatasetConfig
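
A minimal sketch of explicitly setting mount mode; the output name is an assumption:

   from azureml.data import OutputFileDatasetConfig

   # Mount mode: the output directory is FUSE-mounted, and each file is uploaded
   # when it is closed.
   output = OutputFileDatasetConfig(name='mounted_output').as_mount()
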
as_upload

Set the mode of the output to upload.

For upload mode, files written to the output directory will be uploaded at the end of the job. If the job fails or gets canceled, then the output directory will not be uploaded.

as_upload(overwrite=False, source_globs=None)

Parameters

overwrite
bool
Optional

Whether to overwrite files that already exist in the destination.

source_globs
list[str]
Optional

Glob patterns used to filter files that will be uploaded.

Returns

An OutputFileDatasetConfig instance with mode set to upload.

Return type

OutputFileDatasetConfig
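
A minimal sketch of upload mode with the optional filters; the output name and glob pattern are assumptions:

   from azureml.data import OutputFileDatasetConfig

   # Upload mode: files matching the globs are uploaded at the end of the job,
   # overwriting any files that already exist at the destination.
   output = OutputFileDatasetConfig(name='uploaded_output').as_upload(
       overwrite=True,
       source_globs=['*.csv'],
   )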