PipelineOutputFileDataset Class


Represents intermediate pipeline data promoted to an Azure Machine Learning File Dataset.

Once intermediate data is promoted to an Azure Machine Learning Dataset, subsequent steps consume it as a Dataset instead of a DataReference.

Creates intermediate data that will be promoted to an Azure Machine Learning Dataset.

Inheritance: PipelineOutputAbstractDataset → PipelineOutputFileDataset

Constructor

PipelineOutputFileDataset(pipeline_data)

Parameters

Name	Description
pipeline_data Required	PipelineData The PipelineData that represents the intermediate output which will be promoted to a Dataset.
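
A minimal sketch of how an instance is typically obtained, assuming the azureml-pipeline-core package is installed. Instead of calling the constructor directly, it uses PipelineData.as_dataset(), which wraps the PipelineData in a PipelineOutputFileDataset. The helper name and the output name "prepared_files" are hypothetical.

```python
def promote_intermediate_output(name="prepared_files"):
    """Return (PipelineData, promoted file dataset) for a placeholder output name."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.pipeline.core import PipelineData

    # No datastore is passed; assumption: the workspace default datastore
    # is used when the pipeline is submitted.
    pipeline_data = PipelineData(name)

    # as_dataset() promotes the intermediate output to a file dataset.
    file_dataset = pipeline_data.as_dataset()
    return pipeline_data, file_dataset
```

The producing step would list the returned object among its outputs, and a later step would list it among its inputs.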

Methods

as_direct	Set the consumption mode of the dataset to direct. In this mode, the script receives the ID of the dataset and can call Dataset.get_by_id to retrieve the dataset. run.input_datasets['{dataset_name}'] returns the Dataset.
as_download	Set the consumption mode of the dataset to download.
as_mount	Set the consumption mode of the dataset to mount.
parse_delimited_files	Transform the intermediate file dataset to a tabular dataset. The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.
parse_parquet_files	Transform the intermediate file dataset to a tabular dataset. The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

as_direct

Set the consumption mode of the dataset to direct.

In this mode, the script receives the ID of the dataset and can call Dataset.get_by_id to retrieve the dataset. run.input_datasets['{dataset_name}'] returns the Dataset.

as_direct()

Returns

Type	Description
PipelineOutputFileDataset	The modified PipelineOutputDataset.
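
The script-side lookup described above can be sketched as follows. The helper name and the dataset name "prepared_files" are hypothetical. The run object is passed in so the function does not depend on the SDK being importable. In a real step script it would come from Run.get_context().

```python
def get_direct_input(run, dataset_name):
    """Return the Dataset exposed to this run under dataset_name."""
    # In direct mode the run exposes the Dataset object itself,
    # keyed by the name the input was given in the pipeline definition.
    return run.input_datasets[dataset_name]


# Inside a real step script (requires the azureml-core package):
#   from azureml.core import Run
#   dataset = get_direct_input(Run.get_context(), "prepared_files")
```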

as_download

Set the consumption mode of the dataset to download.

as_download(path_on_compute=None)

Parameters

Name	Description
path_on_compute	str The path on the compute to download the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you. default value: None

Returns

Type	Description
PipelineOutputFileDataset	The modified PipelineOutputDataset.

as_mount

Set the consumption mode of the dataset to mount.

as_mount(path_on_compute=None)

Parameters

Name	Description
path_on_compute	str The path on the compute to mount the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you. default value: None

Returns

Type	Description
PipelineOutputFileDataset	The modified PipelineOutputDataset.
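
As a rough illustration of the three consumption modes, the sketch below selects one by name. The helper set_consumption_mode and its mode strings are hypothetical. Only the as_direct, as_download and as_mount calls come from this page.

```python
def set_consumption_mode(file_dataset, mode="mount", path_on_compute=None):
    """Apply one consumption mode and return the modified dataset object."""
    if mode == "mount":
        # Files are mounted on the compute; None lets the service pick the path.
        return file_dataset.as_mount(path_on_compute)
    if mode == "download":
        # Files are copied to the compute before the script runs.
        return file_dataset.as_download(path_on_compute)
    if mode == "direct":
        # The script receives the dataset itself rather than a local path.
        return file_dataset.as_direct()
    raise ValueError(f"unknown consumption mode: {mode!r}")
```

Because each method returns the modified object, the result can be passed straight into a step's inputs.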

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, file_extension='', set_column_types=None, quoted_line_breaks=False)

Parameters

Name	Description
include_path	bool Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or when the file path itself carries useful information. default value: False
separator	str The separator used to split columns. default value: ,
header	PromoteHeadersBehavior Controls how column headers are promoted when reading from files. Defaults to assume that all files have the same header. default value: PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS
partition_format	str Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and run until the end of the file path. For example, given the path '../Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. default value: None
file_extension	str The file extension of the files to read. Only files with this extension will be read from the directory. Default value is '.csv' when the separator is ',' and '.tsv' when the separator is tab, and None otherwise. If None is passed, all files will be read regardless of their extension (or lack of extension).
set_column_types	dict[str, DataType] A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored. default value: None
quoted_line_breaks	bool Whether to handle new line characters within quotes. This option can impact performance. default value: False
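
To make the partition_format rules concrete, here is a pure-Python illustration of how such a pattern could be matched against a path. It is not the SDK's implementation. The function names are hypothetical, and only the '{column_name}' and '{column_name:yyyy/MM/dd/HH/mm/ss}' forms described above are handled.

```python
import re
from datetime import datetime

# Format tokens from the description, mapped to (regex, strptime directive).
_DATE_TOKENS = {
    "yyyy": (r"\d{4}", "%Y"),
    "MM": (r"\d{2}", "%m"),
    "dd": (r"\d{2}", "%d"),
    "HH": (r"\d{2}", "%H"),
    "mm": (r"\d{2}", "%M"),
    "ss": (r"\d{2}", "%S"),
}
_TOKEN_RE = re.compile("|".join(_DATE_TOKENS))
_FIELD_RE = re.compile(r"\{(\w+)(?::([^}]*))?\}")


def _translate(date_fmt):
    """Turn 'yyyy/MM/dd' into a matching regex and a strptime format."""
    regex, strp, pos = "", "", 0
    for tok in _TOKEN_RE.finditer(date_fmt):
        literal = date_fmt[pos:tok.start()]
        regex += re.escape(literal) + _DATE_TOKENS[tok.group()][0]
        strp += literal + _DATE_TOKENS[tok.group()][1]
        pos = tok.end()
    literal = date_fmt[pos:]
    return regex + re.escape(literal), strp + literal


def extract_partitions(path, partition_format):
    """Return {column: value} parsed from the end of path, or None if no match."""
    pattern, date_formats, pos = "", {}, 0
    for field in _FIELD_RE.finditer(partition_format):
        pattern += re.escape(partition_format[pos:field.start()])
        name, date_fmt = field.group(1), field.group(2)
        if date_fmt is None:
            # Plain '{name}' becomes a string column: one path segment.
            pattern += f"(?P<{name}>[^/]+)"
        else:
            # '{name:...}' becomes a datetime column.
            regex, date_formats[name] = _translate(date_fmt)
            pattern += f"(?P<{name}>{regex})"
        pos = field.end()
    # The format must run to the end of the file path.
    pattern += re.escape(partition_format[pos:]) + "$"
    match = re.search(pattern, path)
    if match is None:
        return None
    return {
        name: datetime.strptime(value, date_formats[name])
        if name in date_formats else value
        for name, value in match.groupdict().items()
    }
```

Applied to the example path '../Accounts/2019/01/01/data.csv' with the format '/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv', this yields a 'Department' string and a 'PartitionDate' datetime, mirroring the two columns described in the parameter table.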

Returns

Type	Description
PipelineOutputTabularDataset	Returns an intermediate data that will be a tabular dataset.

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect on the data written as the step's output, even if the transformed object is passed as an output.
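
A small sketch of calling parse_delimited_files with the parameters documented above. The helper name is hypothetical, and the partition format reuses the example from the parameter table. The argument is expected to be a PipelineOutputFileDataset, and the result is meant to be passed as an input to a later step.

```python
def as_tabular_input(file_dataset):
    """Request tabular parsing of delimited files for a downstream step."""
    return file_dataset.parse_delimited_files(
        include_path=True,   # keep the source file path as a column
        separator=",",       # comma-separated columns
        partition_format="/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv",
    )
```

Per the remarks, the parsing only takes effect where the returned object is consumed as a step input.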

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

parse_parquet_files(include_path=False, partition_format=None, file_extension='.parquet', set_column_types=None)

Parameters

Name	Description
include_path	bool Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when reading multiple files and you want to know which file a particular record originated from, or when the file path itself carries useful information. default value: False
partition_format	str Specify the partition format of path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. Format part '{column_name}' creates string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of first partition key until the end of file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'. default value: None
file_extension	str The file extension of the files to read. Only files with this extension will be read from the directory. Default value is '.parquet'. If this is set to None, all files will be read regardless of their extension (or lack of extension). default value: .parquet
set_column_types	dict[str, DataType] A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will keep the type loaded from the parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored. default value: None

Returns

Type	Description
PipelineOutputTabularDataset	Returns an intermediate data that will be a tabular dataset.

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect on the data written as the step's output, even if the transformed object is passed as an output.
