PipelineOutputFileDataset Class

Represents intermediate pipeline data promoted to an Azure Machine Learning File Dataset.

Once intermediate data is promoted to an Azure Machine Learning Dataset, it is consumed as a Dataset instead of a DataReference in subsequent steps.

Create an intermediate data that will be promoted to an Azure Machine Learning Dataset.

Inheritance
PipelineOutputFileDataset

Constructor

PipelineOutputFileDataset(pipeline_data)

Parameters

pipeline_data
PipelineData
Required

The PipelineData that represents the intermediate output which will be promoted to a Dataset.
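
In practice, you rarely call this constructor directly; a PipelineOutputFileDataset is usually obtained by calling as_dataset() on a PipelineData object. A minimal sketch, where the datastore name and output name are illustrative assumptions:

```python
from azureml.core import Workspace, Datastore
from azureml.pipeline.core import PipelineData

ws = Workspace.from_config()
datastore = Datastore(ws, "workspaceblobstore")  # assumed datastore name

# Intermediate output produced by an earlier pipeline step.
raw_output = PipelineData("raw_output", datastore=datastore)

# Promote the intermediate output to a File Dataset; subsequent steps
# will consume it as a Dataset instead of a DataReference.
file_dataset = raw_output.as_dataset()  # a PipelineOutputFileDataset
```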

Methods

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; in the script, call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'], which returns the Dataset.

as_download

Set the consumption mode of the dataset to download.

as_mount

Set the consumption mode of the dataset to mount.

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; in the script, call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'], which returns the Dataset.

as_direct()

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
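
A minimal sketch of direct consumption, continuing from the constructor example above; the input name 'prepped' is a placeholder assumption:

```python
from azureml.core import Dataset, Run

# Pipeline definition side: request direct consumption.
direct_input = file_dataset.as_direct()

# Inside the step's script: the input resolves to a dataset ID rather
# than to files on disk, so retrieve the Dataset object through the run.
run = Run.get_context()
dataset = run.input_datasets['prepped']  # returns the Dataset

# Equivalent retrieval if the dataset ID is passed in (e.g. as an argument):
# dataset = Dataset.get_by_id(run.experiment.workspace, dataset_id)
```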

as_download

Set the consumption mode of the dataset to download.

as_download(path_on_compute=None)

Parameters

path_on_compute
str
default value: None

The path on the compute to download the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
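
For example, continuing from the constructor sketch above (the download path is only an illustration):

```python
# Download the promoted files to an explicit location on the compute target.
download_input = file_dataset.as_download(path_on_compute='/tmp/prepped')

# Or omit path_on_compute and let Azure Machine Learning pick the path.
download_input = file_dataset.as_download()
```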

as_mount

Set the consumption mode of the dataset to mount.

as_mount(path_on_compute=None)

Parameters

path_on_compute
str
default value: None

The path on the compute to mount the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
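
A minimal sketch, continuing from the constructor example above. Mounting is generally preferable for large outputs, since files are streamed on demand rather than copied before the step starts:

```python
# Mount the promoted files on the compute target; with no path_on_compute,
# Azure Machine Learning picks the mount point.
mount_input = file_dataset.as_mount()
```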

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, file_extension='', set_column_types=None, quoted_line_breaks=False)

Parameters

include_path
bool
default value: False

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when you are reading multiple files and want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

separator
str
default value: ,

The separator used to split columns.

header
PromoteHeadersBehavior
default value: PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS

Controls how column headers are promoted when reading from files. Defaults to assume that all files have the same header.

partition_format
str
default value: None

Specify the partition format of the path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

file_extension
str
default value: ''

The file extension of the files to read. Only files with this extension will be read from the directory. By default, the extension is inferred from the separator: '.csv' when the separator is ',', '.tsv' when the separator is a tab, and None otherwise. If None is passed, all files will be read regardless of their extension (or lack of extension).

set_column_types
dict[str, DataType]
default value: None

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

quoted_line_breaks
bool
default value: False

Whether to handle new line characters within quotes. This option can impact performance.

Returns

An intermediate data object that will be consumed as a tabular dataset.

Return type

PipelineOutputTabularDataset

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect if the object is passed as an output.
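
A hedged sketch of promoting a CSV intermediate output to a tabular dataset, continuing from the constructor example above; the column name 'price' and its type are assumptions:

```python
from azureml.data.dataset_factory import DataType

# Parse the step's CSV output into a tabular dataset; cast one column
# (hypothetical name 'price') to float, leaving the rest as strings.
tabular = file_dataset.parse_delimited_files(
    separator=',',
    file_extension='.csv',
    set_column_types={'price': DataType.to_float()},
)
```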

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

parse_parquet_files(include_path=False, partition_format=None, file_extension='.parquet', set_column_types=None)

Parameters

include_path
bool
default value: False

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when you are reading multiple files and want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

partition_format
str
default value: None

Specify the partition format of the path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

file_extension
str
default value: .parquet

The file extension of the files to read. Only files with this extension will be read from the directory. The default value is '.parquet'. If None is passed, all files will be read regardless of their extension (or lack of extension).

set_column_types
dict[str, DataType]
default value: None

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type loaded from the parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

Returns

An intermediate data object that will be consumed as a tabular dataset.

Return type

PipelineOutputTabularDataset

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect if the object is passed as an output.
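
A minimal sketch, continuing from the constructor example above and reusing the partition format from the parameter description:

```python
# Parse the step's Parquet output into a tabular dataset, extracting
# 'Department' and 'PartitionDate' columns from the folder structure.
tabular = file_dataset.parse_parquet_files(
    partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet',
)
```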