PipelineOutputFileDataset Class

Represents intermediate pipeline data promoted to an Azure Machine Learning File Dataset.

Once intermediate data is promoted to an Azure Machine Learning Dataset, it is consumed as a Dataset instead of a DataReference in subsequent steps.

Create an intermediate data that will be promoted to an Azure Machine Learning Dataset.

Inheritance
PipelineOutputFileDataset

Constructor

PipelineOutputFileDataset(pipeline_data)

Parameters

pipeline_data
PipelineData
Required

The PipelineData that represents the intermediate output which will be promoted to a Dataset.
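
In practice, you rarely call this constructor directly; a PipelineOutputFileDataset is usually obtained by calling as_dataset() on a PipelineData object. A minimal sketch, where the datastore name and output name are illustrative assumptions:

```python
from azureml.core import Workspace, Datastore
from azureml.pipeline.core import PipelineData

ws = Workspace.from_config()
datastore = Datastore(ws, "workspaceblobstore")  # assumed datastore name

# Intermediate output produced by an earlier pipeline step.
raw_output = PipelineData("raw_output", datastore=datastore)

# Promote the intermediate output to a File Dataset; subsequent steps
# will consume it as a Dataset instead of a DataReference.
file_dataset = raw_output.as_dataset()  # a PipelineOutputFileDataset
```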

Methods

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; in the script, call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'], which returns the Dataset.

as_download

Set the consumption mode of the dataset to download.

as_mount

Set the consumption mode of the dataset to mount.

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

as_direct

Set the consumption mode of the dataset to direct.

In this mode, your script receives the ID of the dataset; in the script, call Dataset.get_by_id to retrieve it, or use run.input_datasets['{dataset_name}'], which returns the Dataset.

as_direct()

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
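
A minimal sketch of direct consumption, continuing from the constructor example above; the input name 'prepped' is a placeholder assumption:

```python
from azureml.core import Dataset, Run

# Pipeline definition side: request direct consumption.
direct_input = file_dataset.as_direct()

# Inside the step's script: the input resolves to a dataset ID rather
# than to files on disk, so retrieve the Dataset object through the run.
run = Run.get_context()
dataset = run.input_datasets['prepped']  # returns the Dataset

# Equivalent retrieval if the dataset ID is passed in (e.g. as an argument):
# dataset = Dataset.get_by_id(run.experiment.workspace, dataset_id)
```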

as_download

Set the consumption mode of the dataset to download.

as_download(path_on_compute=None)

Parameters

path_on_compute
str
default value: None

The path on the compute to download the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
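
For example, continuing from the constructor sketch above (the download path is only an illustration):

```python
# Download the promoted files to an explicit location on the compute target.
download_input = file_dataset.as_download(path_on_compute='/tmp/prepped')

# Or omit path_on_compute and let Azure Machine Learning pick the path.
download_input = file_dataset.as_download()
```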

as_mount

Set the consumption mode of the dataset to mount.

as_mount(path_on_compute=None)

Parameters

path_on_compute
str
default value: None

The path on the compute to mount the dataset to. Defaults to None, which means Azure Machine Learning picks a path for you.

Returns

The modified PipelineOutputFileDataset.

Return type

PipelineOutputFileDataset
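
A minimal sketch, continuing from the constructor example above. Mounting is generally preferable for large outputs, since files are streamed on demand rather than copied before the step starts:

```python
# Mount the promoted files on the compute target; with no path_on_compute,
# Azure Machine Learning picks the mount point.
mount_input = file_dataset.as_mount()
```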

parse_delimited_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the delimited file(s) pointed to by the intermediate output.

parse_delimited_files(include_path=False, separator=',', header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS, partition_format=None, file_extension='', set_column_types=None, quoted_line_breaks=False)

Parameters

include_path
bool
default value: False

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when you are reading multiple files and want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

separator
str
default value: ,

The separator used to split columns.

header
PromoteHeadersBehavior
default value: PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS

Controls how column headers are promoted when reading from files. Defaults to assume that all files have the same header.

partition_format
str
default value: None

Specify the partition format of the path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.csv' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

file_extension
str
default value: ''

The file extension of the files to read. Only files with this extension will be read from the directory. By default, the extension is inferred from the separator: '.csv' when the separator is ',', '.tsv' when the separator is a tab, and None otherwise. If None is passed, all files will be read regardless of their extension (or lack of extension).

set_column_types
dict[str, DataType]
default value: None

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type string. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

quoted_line_breaks
bool
default value: False

Whether to handle new line characters within quotes. This option can impact performance.

Returns

An intermediate data object that will be consumed as a tabular dataset.

Return type

PipelineOutputTabularDataset

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect if the object is passed as an output.
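
A hedged sketch of promoting a CSV intermediate output to a tabular dataset, continuing from the constructor example above; the column name 'price' and its type are assumptions:

```python
from azureml.data.dataset_factory import DataType

# Parse the step's CSV output into a tabular dataset; cast one column
# (hypothetical name 'price') to float, leaving the rest as strings.
tabular = file_dataset.parse_delimited_files(
    separator=',',
    file_extension='.csv',
    set_column_types={'price': DataType.to_float()},
)
```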

parse_parquet_files

Transform the intermediate file dataset to a tabular dataset.

The tabular dataset is created by parsing the parquet file(s) pointed to by the intermediate output.

parse_parquet_files(include_path=False, partition_format=None, file_extension='.parquet', set_column_types=None)

Parameters

include_path
bool
default value: False

Boolean to keep path information as a column in the dataset. Defaults to False. This is useful when you are reading multiple files and want to know which file a particular record originated from, or to keep useful information that is encoded in the file path.

partition_format
str
default value: None

Specify the partition format of the path. Defaults to None. The partition information of each path will be extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.parquet' where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.

file_extension
str
default value: .parquet

The file extension of the files to read. Only files with this extension will be read from the directory. The default value is '.parquet'. If None is passed, all files will be read regardless of their extension (or lack of extension).

set_column_types
dict[str, DataType]
default value: None

A dictionary to set column data type, where key is column name and value is DataType. Columns not in the dictionary will remain of type loaded from the parquet file. Passing None will result in no conversions. Entries for columns not found in the source data will not cause an error and will be ignored.

Returns

An intermediate data object that will be consumed as a tabular dataset.

Return type

PipelineOutputTabularDataset

Remarks

This transformation is applied only when the intermediate data is consumed as the input of a subsequent step. It has no effect if the object is passed as an output.
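
A minimal sketch, continuing from the constructor example above and reusing the partition format from the parameter description:

```python
# Parse the step's Parquet output into a tabular dataset, extracting
# 'Department' and 'PartitionDate' columns from the folder structure.
tabular = file_dataset.parse_parquet_files(
    partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.parquet',
)
```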