FileDatasetFactory Class

Contains methods to create a file dataset for Azure Machine Learning.

A FileDataset is created using the from_files method defined in this class.

For more information on working with file datasets, see the notebook https://aka.ms/filedataset-samplenotebook.

Inheritance
builtins.object
FileDatasetFactory

Methods

from_files

Create a FileDataset to represent file streams.

upload_directory

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Create a dataset from a source directory.

from_files

Create a FileDataset to represent file streams.

from_files(path, validate=True, partition_format=None)

Parameters

path
Union[str, list[str], DataPath, list[DataPath], (Datastore, str), list[(Datastore, str)]]

The path to the source files, which can be a single value or a list of HTTP URL strings, DataPath objects, or tuples of Datastore and relative path.

validate
bool

Indicates whether to validate if data can be loaded from the returned dataset. Defaults to True. Validation requires that the data source is accessible from the current compute.

partition_format
str

Specify the partition format of path. Defaults to None. The partition information of each path is extracted into columns based on the specified format. The format part '{column_name}' creates a string column, and '{column_name:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' extract the year, month, day, hour, minute and second for the datetime type. The format should start at the position of the first partition key and continue to the end of the file path. For example, given the path '../Accounts/2019/01/01/data.jsonl', where the partition is by department name and time, partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.jsonl' creates a string column 'Department' with the value 'Accounts' and a datetime column 'PartitionDate' with the value '2019-01-01'.
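
A minimal sketch of the example above. The workspace handle and the glob pattern are assumptions for illustration; they match a datastore laid out as '/{Department}/{yyyy}/{MM}/{dd}/data.jsonl'.

   from azureml.core import Dataset, Datastore

   # 'workspace' is assumed to be an existing Workspace handle
   datastore = Datastore.get(workspace, 'workspaceblobstore')

   # each matched path such as '/Accounts/2019/01/01/data.jsonl' yields
   # Department='Accounts' and PartitionDate=2019-01-01
   partitioned_dataset = Dataset.File.from_files(
       path=(datastore, '*/*/*/*/data.jsonl'),
       partition_format='/{Department}/{PartitionDate:yyyy/MM/dd}/data.jsonl'
   )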

Returns

A FileDataset object.

Return type

FileDataset

Remarks

from_files creates an object of the FileDataset class, which defines the operations to load file streams from the provided path.

For the data to be accessible by Azure Machine Learning, the files specified by path must be located in a Datastore or be accessible with public web URLs.

Creating a dataset from URLs of Blob, ADLS Gen1, and ADLS Gen2 is supported now (preview). The user's AAD token is used for authentication in a notebook or local Python program if it directly calls one of these functions: FileDataset.mount, FileDataset.download, FileDataset.to_path, TabularDataset.to_pandas_dataframe, TabularDataset.to_dask_dataframe, TabularDataset.to_spark_dataframe, TabularDataset.to_parquet_files, or TabularDataset.to_csv_files. In jobs submitted by Experiment.submit, the identity of the compute target is used for data access authentication. Learn more: https://aka.ms/data-access


   from azureml.core import Dataset, Datastore, Workspace

   # load the workspace from a local config.json
   workspace = Workspace.from_config()

   # create file dataset from a single file in datastore
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   file_dataset_1 = Dataset.File.from_files(path=(datastore, 'image/dog.jpg'))

   # create file dataset from a single directory in datastore
   file_dataset_2 = Dataset.File.from_files(path=(datastore, 'image/'))

   # create file dataset from all jpeg files in the directory
   file_dataset_3 = Dataset.File.from_files(path=(datastore, 'image/**/*.jpg'))

   # create file dataset from multiple paths
   data_paths = [(datastore, 'image/dog.jpg'), (datastore, 'image/cat.jpg')]
   file_dataset_4 = Dataset.File.from_files(path=data_paths)

   # create file dataset from url
   file_dataset_5 = Dataset.File.from_files(path='https://url/image/cat.jpg')
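
As a sketch of consuming such a dataset locally, where the caller's AAD token is used for authentication as described above. The target path '/tmp/images' is an arbitrary local directory chosen for illustration.

   # download the file streams to a local directory; when run locally,
   # the user's AAD token is used for data access
   local_paths = file_dataset_3.download(target_path='/tmp/images', overwrite=True)

   # or resolve the stream paths without materializing the files
   relative_paths = file_dataset_3.to_path()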

upload_directory

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Create a dataset from a source directory.

upload_directory(src_dir, target, pattern=None, overwrite=False, show_progress=True)

Parameters

src_dir
str

The local directory to upload.

target
Union[DataPath, Datastore, (Datastore, str)]

Required. The datastore path to which the files will be uploaded.

pattern
str

Optional. If provided, filters all path names matching the given pattern, similar to the Python glob package, supporting '*', '?', and character ranges expressed with [].

overwrite
bool

Optional. Indicates whether to overwrite existing data. Defaults to False.

show_progress
bool

Optional. Indicates whether to show the progress of the upload in the console. Defaults to True.

Returns

The registered dataset.

Return type

FileDataset
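
A minimal usage sketch, assuming an existing Workspace handle named workspace; the local folder 'data' and target path 'uploaded-data' are hypothetical names chosen for illustration.

   from azureml.core import Dataset, Datastore

   datastore = Datastore.get(workspace, 'workspaceblobstore')

   # upload the local 'data' folder to the datastore and get back
   # a dataset representing the uploaded files
   dataset = Dataset.File.upload_directory(
       src_dir='data',
       target=(datastore, 'uploaded-data'),
       pattern='*.csv',      # optional glob filter; only csv files are uploaded
       show_progress=True
   )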