FileDataset Class

Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until FileDataset is asked to deliver data.

A FileDataset is created using the from_files method of the FileDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

Initialize the FileDataset object.

Do not invoke this constructor directly. Create the dataset using the FileDatasetFactory class instead.

Inheritance: AbstractDataset

FileDataset

Constructor

FileDataset()

Remarks

A FileDataset can be used as the input of an experiment run. It can also be registered to a workspace under a specified name and retrieved by that name later.

FileDataset can be subsetted by invoking different subsetting methods available on this class. The result of subsetting is always a new FileDataset.

The actual data loading happens when FileDataset is asked to deliver the data into another storage mechanism (e.g. files downloaded or mounted to local path).
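
The two properties described above (subsetting returns a new object, and nothing is read until data is requested) can be illustrated with a small toy model. This is not the azureml implementation: `LazyFileSet`, `list_files`, and `to_list` are hypothetical names used only for this sketch.

```python
from typing import Callable, List


class LazyFileSet:
    """Toy stand-in for a lazily evaluated, immutable file collection."""

    def __init__(self, source: Callable[[], List[str]], steps: tuple = ()):
        self._source = source  # called only when data is requested
        self._steps = steps    # recorded operations, never mutated in place

    def skip(self, count: int) -> "LazyFileSet":
        # Return a NEW object with one more recorded step.
        return LazyFileSet(self._source, self._steps + (("skip", count),))

    def take(self, count: int) -> "LazyFileSet":
        return LazyFileSet(self._source, self._steps + (("take", count),))

    def to_list(self) -> List[str]:
        # The source is read here, not when skip/take were called.
        items = list(self._source())
        for op, count in self._steps:
            items = items[count:] if op == "skip" else items[:count]
        return items


load_calls = []


def list_files() -> List[str]:
    load_calls.append(1)  # record each time the source is actually read
    return ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]


base = LazyFileSet(list_files)
subset = base.skip(1).take(2)    # only records the steps
calls_before = len(load_calls)   # still 0: nothing has been loaded
result = subset.to_list()        # the source is read now
```

In this sketch `base` is left unchanged by `skip` and `take`, which mirrors the statement that the result of subsetting is always a new FileDataset.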

Methods

as_cache	Note This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.
as_download	Create a DatasetConsumptionConfig with the mode set to download. In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically; to specify a custom input name, call the as_named_input method.
as_hdfs	Set the mode to hdfs. In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and the OS environment variables.
as_mount	Create a DatasetConsumptionConfig with the mode set to mount. In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically; to specify a custom input name, call the as_named_input method.
download	Download file streams defined by the dataset as local files.
file_metadata	Note This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Get file metadata expression by specifying the metadata column name. Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension, and CanSeek.
filter	Note This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Filter the data, leaving only the records that match the specified expression.
hydrate	Note This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Hydrate the dataset into the requested replicas specified in datacache_store.
mount	Create a context manager for mounting file streams defined by the dataset as local files.
random_split	Split file streams in the dataset into two parts randomly and approximately by the percentage specified. The first dataset returned contains approximately `percentage` of the total number of file references and the second dataset contains the remaining file references.
skip	Skip file streams from the top of the dataset by the specified count.
take	Take a sample of file streams from top of the dataset by the specified count.
take_sample	Take a random sample of file streams in the dataset approximately by the probability specified.
to_path	Get a list of file paths for each file stream defined by the dataset.

as_cache

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.

as_cache(datacache_store)

Parameters

Name	Description
datacache_store Required	DatacacheStore The datacachestore to be used to hydrate.

Returns

Type	Description
DatacacheConsumptionConfig	The configuration object describing how the datacache should be materialized in the run.

as_download

Create a DatasetConsumptionConfig with the mode set to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically; to specify a custom input name, call the as_named_input method.


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_download()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The download location can be retrieved from argument values
   import sys
   download_location = sys.argv[1]

   # The download location can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   download_location = Run.get_context().input_datasets['input_1']

as_download(path_on_compute=None)

Parameters

Name	Description
path_on_compute	str The target path on the compute to make the data available at. default value: None

Remarks

When the dataset is created from the path of a single file, the download location will be the path of that single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it is treated as an absolute path. Otherwise, it is treated as a path relative to the working directory. If you specify an absolute path, make sure that the job has permission to write to that directory.
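
The absolute-versus-relative rule can be sketched as a small helper. `resolve_path_on_compute` and the example directories are hypothetical and only model the rule stated above; the sketch does not call azureml.

```python
import posixpath


def resolve_path_on_compute(path_on_compute: str, working_dir: str) -> str:
    """Hypothetical helper: resolve a target path as the remark describes."""
    if path_on_compute.startswith("/"):
        # Absolute path: used as given.
        return posixpath.normpath(path_on_compute)
    # Relative path: interpreted against the working directory.
    return posixpath.normpath(posixpath.join(working_dir, path_on_compute))


absolute = resolve_path_on_compute("/mnt/data/images", "/home/job/wd")
relative = resolve_path_on_compute("data/images", "/home/job/wd")
```

Here `absolute` stays `/mnt/data/images`, while `relative` becomes `/home/job/wd/data/images`.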

as_hdfs

Set the mode to hdfs.

In the submitted Synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and the OS environment variables.


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_hdfs()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The hdfs path can be retrieved from argument values
   import sys
   hdfs_path = sys.argv[1]

   # The hdfs path can also be retrieved from the OS environment variables.
   import os
   hdfs_path = os.environ['input_<hash>']

as_hdfs()

Remarks

When the dataset is created from the path of a single file, the hdfs path will be the path of that single file. Otherwise, the hdfs path will be the path of the enclosing folder for all the mounted files.

as_mount

Create a DatasetConsumptionConfig with the mode set to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically; to specify a custom input name, call the as_named_input method.


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_mount()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The mount point can be retrieved from argument values
   import sys
   mount_point = sys.argv[1]

   # The mount point can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   mount_point = Run.get_context().input_datasets['input_1']

as_mount(path_on_compute=None)

Parameters

Name	Description
path_on_compute	str The target path on the compute to make the data available at. default value: None

Remarks

When the dataset is created from the path of a single file, the mount point will be the path of that single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.

download

Download file streams defined by the dataset as local files.

download(target_path=None, overwrite=False, ignore_not_found=False)

Parameters

Name	Description
target_path	str The local directory to download the files to. If None, the data will be downloaded into a temporary directory. default value: None
overwrite	bool Indicates whether to overwrite existing files. If True, existing files are overwritten; otherwise an exception is raised. default value: False
ignore_not_found	bool Indicates whether the download should tolerate files that are not found. If False, the download fails if any file download fails for any reason. If True, a warning is logged for not-found errors and the download succeeds as long as no other error types are encountered. default value: False

Returns

Type	Description
list(str)	Returns an array of file paths for each file downloaded.

Remarks

If target_path starts with a /, it is treated as an absolute path. Otherwise, it is treated as a path relative to the current working directory.

file_metadata

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Get file metadata expression by specifying the metadata column name.

Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension, and CanSeek.

file_metadata(col)

Parameters

Name	Description
col Required	str Name of column

Returns

Type	Description
azureml.dataprep.api.expression.RecordFieldExpression	Returns an expression that retrieves the value in the specified column.

filter

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Filter the data, leaving only the records that match the specified expression.

filter(expression)

Parameters

Name	Description
expression Required	azureml.dataprep.api.expression.Expression The expression to evaluate.

Returns

Type	Description
FileDataset	The modified dataset (unregistered).

Remarks

Expressions are started by calling file_metadata with the name of a metadata column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression is lazily evaluated for each record when a data pull occurs, not where it is defined.


   (dataset.file_metadata('Size') > 10000) & (dataset.file_metadata('CanSeek') == True)
   dataset.file_metadata('Extension').starts_with('j')
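
To make the deferred evaluation concrete, the following toy model shows how comparison and logical operators can build an expression object that is only evaluated per record later. It is not the azureml.dataprep implementation: `Expr`, the module-level `file_metadata` function, and the sample records are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


class Expr:
    """Toy deferred expression: wraps a function evaluated per record later."""

    def __init__(self, fn: Callable[[Record], Any]):
        self._fn = fn

    def evaluate(self, record: Record) -> Any:
        return self._fn(record)

    def __gt__(self, other: Any) -> "Expr":
        return Expr(lambda r: self._fn(r) > other)

    def __eq__(self, other: Any) -> "Expr":  # type: ignore[override]
        return Expr(lambda r: self._fn(r) == other)

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(lambda r: bool(self._fn(r)) and bool(other.evaluate(r)))

    def starts_with(self, prefix: str) -> "Expr":
        return Expr(lambda r: str(self._fn(r)).startswith(prefix))


def file_metadata(col: str) -> Expr:
    """Stand-in for a column reference; nothing is read when this is called."""
    return Expr(lambda r: r[col])


records = [
    {"Size": 20000, "CanSeek": True, "Extension": "jpg"},
    {"Size": 500, "CanSeek": True, "Extension": "json"},
    {"Size": 50000, "CanSeek": False, "Extension": "png"},
]

# Building the expressions touches no records.
big_and_seekable = (file_metadata("Size") > 10000) & (file_metadata("CanSeek") == True)
starts_with_j = file_metadata("Extension").starts_with("j")

# Evaluation happens only here, once per record.
matched = [r for r in records if big_and_seekable.evaluate(r)]
j_files = [r["Extension"] for r in records if starts_with_j.evaluate(r)]
```

The parentheses around each comparison matter in the real expressions as well as in this sketch, because `&` binds more tightly than `>` and `==` in Python.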

hydrate

Note

This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Hydrate the dataset into the requested replicas specified in datacache_store.

hydrate(datacache_store, replica_count=None)

Parameters

Name	Description
datacache_store Required	DatacacheStore The datacachestore to be used to hydrate.
replica_count	int, optional Number of replicas to hydrate. default value: None

Returns

Type	Description
DatacacheHydrationTracker	The configuration object describing how the datacache should be materialized in the run.

mount

Create a context manager for mounting file streams defined by the dataset as local files.

mount(mount_point=None, **kwargs)

Parameters

Name	Description
mount_point	str The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find by reading the MountContext.mount_point attribute. default value: None

Returns

Type	Description
MountContext	Returns a context manager for managing the lifecycle of the mount. Upon entering the context manager, the dataflow will be mounted to the mount_point. Upon exit, it will remove the mount point and clean up the daemon process used to mount the dataflow.

Remarks

A context manager will be returned to manage the lifecycle of the mount. To mount, you will need to enter the context manager and to unmount, exit from the context manager.

Mount is only supported on Unix or Unix-like operating systems with the native package libfuse installed. If you are running inside a Docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse.


   import os
   from azureml.core import Dataset, Datastore

   # `workspace` is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))

   with dataset.mount() as mount_context:
       # list top level mounted files and folders in the dataset
       os.listdir(mount_context.mount_point)

   # You can also use the start and stop methods
   mount_context = dataset.mount()
   mount_context.start()  # this will mount the file streams
   mount_context.stop()  # this will unmount the file streams

If mount_point starts with a /, it is treated as an absolute path. Otherwise, it is treated as a path relative to the current working directory.
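
The relationship between the `with` form and the explicit start and stop calls can be sketched with a toy context manager. `ToyMountContext` is hypothetical: it mounts nothing and only creates and removes a temporary directory to mimic the lifecycle described above.

```python
import os
import shutil
import tempfile


class ToyMountContext:
    """Toy stand-in for a mount context; it does not mount anything."""

    def __init__(self, mount_point=None):
        self._requested = mount_point
        self.mount_point = None

    def start(self):
        # "Mount": create (or reuse) the directory that exposes the files.
        self.mount_point = self._requested or tempfile.mkdtemp()
        os.makedirs(self.mount_point, exist_ok=True)
        with open(os.path.join(self.mount_point, "example.txt"), "w"):
            pass

    def stop(self):
        # "Unmount": remove the mount point and release resources.
        shutil.rmtree(self.mount_point, ignore_errors=True)

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # do not swallow exceptions


with ToyMountContext() as ctx:
    inside = os.listdir(ctx.mount_point)  # files are visible only in here
    path_used = ctx.mount_point

exists_after_exit = os.path.exists(path_used)
```

The `with` form guarantees that the cleanup step runs even if the body raises, which is the usual reason to prefer it over calling start and stop by hand.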

random_split

Split file streams in the dataset into two parts randomly and approximately by the percentage specified.

The first dataset returned contains approximately percentage of the total number of file references and the second dataset contains the remaining file references.

random_split(percentage, seed=None)

Parameters

Name	Description
percentage Required	float The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.
seed	int An optional seed to use for the random generator. default value: None

Returns

Type	Description
(FileDataset, FileDataset)	Returns a tuple of new FileDataset objects representing the two datasets after the split.
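
One way to read "approximately by the percentage specified" is that each file reference is assigned to the first part independently with the given probability. The sketch below models that reading in plain Python; it is an assumption about the behavior, not the azureml implementation, and `toy_random_split` is a hypothetical name.

```python
import random
from typing import List, Optional, Tuple


def toy_random_split(items: List[str], percentage: float,
                     seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Assign each item to the first part with probability `percentage`."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for item in items:
        (first if rng.random() < percentage else second).append(item)
    return first, second


files = [f"img_{i}.jpg" for i in range(1000)]
train, test = toy_random_split(files, 0.8, seed=42)
train_again, _ = toy_random_split(files, 0.8, seed=42)
```

Under this model the two parts never overlap and together cover every item, but the first part holds only roughly 80% of the items rather than exactly 800.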

skip

Skip file streams from the top of the dataset by the specified count.

skip(count)

Parameters

Name	Description
count Required	int The number of file streams to skip.

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing a dataset with file streams skipped.

take

Take a sample of file streams from top of the dataset by the specified count.

take(count)

Parameters

Name	Description
count Required	int The number of file streams to take.

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing the sampled dataset.

take_sample

Take a random sample of file streams in the dataset approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

Name	Description
probability Required	float The probability of a file stream being included in the sample.
seed	int An optional seed to use for the random generator. default value: None

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing the sampled dataset.

to_path

Get a list of file paths for each file stream defined by the dataset.

to_path()

Returns

Type	Description
list(str)	Returns an array of file paths.

Remarks

The file paths are relative paths for local files when the file streams are downloaded or mounted.

A common prefix will be removed from the file paths based on how data source was specified to create the dataset. For example:


   from azureml.core import Dataset, Datastore

   # `workspace` is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))
   print(dataset.to_path())

   # ['year-2018/1.jpg',
   #  'year-2018/2.jpg',
   #  'year-2019/1.jpg']

   dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/green-small/*.csv')

   print(dataset.to_path())
   # ['/green_tripdata_2013-08.csv']
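
The prefix removal in the first example can be modeled by stripping the leading folders of the pattern that contain no wildcard. This is a guess at the rule that reproduces the first output above, not the library's actual algorithm, and it does not model whether a leading slash is kept; `toy_relative_paths` is a hypothetical helper.

```python
from typing import List


def toy_relative_paths(full_paths: List[str], pattern: str) -> List[str]:
    """Strip the non-wildcard leading folders of `pattern` from each path."""
    prefix_parts = []
    for part in pattern.split("/")[:-1]:  # never treat the file name as prefix
        if "*" in part:
            break
        prefix_parts.append(part)
    prefix = "/".join(prefix_parts)
    prefix = prefix + "/" if prefix else ""
    return [p[len(prefix):] if p.startswith(prefix) else p for p in full_paths]


paths = toy_relative_paths(
    ["animals/dog/year-2018/1.jpg",
     "animals/dog/year-2018/2.jpg",
     "animals/dog/year-2019/1.jpg"],
    "animals/dog/year-*/*.jpg",
)
```

With the pattern `animals/dog/year-*/*.jpg`, the stripped prefix is `animals/dog/`, so `paths` holds the three `year-…` entries shown in the first example.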
