FileDataset class

Definition

Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily-evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until FileDataset is asked to deliver data.

A FileDataset is created using the from_files(path, validate=True) method of the FileDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

FileDataset()
Inheritance
builtins.object
FileDataset

Remarks

A FileDataset can be used as input to an experiment run. It can also be registered to a workspace under a specified name and retrieved by that name later.

FileDataset can be subsetted by invoking different subsetting methods available on this class. The result of subsetting is always a new FileDataset.

The actual data loading happens when FileDataset is asked to deliver the data into another storage mechanism (for example, files downloaded or mounted to a local path).
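The lazy, immutable behavior described above can be illustrated with a small plain-Python model. This is only a sketch of the idea, not the SDK's implementation: `LazyFileSet`, `deliver`, and `list_source` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LazyFileSet:
    """Toy stand-in for a lazily evaluated, immutable file collection."""
    source: Callable[[], List[str]]  # only called when data is delivered
    steps: Tuple[Callable[[List[str]], List[str]], ...] = ()

    def take(self, count: int) -> "LazyFileSet":
        # Return a new object; the original stays unchanged.
        return LazyFileSet(self.source, self.steps + (lambda p: p[:count],))

    def skip(self, count: int) -> "LazyFileSet":
        return LazyFileSet(self.source, self.steps + (lambda p: p[count:],))

    def deliver(self) -> List[str]:
        # The source is read and the recorded steps are applied only here.
        paths = self.source()
        for step in self.steps:
            paths = step(paths)
        return paths


calls = []


def list_source() -> List[str]:
    calls.append("listed")
    return ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]


full = LazyFileSet(list_source)
subset = full.skip(1).take(2)   # records two steps, reads nothing
listed_before = len(calls)      # still 0: the source has not been touched
result = subset.deliver()       # the source is read once, here
```

In this model, subsetting never mutates `full`; each call returns a new object that carries one more recorded step.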

Methods

download(target_path=None, overwrite=False)

Download file streams defined by the dataset as local files.

mount(mount_point=None)

Create a context manager for mounting file streams defined by the dataset as local files.

random_split(percentage, seed=None)

Split file streams in the dataset into two parts randomly and approximately by the percentage specified.

skip(count)

Skip file streams from the top of the dataset by the specified count.

take(count)

Take a sample of file streams from the top of the dataset by the specified count.

take_sample(probability, seed=None)

Take a random sample of file streams in the dataset approximately by the probability specified.

to_path()

Get a list of file paths for each file stream defined by the dataset.

download(target_path=None, overwrite=False)

Download file streams defined by the dataset as local files.

download(target_path=None, overwrite=False)

Parameters

target_path
str

The local directory to download the files to. If None, the data will be downloaded into a temporary directory.

overwrite
bool

Indicates whether to overwrite existing files. The default is False. Existing files will be overwritten if overwrite is set to True; otherwise an exception will be raised.

Returns

Returns an array of file paths for each file downloaded.

Return type

list[str]

Remarks

If target_path starts with a /, it is treated as an absolute path. Otherwise, it is treated as a path relative to the current working directory.
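The path and overwrite rules above can be sketched in plain Python. The helpers `resolve_target_path` and `check_overwrite` are hypothetical and are not part of the SDK; the exception type in particular is an assumption, since the documentation only says that an exception is raised.

```python
import os
import tempfile
from typing import Optional


def resolve_target_path(target_path: Optional[str]) -> str:
    """Hypothetical helper mirroring the documented target_path rules."""
    if target_path is None:
        # No path given: fall back to a fresh temporary directory.
        return tempfile.mkdtemp()
    if target_path.startswith("/"):
        # A leading slash means the path is already absolute.
        return target_path
    # Anything else is resolved against the current working directory.
    return os.path.join(os.getcwd(), target_path)


def check_overwrite(file_path: str, overwrite: bool = False) -> None:
    """Refuse to replace an existing file unless overwrite is True."""
    # FileExistsError is an assumption made for this sketch.
    if os.path.exists(file_path) and not overwrite:
        raise FileExistsError(f"{file_path} already exists")


absolute = resolve_target_path("/data/images")
relative = resolve_target_path("images")
```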

mount(mount_point=None)

Create a context manager for mounting file streams defined by the dataset as local files.

mount(mount_point=None)

Parameters

mount_point
str

The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find through the mount_point attribute of the returned MountContext.

Returns

Returns a context manager for managing the lifecycle of the mount.

Return type

MountContext

Remarks

A context manager is returned to manage the lifecycle of the mount. To mount, enter the context manager; to unmount, exit it.

Mount is only supported on Unix or Unix-like operating systems, and libfuse must be present. If you are running inside a Docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse.


       import os

       from azureml.core import Dataset, Datastore

       # workspace is assumed to be an existing azureml.core.Workspace object
       datastore = Datastore.get(workspace, 'workspaceblobstore')
       dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))

       with dataset.mount() as mount_context:
           # list top level mounted files and folders in the dataset
           print(os.listdir(mount_context.mount_point))

       # You can also use the start and stop methods
       mount_context = dataset.mount()
       mount_context.start()  # this will mount the file streams
       mount_context.stop()  # this will unmount the file streams

If mount_point starts with a /, it is treated as an absolute path. Otherwise, it is treated as a path relative to the current working directory.
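When the start and stop methods are used instead of a with block, nothing guarantees that stop runs if an error occurs in between. The sketch below shows the two equivalent lifecycles with a stand-in class; `ToyMountContext` is hypothetical, mounts nothing, and only tracks state so the pattern can be run anywhere.

```python
class ToyMountContext:
    """Hypothetical stand-in for a mount context; it mounts nothing."""

    def __init__(self, mount_point: str) -> None:
        self.mount_point = mount_point
        self.mounted = False

    def start(self) -> None:
        self.mounted = True

    def stop(self) -> None:
        self.mounted = False

    def __enter__(self) -> "ToyMountContext":
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.stop()
        return False  # do not swallow exceptions


ctx = ToyMountContext("/tmp/toy-mount")

# Lifecycle 1: the with block unmounts on exit, even if the body raises.
with ctx:
    inside_with = ctx.mounted
after_with = ctx.mounted

# Lifecycle 2: explicit start/stop, guarded by try/finally.
ctx.start()
try:
    inside_manual = ctx.mounted
finally:
    ctx.stop()
after_manual = ctx.mounted
```

The try/finally guard is what makes the manual form as safe as the with form.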

random_split(percentage, seed=None)

Split file streams in the dataset into two parts randomly and approximately by the percentage specified.

random_split(percentage, seed=None)

Parameters

percentage
float

The approximate percentage to split the dataset by, expressed as a number between 0.0 and 1.0.

seed
int

An optional seed to use for the random generator.

Returns

Returns a tuple of new FileDataset objects representing the two datasets after the split.

Return type

tuple[FileDataset, FileDataset]
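To see why the split is only approximate, consider a plain-Python model in which each file stream is assigned to one side independently. This is an assumption about how such a split can behave, not a description of the SDK's algorithm; `toy_random_split` is a hypothetical function operating on a list of paths.

```python
import random
from typing import List, Optional, Tuple


def toy_random_split(
    paths: List[str], percentage: float, seed: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """Assign each path to the first part with the given probability."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first: List[str] = []
    second: List[str] = []
    for path in paths:
        (first if rng.random() < percentage else second).append(path)
    return first, second


paths = [f"img_{i}.jpg" for i in range(1000)]
train, test = toy_random_split(paths, 0.8, seed=42)
train_again, test_again = toy_random_split(paths, 0.8, seed=42)
```

In this model the two parts never overlap and together contain every path, while the size of the first part lands near, but rarely exactly at, 80 percent of the total.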

skip(count)

Skip file streams from the top of the dataset by the specified count.

skip(count)

Parameters

count
int

The number of file streams to skip.

Returns

Returns a new FileDataset object representing a dataset with file streams skipped.

Return type

FileDataset

take(count)

Take a sample of file streams from the top of the dataset by the specified count.

take(count)

Parameters

count
int

The number of file streams to take.

Returns

Returns a new FileDataset object representing the sampled dataset.

Return type

FileDataset
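Because skip and take both count from the top of the dataset, they can be combined to walk through the file streams in fixed-size batches. The sketch below models this with list slicing; `toy_skip`, `toy_take`, and `batches` are hypothetical helpers that operate on a plain list of paths rather than on a FileDataset.

```python
from typing import Iterator, List


def toy_skip(paths: List[str], count: int) -> List[str]:
    """Drop the first count entries."""
    return paths[count:]


def toy_take(paths: List[str], count: int) -> List[str]:
    """Keep only the first count entries."""
    return paths[:count]


def batches(paths: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches by skipping an offset, then taking size."""
    if size <= 0:
        raise ValueError("size must be positive")
    offset = 0
    while offset < len(paths):
        yield toy_take(toy_skip(paths, offset), size)
        offset += size


paths = [f"file_{i}.csv" for i in range(5)]
head = toy_take(paths, 2)   # first two entries
tail = toy_skip(paths, 2)   # everything after the first two
pages = list(batches(paths, 2))
```

Taking and skipping the same count yields two complementary, deterministic parts, which is a useful contrast with the random subsetting methods.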

take_sample(probability, seed=None)

Take a random sample of file streams in the dataset approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

probability
float

The probability of a file stream being included in the sample.

seed
int

An optional seed to use for the random generator.

Returns

Returns a new FileDataset object representing the sampled dataset.

Return type

FileDataset
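The role of the probability and seed parameters can be illustrated with a short model that keeps each entry independently. This is a sketch under that assumption, not the SDK's sampling code; `toy_take_sample` is a hypothetical function working on a list of paths.

```python
import random
from typing import List, Optional


def toy_take_sample(
    paths: List[str], probability: float, seed: Optional[int] = None
) -> List[str]:
    """Keep each path independently with the given probability."""
    rng = random.Random(seed)
    return [p for p in paths if rng.random() < probability]


paths = [f"doc_{i}.txt" for i in range(2000)]
sample_a = toy_take_sample(paths, 0.1, seed=7)
sample_b = toy_take_sample(paths, 0.1, seed=7)   # same seed, same sample
nothing = toy_take_sample(paths, 0.0)
everything = toy_take_sample(paths, 1.0)
```

Under this model the sample size is close to probability times the number of entries but is not fixed, and reusing a seed reproduces the same sample.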

to_path()

Get a list of file paths for each file stream defined by the dataset.

to_path()

Returns

Returns an array of file paths.

Return type

list[str]

Remarks

The file paths are relative paths for local files when the file streams are downloaded or mounted.

A common prefix is removed from the file paths based on how the data source was specified when the dataset was created. For example:


   from azureml.core import Dataset, Datastore

   # workspace is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))
   print(dataset.to_path())
   # ['year-2018/1.jpg'
   #  'year-2018/2.jpg'
   #  'year-2019/1.jpg']

   dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/green-small/*.csv')
   print(dataset.to_path())
   # ['/green_tripdata_2013-08.csv']
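The prefix removal in the datastore example can be approximated in plain Python by dropping the folder segments that precede the first wildcard in the pattern. This is only a rough model of that one example, not the SDK's actual rule, and it does not reproduce the leading slash shown for the URL case; `strip_pattern_prefix` is a hypothetical helper.

```python
from typing import List


def strip_pattern_prefix(pattern: str, full_paths: List[str]) -> List[str]:
    """Drop the literal folder prefix that precedes the first wildcard."""
    literal_segments = []
    for segment in pattern.split("/"):
        if "*" in segment:
            break  # stop at the first segment containing a wildcard
        literal_segments.append(segment)
    prefix = "/".join(literal_segments)
    prefix = prefix + "/" if prefix else ""
    return [p[len(prefix):] if p.startswith(prefix) else p for p in full_paths]


stored = [
    "animals/dog/year-2018/1.jpg",
    "animals/dog/year-2018/2.jpg",
    "animals/dog/year-2019/1.jpg",
]
relative = strip_pattern_prefix("animals/dog/year-*/*.jpg", stored)
```

Here the literal prefix is `animals/dog/`, so the remaining paths start at the first folder matched by a wildcard.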