Question

Matzof asked · ramr-msft edited

AzureML Notebooks: how to access data from an experiment

Hi, I am new to Azure ML, and I have been trying to replicate the same structure presented in the MNIST tutorial, but I don't understand how to adapt it to my case.

I am running a Python script from the experiment, but I don't understand how that script can access data that currently sits in a folder in the cloud file system.
I have found many examples of accessing a single .csv file, but my data consists of many images.

From my understanding, I should first upload the folder to a datastore and then use Dataset.File.upload_directory to create a dataset containing that folder. Here is how I tried to do it:

 from azureml.core import Workspace, Datastore, Dataset

 ws = Workspace.from_config()

 # Create dataset from data directory (path_data: local folder containing the images)
 datastore = Datastore.get(ws, 'workspaceblobstore')
 target = (datastore, 'reduced_classification_data')  # (datastore, path) to upload into
 dataset = Dataset.File.upload_directory(path_data, target, pattern=None,
                                         overwrite=False, show_progress=True)

 file_dataset = dataset.register(workspace=ws, name='reduced_classification_dataset',
                                 description='reduced_classification_dataset',
                                 create_new_version=True)

But then I don't understand whether I can access this data like a normal file system from my Python script, or whether further steps are needed to be able to do that.

azure-machine-learning · azure-open-datasets

1 Answer

ramr-msft answered · ramr-msft edited

@Matzof Thanks for the question. Please use the code below to write data to a datastore.

 from azureml.core import Datastore

 # Get a datastore that is already registered in your workspace (ws)
 datastore = Datastore.get(ws, 'workspaceblobstore')
 datastore.upload(src_dir='./files/to/copy/...',
                  target_path='target/directory',
                  overwrite=True)

Datastore.upload only supports blob and file share datastores. For ADLS Gen2 uploads, you can try our new dataset upload API:


 from azureml.core import Workspace, Dataset, Datastore

 workspace = Workspace.from_config()
 datastore = Datastore.get(workspace, 'mayadlsgen2')  # an ADLS Gen2 datastore registered in the workspace
 Dataset.File.upload_directory(src_dir='./data', target=(datastore, 'data'))
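
Once the folder has been uploaded and registered as a FileDataset, a rough sketch of accessing it like a normal folder from the training script could look like this (assuming the SDK v1 FileDataset APIs; the dataset name matches the one registered in the question, and the local paths are placeholders):

 from azureml.core import Workspace, Dataset

 ws = Workspace.from_config()

 # Retrieve the registered FileDataset by the name used at registration time
 dataset = Dataset.get_by_name(ws, name='reduced_classification_dataset')

 # Option 1: download all files to the compute target and work with local paths
 local_paths = dataset.download(target_path='./data', overwrite=True)

 # Option 2: mount the dataset so the images appear as an ordinary folder
 mount_context = dataset.mount('./mounted_data')
 mount_context.start()
 # ... read the images from ./mounted_data with normal file I/O ...
 mount_context.stop()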


Pandas is integrated with fsspec, which provides a Pythonic interface to filesystems including S3, GCS, and Azure. You can check the source for Azure here: dask/adlfs: fsspec-compatible Azure Data Lake and Azure Blob Storage access (github.com). With this you can use normal filesystem operations like ls, glob, info, etc.
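
For instance, a minimal sketch of listing and opening blobs through fsspec/adlfs (the account, container, and credential values below are placeholders):

 import fsspec

 # adlfs registers the "az"/"abfs" protocols with fsspec
 fs = fsspec.filesystem("az",
                        account_name="mystorageaccount",       # placeholder account
                        account_key="<storage-account-key>")   # or sas_token; omit for public data

 # Normal filesystem-style operations on the container
 image_paths = fs.glob("mycontainer/images/*.png")
 with fs.open(image_paths[0], "rb") as f:
     raw_bytes = f.read()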

You can find an example (for reading data) here: azureml-examples/1.intro-to-dask.ipynb at main · Azure/azureml-examples (github.com)

Writing is essentially the same as reading: you switch the protocol to abfs (or az), slightly modify how you access the data, and provide credentials unless your blob container has public write access.

You can use the Azure ML Datastore to retrieve the credentials (see the example notebook linked above).
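
A rough sketch of that idea, assuming a blob datastore that exposes its credentials through account_name / account_key (the output path below is a placeholder):

 import pandas as pd
 from azureml.core import Workspace, Datastore

 ws = Workspace.from_config()
 datastore = Datastore.get(ws, 'workspaceblobstore')

 # Blob datastores expose the storage account name, container name, and key
 storage_options = {
     "account_name": datastore.account_name,
     "account_key": datastore.account_key,
 }

 # Write through the az/abfs protocol using those credentials
 df = pd.DataFrame({"a": [1, 2, 3]})
 df.to_csv(f"az://{datastore.container_name}/outputs/result.csv",
           index=False, storage_options=storage_options)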

