Question

0 Votes"
AdrianAnticoTEKsystemsInc-1526 asked romungi-MSFT answered

How can I transfer a csv file on an Azure Machine Learning compute instance directory back to the Datastore?

I posted a similar question last week and haven't received a response yet, so I'm posting another one now.

The code below is what I use to pull data from the Datastore into the compute instance and save it to my directory as a CSV. The data originates from a SCOPE script and is transferred from Cosmos to the Datastore via Azure Data Factory.

Once the data is in the directory as a CSV, I use R to read it into an RStudio session and run various tasks that create new data sets. I also save these new data sets to the compute instance directory as CSVs. These new data sets are the ones I'd like to push back to the Datastore so they can be transferred elsewhere via Azure Data Factory and later consumed by a Power BI app we're looking to build.

I tried using Designer and it ran for 4 days without completing before I cancelled the job and started looking for an alternative route. I don't know whether it would eventually have completed or whether it hit memory issues and simply hadn't failed yet. Pulling data into the compute instance from the Datastore takes only a few minutes, so I'm not sure why Designer would need multiple days to do the reverse.

I've looked through a lot of documentation and can't find anything that explains how to transfer data from the compute instance back to the Datastore, aside from Designer, which is either too slow or unable to handle the job.

This seems like an obvious use case and a major selling point of Azure Machine Learning, so I'm a bit dumbfounded that it's a challenge to figure out and that the documentation doesn't clearly show users how to achieve it, assuming it's even possible. If it isn't possible, I need to work out a whole new system to get my work done, and the Azure Machine Learning team should enable this functionality as soon as possible.

# Azure management
from azureml.core import Workspace, Dataset

# MetaData
subscription_id = '09b5fdb3-165d-4e2b-8ca0-34f998d176d5'
resource_group = 'xCloudData'
workspace_name = 'xCloudML'

# Connect to the existing workspace
workspace = Workspace(subscription_id, resource_group, workspace_name)

# 1. Retention_Engagement_CombinedData
dataset = Dataset.get_by_name(workspace, name='retention-engagement-combineddata')

# Save data to file
df = dataset.to_pandas_dataframe()
df.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/v-aantico1/code/RetentionEngagement_CombinedData.csv')

# 2. TitleNameJoin
dataset = Dataset.get_by_name(workspace, name='TitleForJoiningInR')

# Save data to file
df = dataset.to_pandas_dataframe()
df.to_csv('/mnt/batch/tasks/shared/LS_root/mounts/clusters/v-aantico1/code/TitleNameJoin.csv')
Tags: azure-machine-learning


1 Answer

1 Vote
romungi-MSFT answered

@AdrianAnticoTEKsystemsInc-1526 Have you tried the following to upload data to your datastore?

    from azureml.core import Workspace

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()

    datastore.upload(src_dir='./data',
                     target_path='datasets/',
                     overwrite=True)

I think datastore.upload() should work for you to upload the required data files from your compute instance to the datastore.
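
If you only need to push a couple of specific files rather than a whole directory, upload_files() takes an explicit list of paths. A minimal sketch, assuming the two CSVs from your question sit in the current working directory (note that both upload() and upload_files() exist only on blob and file share datastores):

    from azureml.core import Workspace

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()

    # upload_files() takes a list of individual file paths
    datastore.upload_files(
        files=['./RetentionEngagement_CombinedData.csv', './TitleNameJoin.csv'],
        target_path='datasets/',
        overwrite=True,
        show_progress=True)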




No luck yet...

I set up the code as you showed. When I created the datastore object and ran the code you provided, I received this error:
"AttributeError: 'AzureDataLakeDatastore' object has no attribute 'upload'."
Should I be using a different method for defining the datastore object?
I proceeded to look through the docs:

I clicked the link you provided (https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#azureml_core_datastore_Datastore_register_azure_blob_container) and it took me to the AzureBlobStorage class page.

At the top, it said the AzureBlobStorage class can't be used directly and pointed me to the AbstractAzureStorageDatastore class page (https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.azure_storage_datastore.abstractazurestoragedatastore?view=azure-ml-py#upload-src-dir--target-path-none--overwrite-false--show-progress-true-). That page does list an upload method, but I couldn't get it to work: "NotImplementedError".

At the top of the AbstractAzureStorageDatastore page, it said that class can't be used directly either and pointed me back to the Datastore class page (https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore.datastore?view=azure-ml-py#azureml_core_datastore_Datastore_register_azure_blob_container), where no upload method is available at all.

0 Votes
AdrianAnticoTEKsystemsInc-1526 ·

We figured it out. We had to switch the default datastore to an Azure Blob Storage.
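
For anyone who hits the same wall, the fix was a one-liner. A minimal sketch, assuming the workspace still has its built-in blob datastore registered under the name 'workspaceblobstore':

    from azureml.core import Workspace

    ws = Workspace.from_config()
    # point the workspace default at the built-in blob datastore
    ws.set_default_datastore('workspaceblobstore')
    # should now print an AzureBlobDatastore, so upload() works
    print(ws.get_default_datastore())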

Thanks for your help with this!

1 Vote
romungi-MSFT ·

The default datastore of an ML workspace is a storage account, and the method I mentioned above will upload the files to this default storage account of the workspace. Are you trying to upload the files to an Azure Data Lake datastore?

In that case, the data lake datastore does not support the upload method, as documented here:

The AzureDataLakeGen2 class does not provide an upload method; the recommended way to upload data to AzureDataLakeGen2 datastores is via Dataset upload. More details can be found at: https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets

So, the datastore upload method should be used against the storage container itself; the same container that was used when registering the data lake datastore will then reflect the updated files.
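
For example, something along these lines (a sketch; 'mydatalake' is a hypothetical name for your registered data lake datastore, and Dataset.File.upload_directory() requires a recent azureml-core release):

    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()
    # fetch the registered data lake datastore by name (hypothetical name)
    adls_datastore = Datastore.get(ws, 'mydatalake')

    # upload a local directory via the Dataset API instead of datastore.upload()
    Dataset.File.upload_directory(
        src_dir='./data',
        target=(adls_datastore, 'datasets/'),
        overwrite=True)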


0 Votes

@romungi-MSFT

This is what is returned when I run get_default_datastore(). Is there another method to get a different datastore?

datastore = ws.get_default_datastore()
print(datastore)
<azureml.data.azure_data_lake_datastore.AzureDataLakeDatastore object at 0x7ff489b9f588>
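
In case it helps anyone else, the registered datastores can also be listed and fetched by name rather than going through the default (a sketch; 'workspaceblobstore' is the blob datastore every workspace gets at creation, assuming it hasn't been removed):

    from azureml.core import Datastore

    # list every datastore registered in the workspace along with its type
    for name, ds in ws.datastores.items():
        print(name, type(ds).__name__)

    # fetch a specific datastore by name instead of the default
    blob_datastore = Datastore.get(ws, 'workspaceblobstore')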

0 Votes