Can we append data to an existing CSV file stored in Azure Blob Storage?

Senthil Murugan RAMACHANDRAN 21 Reputation points
2021-03-11T04:46:29+00:00

I have a machine learning model deployed through the Azure Machine Learning designer. I need to retrain it every day with new data through Python code. I need to keep the existing CSV data in Blob Storage and also add more rows to that CSV before retraining. If I retrain the model with only the new data, the old data is lost, so I need to retrain by appending the new data to the existing data. Is there any way to do this through Python?

I have also researched append blobs, but they add data only at the end of the blob. The documentation mentions that we cannot update or add to an existing blob.

Any help is appreciated. Thanks a lot.
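(For reference: adding only at the end of an append blob is actually sufficient for a CSV, since new training records are just new rows. A minimal sketch using the azure-storage-blob v12 `BlobClient`; the helper names and the idea of encoding rows first are illustrative, not from any official sample:)

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Encode a list of rows as CSV bytes, ready to append to a blob."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def append_rows(blob_client, rows):
    """Append new CSV rows to an append blob, creating it on first use.

    `blob_client` is an azure.storage.blob.BlobClient pointing at an
    append blob (e.g. obtained via BlobServiceClient.get_blob_client).
    """
    if not blob_client.exists():
        blob_client.create_append_blob()
    blob_client.append_block(rows_to_csv_bytes(rows))
```

Note this only works if the blob was created as an append blob; an existing block blob cannot be converted in place.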

Azure Machine Learning
An Azure machine learning service for building and deploying models.

1 answer

  1. romungi-MSFT 41,866 Reputation points Microsoft Employee
    2021-03-11T10:11:06.027+00:00

    @Senthil Murugan RAMACHANDRAN The best practice with Azure Machine Learning is to register your dataset and version it when you want to retrain and create a new model. You can in fact keep multiple CSV files in your storage and create a single tabular dataset from all of them. For example:

    Here we are using files from a blob container that were uploaded at different times, and registering the dataset with versioning. If you would like to add more data, simply upload additional CSV files to the web path and register a new version, or use the older versions again if required.

    # create a TabularDataset from Titanic training data
    from azureml.core import Dataset

    web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
                 'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
    titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

    # register a new version of titanic_ds
    # (`workspace` is your Workspace object, e.g. Workspace.from_config())
    titanic_ds = titanic_ds.register(workspace=workspace,
                                     name='titanic_ds',
                                     description='new titanic training data',
                                     create_new_version=True)
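
    To retrain against any snapshot later, you can fetch a registered version back by name. A minimal sketch assuming the azureml-core SDK and a workspace `config.json` in the working directory (the version number shown is illustrative):

    ```python
    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()  # assumes config.json for your workspace

    # latest version by default; pass an explicit version for an older snapshot
    titanic_latest = Dataset.get_by_name(ws, name='titanic_ds')
    titanic_v1 = Dataset.get_by_name(ws, name='titanic_ds', version=1)

    # materialize for training
    df = titanic_latest.to_pandas_dataframe()
    ```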