question

MEZIANEYani-9720 asked Sumarigo-MSFT edited

Convert File Dataset into a Dataframe to use in a pipeline

Hello,

I would like to convert a file dataset into a dataframe using a Python script so I can use the data in a pipeline. I need to use the file dataset because I want to train my model on the files rather than on the table.

Thank you!

azure-data-factory, azure-machine-learning

romungi-MSFT answered romungi-MSFT commented

@MEZIANEYani-9720 I think you could try this to use the FileDataset as a pandas dataframe: download it and use it for your experiment's training.

 from azureml.core import Workspace, Dataset
 from azureml.opendatasets import MNIST
 import pandas as pd
 import os

 # Load the workspace from the config.json in the current directory
 ws = Workspace.from_config()

 data_folder = os.path.join(os.getcwd(), 'data')
 os.makedirs(data_folder, exist_ok=True)

 # Download the dataset
 mnist_file_dataset = MNIST.get_file_dataset()
 mnist_file_dataset.download(data_folder, overwrite=True)

 # List the downloaded file paths in a dataframe
 df = pd.DataFrame(mnist_file_dataset.to_path())
 print(df)

 # Register the dataset for training
 mnist_file_dataset = mnist_file_dataset.register(workspace=ws,
                                                  name='mnist_opendataset',
                                                  description='training and test dataset',
                                                  create_new_version=True)


Thank you very much for your response.
Before training the model, I have some data modifications to do in the pipeline (I am using the designer), and with this approach I cannot pre-process the data within the .csv files. Is there a way to do this?

Additionally, even just for training in the designer I must select a label column, and it is not letting me do this, as it doesn't recognize the header within the files. Thank you very much for your help!


@MEZIANEYani-9720 Do you mean editing the file contents after upload? I don't think that is possible from the designer.
You can, however, select the header in your file using the following options for tabular datasets when you upload and register them:

(screenshot: header options when registering a tabular dataset)
MEZIANEYani-9720 answered romungi-MSFT commented

My aim is to run a pipeline (pre-process data and tune model hyperparameters) that I already have in the designer, using as input not each row of a table (as happens with a tabular dataset) but each CSV file, where a file represents one object (its information, with many rows), since the random selection per frame is amplifying the performance of the model.

I have the data as both tabular and file datasets. I have managed to get the path of each CSV file, but I cannot read them as part of a new dataframe. I have the data in a datastore and a dataset, so I don't know if I should store the data elsewhere to accomplish this (I have not been working long with Azure, so I am not acquainted with all the storage possibilities and the interactions between these and the ML studio).

I managed to do this in plain Python with the following code:

 listOfFile = os.listdir(path)
 for file in listOfFile:
     new_well = pd.read_csv(os.path.join(path, file))


And in Azure this is as far as I have gotten without result:

 ds = Dataset.get_by_name(ws, name='well files')
 ds.download(data_folder, overwrite=True)
 df = pd.DataFrame(ds.to_path())
 df = dirr + df
 files = pd.DataFrame(df)
 well = map(pd.read_csv, files)



but I cannot use this output `well` in the designer pipeline because it is of class `map`.
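A `map` object is lazy, and when it is built from a DataFrame it iterates over the column labels rather than the file paths, so nothing is actually read. A minimal pandas-only sketch of materializing the per-file reads into one concrete dataframe (the folder and file names here are hypothetical stand-ins for the files fetched with `ds.download()`):

```python
import os
import tempfile

import pandas as pd

# Build a throwaway folder of CSV files to stand in for the downloaded
# file dataset (hypothetical data and names).
tmp = tempfile.mkdtemp()
for name in ("well_a.csv", "well_b.csv"):
    pd.DataFrame({"depth": [1, 2, 3], "value": [0.1, 0.2, 0.3]}).to_csv(
        os.path.join(tmp, name), index=False
    )

# Read every CSV in the folder into one dataframe, keeping the source
# file name so rows can still be grouped back into their original files.
paths = [os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith(".csv")]
frames = [pd.read_csv(p).assign(source_file=os.path.basename(p)) for p in paths]
well = pd.concat(frames, ignore_index=True)
print(well.shape)
```

The extra `source_file` column preserves the file-to-row mapping after concatenation, which matters later if rows must be selected per document rather than per row.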

Thank you very much for your help. It is greatly appreciated as I really have no clue whatsoever on how to proceed or solve this.


For the part where you would like to read a CSV file as above: this should work with a tabular dataset. You can pass each file location to the tabular dataset and read it.

 from azureml.core import Dataset
    
 datastore = ws.get_default_datastore()
 datastore.upload_files(files = ['./local_file.csv'],
                        target_path = 'train-dataset/tabular/',
                        overwrite = True,
                        show_progress = True)
    
 dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/local_file.csv')])
    
 # preview the first 3 rows of the dataset
 dataset.take(3).to_pandas_dataframe()
MEZIANEYani-9720 answered

@romungi-MSFT

Is there a way to do this with multiple .csv documents?
I have a folder full of CSV files I need to read. Is there a way to give the path of the folder and have the program read all of the CSV files within it?
There are a lot, so it is not really feasible to do them one by one.


MEZIANEYani-9720 answered

Ok I managed with this very simple line:

 tabular_dataset_3 = Dataset.Tabular.from_delimited_files(path=(datastore, 'weather/**/*.csv'))

However, I'm afraid this will not help me accomplish my objective, as all the files are now in the same tabular dataset, and I need the training of the model to be done considering the files rather than all the rows: random selection would happen per row and not per document as I desired. I need to pre-process the data and split the training and test datasets based on the CSV documents, not on a table containing all the data points.
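A file-wise split like the one described can be sketched at the pandas level: shuffle the file names, not the rows, so every row of a given file lands wholly in the training set or wholly in the test set. This is a minimal sketch with hypothetical file names and columns, assuming each file has already been read into its own dataframe:

```python
import random

import pandas as pd

# Hypothetical per-file dataframes keyed by file name; in practice these
# would come from pd.read_csv over each downloaded CSV file.
frames = {
    f"well_{i}.csv": pd.DataFrame({"depth": range(5), "value": [i] * 5})
    for i in range(10)
}

# Split at the file level: shuffle the file names, then slice the list.
names = sorted(frames)
random.Random(42).shuffle(names)
cut = int(0.8 * len(names))
train_files, test_files = names[:cut], names[cut:]

# Concatenate each group into a single dataframe for training/testing.
train = pd.concat([frames[n] for n in train_files], ignore_index=True)
test = pd.concat([frames[n] for n in test_files], ignore_index=True)
print(len(train_files), len(test_files))  # 8 2
```

Seeding the shuffle (`random.Random(42)`) keeps the split reproducible across pipeline runs; since documents are shuffled instead of rows, no file contributes to both sets.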

Thank you for your help!
