JoeDuncan-2610 asked hernandoZ-8172 commented

Preparing ML object detection dataset for deep learning in PyTorch or similar

The intent of what I'm trying to achieve is:

  1. Export data labelling project as a Dataset

  2. Consume the Dataset in a notebook (converting to a Pandas dataframe)

  3. Perform a custom train / test split that maintains particular file groupings

  4. Register the resulting training and testing dataframes as Datasets

  5. Use these Datasets to train and test a custom object detection model


I need help in preparing the data for that final step. I'm familiar with different deep learning libraries, but have never implemented them in the Azure environment before. I've managed to complete 1 to 4. For step 4, I ended up writing the data to csv files and uploading these to the datastore.

 # define path for training data file and create new delimited file
 train_path = './data/train.csv'
 train_dataframe.to_csv(train_path, sep = ';', index = False)
    
 # repeat for testing
 test_path = './data/test.csv'
 test_dataframe.to_csv(test_path, sep = ';', index = False)
    
 # get the datastore to upload prepared data
 datastore = Datastore.get(ws, datastore_name='learningdata')
    
 # upload the local files from src_dir to the target_path in datastore
 datastore.upload(src_dir='data', target_path='train-test', overwrite=True)
    
 # create and register training dataset from datastore files
 training_ds = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-test/train.csv')], separator=';')
 training_ds = training_ds.register(workspace=ws, name = 'train', description = 'training dataset sampled from labelled data', create_new_version=True)
    
 # create and register testing dataset from datastore files
 testing_ds = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-test/test.csv')], separator=';')
 testing_ds = testing_ds.register(workspace=ws, name = 'test', description = 'testing dataset sampled from labelled data', create_new_version=True)

For step 5, I intended to use to_torchvision() to convert the Dataset into a Torchvision dataset. This doesn't work; I receive the following error:

 UserErrorException: UserErrorException:
  Message: Cannot perform torchvision conversion on dataset without labeled columns defined
  InnerException None
  ErrorResponse 
 {
     "error": {
         "code": "UserError",
         "message": "Cannot perform torchvision conversion on dataset without labeled columns defined"
     }
 }

I suspect that the issue has to do with DataTypes. The original Dataset (exported from the data labelling project) has the DataTypes displayed below. By comparison, all column types in the train and test Datasets are parsed as strings. From my understanding, there's no way to convert to these data types.

  • image_url = Stream

  • label = List

  • label_confidence = List
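
A minimal sketch of recovering the list columns after the CSV round trip, assuming the cells were written as the plain Python text of the original lists (which is what `to_csv` does for object columns). The cell content and field names below are hypothetical. This only restores plain Python lists; it does not restore the Stream type or the labeled-column metadata that to_torchvision() checks for.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified Python list (as written to CSV) back into a list."""
    if isinstance(cell, list):
        return cell  # already parsed, nothing to do
    value = ast.literal_eval(cell)  # safe literal parsing, no code execution
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


# Hypothetical cell content, in the shape a CSV round trip might produce.
raw = "[{'label': 'car', 'topX': 0.1, 'topY': 0.2, 'bottomX': 0.5, 'bottomY': 0.6}]"
labels = parse_list_cell(raw)
print(labels[0]["label"])
```

With pandas, the same function could be applied per column, for example `df["label"].apply(parse_list_cell)`.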

Any advice on how to prepare this dataset for use in PyTorch or recommendation for an alternative approach would be greatly appreciated.




Update as per comment below:

  • I'm currently mounting the dataframe rather than downloading it due to data size.

  • I can view images from the originally mounted Dataset, but when loading the newly registered training Dataset I can't access images as '/tmp/tmpog809x4v/[...].jpg' is no longer relevant.

  • I can't perform random split because I'm using clustered sampling.

  • I'm working on creating a class object to define the dataset, but I cannot currently create the PIL Image object as required by PyTorch (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset)

azure-machine-learning

JoeDuncan-2610 answered ramr-msft commented

I modified the methodology and was able to successfully resolve this issue as follows:

  1. Export data labelling project as Dataset

  2. Consume the Dataset in the notebook by creating both a PyTorch dataset and a Pandas dataframe

  3. Use the Pandas dataframe to determine indices for the train / test split based on required sampling

  4. Use the indices as an input to torch.utils.data.Subset() to split the PyTorch dataset into train and test
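
Steps 3 and 4 can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: the group keys (`siteA`, ...) and the helper `grouped_split_indices` are hypothetical, and the split keeps every row of a group on the same side. Note that `test_fraction` here applies to the number of groups, not the number of rows.

```python
import random
from collections import defaultdict


def grouped_split_indices(groups, test_fraction=0.2, seed=0):
    """Split row positions into train/test so that no group spans both sets.

    groups: one group key per dataset row, in dataset order.
    """
    rows_by_group = defaultdict(list)
    for position, key in enumerate(groups):
        rows_by_group[key].append(position)

    keys = sorted(rows_by_group)  # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    # Hold out a fraction of the groups (at least one) for testing.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train_idx = [i for k in keys if k not in test_keys for i in rows_by_group[k]]
    test_idx = [i for k in test_keys for i in rows_by_group[k]]
    return sorted(train_idx), sorted(test_idx)


# Hypothetical grouping column, one entry per row of the dataframe.
groups = ["siteA", "siteA", "siteB", "siteC", "siteC", "siteD", "siteE"]
train_idx, test_idx = grouped_split_indices(groups, test_fraction=0.4)
```

The two index lists can then be passed to `torch.utils.data.Subset(dataset, train_idx)` and `torch.utils.data.Subset(dataset, test_idx)`. This only works if the dataframe rows are in the same order as the items of the PyTorch dataset.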


@JoeDuncan-2610 Great, thanks for sharing the update.

ramr-msft answered hernandoZ-8172 commented

@JoeDuncan-2610 Thanks for the great question. This is an end-to-end image detection scenario that uses training/test datasets created from a Data Labeling project. As you are probably aware, you could also 'solve' this problem with Custom Vision, but I'd like to show how a vision problem that Custom Vision may not handle well enough can be handled in Azure ML, with full control of the underlying ML algorithms and the power of Data Labeling.

The best practice is to get back to the images referenced by the dataset, i.e. to use the Datastore / StreamInfo information in the DataFrame extracted from the TabularDataset, to prepare the data for model training.

The code below, which I put together, is probably the way to retrieve the original image assets from a labeled TabularDataset.

 # azureml-core of version 1.0.72 or higher is required
 # azureml-contrib-dataset of version 1.0.72 or higher is required
    
    
 from azureml.core import Workspace, Dataset, Datastore
 import azureml.contrib.dataset
 import azureml.dataprep.native
     
 subscription_id = '_set_it_to_yours_'
 resource_group = '_set_it_to_yours_'
 workspace_name = '_set_it_to_yours_'
     
 workspace = Workspace(subscription_id, resource_group, workspace_name)
     
 # get dataset and extract as a DataFrame
 ds = Dataset.get_by_name(workspace, name='_set_it_to_yours_')
 df = ds.to_pandas_dataframe()
     
 # download images
 index = 0
 datastore = None
 while index < len(df):
     # image_url is an azureml.dataprep.native.StreamInfo object; convert it to a dict with to_pod()
     si = df.loc[index].image_url.to_pod()
     if index == 0:
         # retrieve datastore based on metadata from first row
         # assuming all images come from the same store
         # since they come from a single dataset
         datastore = Datastore.get(workspace, si['arguments']['datastoreName'])
     # download image locally
     datastore.download(target_path='.',prefix=si['resourceIdentifier'],overwrite=True,show_progress=True)
     index += 1
     
 # create training, test sets
 [training, test] = ds.random_split(0.8)

Build the model based on the image assets and labels.
From there, build your train_x, train_y and test_x, test_y datasets.
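
As one illustration of building the label side of those datasets for a torchvision detection model: torchvision expects a target dict with `boxes` as `[xmin, ymin, xmax, ymax]` in pixels and integer `labels`. The sketch below assumes each exported label is a dict with `label`, `topX`, `topY`, `bottomX` and `bottomY`, with coordinates normalised to the range 0 to 1. That field layout is an assumption to check against your own export, and the helper names are hypothetical.

```python
def to_pixel_box(label, image_width, image_height):
    """Convert one normalised box to [xmin, ymin, xmax, ymax] in pixels."""
    return [
        label["topX"] * image_width,
        label["topY"] * image_height,
        label["bottomX"] * image_width,
        label["bottomY"] * image_height,
    ]


def build_target(labels, class_to_id, image_width, image_height):
    """Build plain-Python 'boxes' and 'labels' lists for one image."""
    boxes = [to_pixel_box(item, image_width, image_height) for item in labels]
    class_ids = [class_to_id[item["label"]] for item in labels]
    return {"boxes": boxes, "labels": class_ids}


# Hypothetical label for a 640x480 image; id 0 is left free for background.
example_labels = [
    {"label": "car", "topX": 0.25, "topY": 0.5, "bottomX": 0.75, "bottomY": 1.0}
]
target = build_target(example_labels, {"car": 1}, image_width=640, image_height=480)
print(target)
```

Inside a PyTorch dataset's `__getitem__`, the two lists would be wrapped with `torch.tensor(..., dtype=torch.float32)` and `torch.tensor(..., dtype=torch.int64)` respectively.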


We have checked in a sample notebook on labeled datasets to the public GitHub repo. You can find it here:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/work-with-data/datasets-tutorial/labeled-datasets/labeled-datasets.ipynb




Thanks @ramr-msft. I'm familiar with that notebook. A few clarifications from me:

  • I'm currently mounting the dataframe rather than downloading it due to data size.

  • I can view images from the originally mounted Dataset, but when loading the newly registered training Dataset I can't access images as '/tmp/tmpog809x4v/[...].jpg' is no longer relevant.

  • I can't perform random split because I'm using clustered sampling.

  • I'm working on creating a class object to define the dataset, but I cannot currently create the PIL Image object as required by PyTorch (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset)

Can I point imread() directly to an image without downloading it if I can specify the datastore and the relative path?
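
One way to avoid depending on a stale temporary mount path is to store each image's path relative to the mount root and join it with the current mount point at load time. A minimal sketch, assuming the dataset is mounted as files and the current mount point is available as a string; all paths and helper names below are hypothetical:

```python
from pathlib import PurePosixPath


def to_relative(stale_path, stale_mount):
    """Strip the old mount prefix, leaving a path relative to the mount root."""
    return PurePosixPath(stale_path).relative_to(stale_mount).as_posix()


def resolve(relative_path, current_mount):
    """Join a stored relative path with the mount point of the current session."""
    return (PurePosixPath(current_mount) / relative_path).as_posix()


# Hypothetical paths: the old mount directory no longer exists in a new session.
stale = "/tmp/tmp_old_mount/images/cam1/0001.jpg"
rel = to_relative(stale, "/tmp/tmp_old_mount")
print(resolve(rel, "/tmp/tmp_new_mount"))
```

Once resolved against a live mount, the path behaves like any local file path, so it can be passed to an image reader such as `PIL.Image.open`.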


@JoeDuncan-2610 Thanks for the update.


Hi, the sample notebook repo is no longer available. Can you please share the new location?

Thanks
