acedola asked:

Creating datasets in Azure Machine Learning service from more than 100 paths

Hi,

I need to create a dataset in Azure Machine Learning service from an Azure Data Lake Storage Gen2 account registered as a datastore. The data in the lake consists of thousands of Avro files stored by Event Hub Capture following the pattern [EventHub]/[Partition]/[YYYY]/[MM]/[DD]/[HH]/[mm]/[ss], so there is one path per file.

The datasets documentation recommends "... creating dataset referencing less than 100 paths in datastores for optimal performance."

What would be the alternative/recommended approach for my application? Streaming data is continuously captured by the Event Hub.

Thanks


Tags: azure-machine-learning, azure-data-lake-storage

GiftA-MSFT commented: Hi, thanks for reaching out. I am working with the product team internally to determine if there is a workaround. I will share updates as soon as I have more information. Thanks.


@GiftA-MSFT Thank you so much, I'll be waiting for news ;)

Ariel

1 Answer

MayHu-3433 answered:

Hi,

You can create the dataset with a globbing pattern:

ds = Dataset.File.from_files(path=(datastore, '[EventHub]/[Partition]/**'))

The mount time should be less than 1 min.
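To see why a single recursive pattern scales better than enumerating one path per file, here is a minimal standalone sketch using Python's glob module. The directory layout and file names are made up to mimic the Event Hub Capture pattern from the question; this does not touch Azure, it only illustrates the matching behavior of `**`:

```python
import glob
import os
import tempfile

# Build a miniature Capture-style layout:
# myhub/[Partition]/[YYYY]/[MM]/[DD]/[HH]/[mm]/[ss]/capture.avro
root = tempfile.mkdtemp()
for partition in ("0", "1"):
    for second in ("00", "15", "30", "45"):
        d = os.path.join(root, "myhub", partition,
                         "2021", "05", "01", "10", "00", second)
        os.makedirs(d)
        open(os.path.join(d, "capture.avro"), "w").close()

# One recursive pattern covers every leaf folder, so the dataset
# definition needs a single path entry no matter how many
# per-second folders Capture keeps adding.
pattern = os.path.join(root, "myhub", "**", "*.avro")
matches = glob.glob(pattern, recursive=True)
print(len(matches))  # → 8 (2 partitions x 4 second-level folders)
```

The azureml glob pattern in the answer works the same way: the dataset stores the single pattern, not the expanded file list, so the "fewer than 100 paths" guidance applies to the one pattern rather than to the thousands of files it matches.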


@MayHu-3433 Thank you for your response. Yes, I'm already using this to create my dataset. The point is to find the most performant method as the data volume grows continuously.

Ariel