Is it possible to process PDF files by full path in Blob Storage, without downloading them locally, using Python?

Su Li 21 Reputation points
2020-08-20T16:02:52.09+00:00

Hello,

I am trying to feed multiple PDFs to an OCR program written in Python. During local development, the PDFs are located in a local directory where they can be processed, but I wasn't able to figure out a path-like filesystem in Blob Storage. Technically speaking, I know there is no such filesystem in Blob, but I need such a path to pass to the OCR program. Is there any way to achieve this? Thanks in advance.

I also asked a question on SO here, but none of the solutions actually work. Based on my search, I know there is a class called CloudBlobDirectory in C# which does what I want. Is there a similar way of doing this in the Python V12 SDK? I also checked the source code samples here, but to no avail.
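For reference, the closest analogue to CloudBlobDirectory in the Python V12 SDK is listing blobs by prefix and delimiter (e.g. `ContainerClient.walk_blobs(name_starts_with=..., delimiter='/')`, which groups blobs into virtual directories server-side). The grouping it performs can be sketched client-side like this (the blob names below are made up for illustration):

```python
def virtual_dirs(blob_names, prefix=""):
    """Group flat blob names under `prefix` into their immediate
    'subdirectories', mimicking what CloudBlobDirectory exposes in .NET."""
    dirs = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            # keep only the first path segment after the prefix
            dirs.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(dirs)

# e.g. flat names as returned by ContainerClient.list_blobs()
names = ["invoices/2020/a.pdf", "invoices/2020/b.pdf",
         "invoices/2019/c.pdf", "readme.txt"]
print(virtual_dirs(names, "invoices/"))  # ['invoices/2019/', 'invoices/2020/']
```

In practice you would let `walk_blobs()` do this for you, but the sketch shows that "directories" in Blob Storage are just shared name prefixes.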

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

1 answer

  1. Sumarigo-MSFT 44,251 Reputation points Microsoft Employee
    2020-10-06T09:42:21.237+00:00

    @uLi-9330 Copying the offline discussion here, which could benefit other community members who are reading this thread!

    If you are on Linux, then you can consider using BlobFuse. It still downloads the files, but your program can then use the blobs as files in a regular filesystem.
    If you want to stream the blob instead, see https://github.com/Azure/azure-sdk-for-python/blob/0e5934f5c30f6d535f203aabcd203cd74cdffb80/sdk/storage/azure-storage-file-datalake/samples/datalake_samples_upload_download.py#L84 — the downloaded file is a StorageStreamDownloader object and you can call readinto() on it. The sample does not manipulate the stream, however; it simply stores it in a txt file.
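    To sketch the streaming approach with the v12 blob SDK (the container name `pdfs`, the prefix `scans/`, and the `ocr()` call below are hypothetical placeholders): `ContainerClient.download_blob()` returns a `StorageStreamDownloader`, and its `readinto()` can write into an in-memory `io.BytesIO` instead of a file on disk, so the OCR code receives a file-like object rather than a local path:

    ```python
    import io

    def blob_to_stream(container_client, blob_name):
        """Download one blob into an in-memory stream instead of a local file.

        `container_client` is an azure.storage.blob ContainerClient (v12 SDK);
        its download_blob() returns a StorageStreamDownloader, whose readinto()
        writes the blob's bytes into any writable binary stream.
        """
        stream = io.BytesIO()
        container_client.download_blob(blob_name).readinto(stream)
        stream.seek(0)  # rewind so the consumer reads from the start
        return stream

    # Hypothetical usage -- pass file-like objects to the OCR code, not paths:
    # from azure.storage.blob import BlobServiceClient
    # service = BlobServiceClient.from_connection_string(conn_str)
    # container = service.get_container_client("pdfs")
    # for blob in container.list_blobs(name_starts_with="scans/"):
    #     ocr(blob_to_stream(container, blob.name))
    ```

    This works as long as the OCR library accepts a file-like object; if it insists on a filesystem path, BlobFuse is the simpler option.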


    Please don't forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
