question

Samyak-3746 asked · vipullag-MSFT edited

How to read multiple parquet.gzip files incrementally into pandas from Azure blob storage?

Hi all,

I want to read multiple parquet.gzip files incrementally from my blob storage into a pandas DataFrame, manipulate them, and store them back using Python. How can this be done efficiently?
Note: I tried reading them directly with pd.read_parquet, but it doesn't seem to work that way with Azure.
Could you help me out with a code snippet?

azure-data-factory · azure-blob-storage

Hello @Samyak-3746,

Following up to see if the suggestion below was helpful. If you have any further queries, do let us know.


  • Please don't forget to click Accept Answer or Up-Vote whenever the information provided helps you.


1 Answer

PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT commented

Hello @Samyak-3746,

Thanks for the question and using MS Q&A platform.

When I tried to read multiple parquet.gzip files using pandas, I got this error message: OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

[201279-image.png: screenshot of the error]

After a bit of research, I found this document - Azure Databricks - Zip Files - which explains how to unzip the files and then load them directly.

You can use the Azure Databricks %sh magic command to unzip the file and then read it with pandas, as shown below:

[201289-image.png: screenshot of the unzip-and-read notebook cells]

Hope this helps. Please let us know if you have any further queries.


  • Please don't forget to click Accept Answer or Up-Vote whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how

  • Want a reminder to come back and check responses? Here is how to subscribe to a notification

  • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators



Hi @PRADEEPCHEEKATLA-MSFT
Thanks for the response. But what I'm doing is uploading a Python script along with a parquet dataset to blob storage, then triggering that script from ADF, using the Batch service for computation. When I trigger the script and it reads the parquet file, it requires pyarrow. This is where I'm stuck and need help. Also, if in the future I need to install certain dependencies, I'd like to know how that is done.
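One way to handle a missing package such as pyarrow on a Batch node is to have the script bootstrap its own dependencies before importing them. A hedged sketch (`ensure` is a hypothetical helper, not a Batch or ADF API); alternatively, a Batch pool start task can run `pip install pandas pyarrow` once per node before any task executes:

```python
import importlib
import subprocess
import sys


def ensure(package, module=None):
    """Import a module (defaults to the package name), pip-installing it first if absent."""
    module = module or package
    try:
        return importlib.import_module(module)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module)


# Usage at the top of the script run by the Batch task:
# pd = ensure("pandas")
# pa = ensure("pyarrow")
```

The `--user` flag avoids needing elevated permissions on the node; drop it if the start task already runs elevated.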


Hello @Samyak-3746,

Thanks for sharing additional details.

Could you please share the python script which you are using along with the error message which you are experiencing?


Hello @Samyak-3746,

Just checking in to see if you have had a chance to see the previous response. We need the requested information to understand and investigate this issue further.
