
braxx asked braxx answered

writing to parquet creates empty blob

When writing to parquet, I am getting an extra empty file created alongside the folder with data.

[screenshot: capture222.png — empty blob created next to the output folder]

I do not need it; it only creates clutter.

Here are the commands I tried; the file appeared in both cases.

 output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate
 ....
 childitems.write.mode('overwrite').parquet(output_path)


or

 output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate
 ....
 childitems.write.format("parquet").mode('overwrite').save(output_path)

How to get rid of this unwanted file?

azure-databricks

PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT commented

Hello @braxx,

Thanks for asking and using Microsoft Q&A.

This is expected behaviour: any Spark job creates these files when it runs.

[screenshot: image.png — files created by the Spark job]

Expected output:

[screenshot: image.png — expected output]

When DBIO transactional commit is enabled, metadata files starting with started<id> and committed<id> accompany the data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly; rather, use the VACUUM command to clean them up.
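As a minimal sketch, that cleanup could look like this in a notebook (this assumes a Databricks notebook where `spark` is in scope; the path and retention window below are illustrative placeholders, not taken from the question):

```python
# Hypothetical cleanup: remove leftover uncommitted DBIO files under an
# output directory. `spark` is the notebook's SparkSession; the path and
# the RETAIN window are placeholders.
spark.sql("VACUUM '/mnt/<container>/<folder>' RETAIN 48 HOURS")
```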

A combination of the three properties below disables writing all the transactional files that start with "_".


  1. Disable the transaction log of the Spark parquet write using

    spark.sql.sources.commitProtocolClass =
    org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

This disables the committed<TID> and started<TID> files, but the _SUCCESS, _common_metadata and _metadata files are still generated.

  2. Disable the _common_metadata and _metadata files using

    parquet.enable.summary-metadata=false

  3. Disable the _SUCCESS file using

    mapreduce.fileoutputcommitter.marksuccessfuljobs=false
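Putting the three properties together, a minimal sketch might look like this (it assumes a Databricks notebook where `spark`, the DataFrame `childitems`, and `output_path` are already defined, as in the question):

```python
# Suppress the extra marker files before writing parquet.
# Assumes `spark` (SparkSession), `childitems` (DataFrame) and `output_path`
# are already in scope, as in the question above.

# 1. disable the started<TID>/committed<TID> DBIO commit files
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

# 2. disable the _common_metadata and _metadata summary files
spark.conf.set("parquet.enable.summary-metadata", "false")

# 3. disable the _SUCCESS marker file
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

childitems.write.mode("overwrite").parquet(output_path)
```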

For more details, refer "Transactional Writes to Cloud Storage with DBIO" and "Stop Azure Databricks auto creating files" and "How do I prevent _success and _committed files in my write output?".

Hope this helps. Do let us know if you have any further queries.


Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.



Thank you for the explanation. That's helpful for sure, although my case is slightly different.

You have explained what is inside a folder created by Databricks.

[screenshot: capture31.png — contents of the folder created by Databricks]


I am OK with that and understand it. But if I go one level up, outside the folder, I see an empty blob with the same name as the folder. It is created alongside the folder, not inside it. See the screenshot, marked in yellow.

[screenshot: capture32.png — empty blob alongside the folder, marked in yellow]



Hello @braxx,

This looks strange. I could not find any files created outside the folder.

[screen recording: adb-writeparquet.gif]

In order to investigate further, could you please share the Databricks runtime version you are using, and a sample dataset to reproduce your scenario?


braxx replied to PRADEEPCHEEKATLA-MSFT:

Sure, I appreciate your help.

Steps to reproduce the issue:

  1. Save the attached JSON to blob storage:
    94349-sample-json.txt

  2. Mount the blob storage in Databricks.

  3. Run the attached script from a notebook in Databricks (adjust the input and output folders). The script parses the JSON and saves it as parquet.

94367-sample-notebook.txt


Here is what I believe is the runtime version: DBR 6.4 | Spark 2.4.5 | Scala 2.11.
But I think running this on a different cluster causes the same issue.
What is weird: when I delete the empty blob, the whole folder is deleted as well.


braxx answered

Thank you for your effort; I really appreciate it. Here is a related thread, also unsolved. Would it be possible to report this as a bug for the product team to investigate?

databricks-dbutils-creates-empty-blob-files-for-az.html




Maybe it is related to how I mounted the container?

 storagename = "AAAA"
 containername = "BBBB"
 saskey = dbutils.secrets.get(scope = "CCCCC", key = "DDDD")

 dbutils.fs.mount(
   source = "wasbs://" + containername + "@" + storagename + ".blob.core.windows.net/",
   mount_point = "/mnt/" + containername + "/",
   extra_configs = {"fs.azure.sas." + containername + "." + storagename + ".blob.core.windows.net": saskey})

