
braxx asked braxx answered

writing to parquet creates empty blob

When writing to parquet, I am getting an extra empty file created alongside the folder with data.

[screenshot: capture222.png — empty blob created next to the output folder]

I do not need it; it only creates clutter.

Here are the commands I tried; the file appeared in both cases.

 output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate
 ....
 childitems.write.mode('overwrite').parquet(output_path)


or

 output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate
 ....
 childitems.write.format("parquet").mode('overwrite').save(output_path)

How to get rid of this unwanted file?

azure-databricks

PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT commented

Hello @braxx,

Thanks for asking and using Microsoft Q&A.

This is expected behaviour: any Spark job creates these files when it runs.

[screenshot: image.png — files created by the Spark job]

Expected output:

[screenshot: image.png — expected output]

When DBIO transactional commit is enabled, metadata files starting with started<id> and committed<id> accompany the data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly; rather, use the VACUUM command to clean them up.
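As a minimal sketch, that cleanup could look like this in a notebook (this assumes a Databricks notebook where `spark` is in scope; the path and retention window below are illustrative placeholders, not taken from the question):

```python
# Hypothetical cleanup: remove leftover uncommitted DBIO files under an
# output directory. `spark` is the notebook's SparkSession; the path and
# the RETAIN window are placeholders.
spark.sql("VACUUM '/mnt/<container>/<folder>' RETAIN 48 HOURS")
```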

A combination of the three properties below disables writing all the transactional files that start with "_".


  1. Disable the transaction log of the Spark parquet write using

    spark.sql.sources.commitProtocolClass =
    org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

This disables the committed<TID> and started<TID> files, but the _SUCCESS, _common_metadata and _metadata files are still generated.

  2. Disable the _common_metadata and _metadata files using

    parquet.enable.summary-metadata=false

  3. Disable the _SUCCESS file using

    mapreduce.fileoutputcommitter.marksuccessfuljobs=false
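Putting the three properties together, a minimal sketch might look like this (it assumes a Databricks notebook where `spark`, the DataFrame `childitems`, and `output_path` are already defined, as in the question):

```python
# Suppress the extra marker files before writing parquet.
# Assumes `spark` (SparkSession), `childitems` (DataFrame) and `output_path`
# are already in scope, as in the question above.

# 1. disable the started<TID>/committed<TID> DBIO commit files
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

# 2. disable the _common_metadata and _metadata summary files
spark.conf.set("parquet.enable.summary-metadata", "false")

# 3. disable the _SUCCESS marker file
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

childitems.write.mode("overwrite").parquet(output_path)
```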

For more details, refer "Transactional Writes to Cloud Storage with DBIO" and "Stop Azure Databricks auto creating files" and "How do I prevent _success and _committed files in my write output?".

Hope this helps. Do let us know if you have any further queries.


Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.



Thank you for the explanation. That's helpful for sure, although my case is slightly different.

You have explained what is inside a folder created by Databricks.

[screenshot: capture31.png — contents of the folder created by Databricks]


I am OK with that and understand it. But if I go one level up, outside the folder, I see an empty blob with the same name as the folder. It is created alongside the folder, not inside it. See the screenshot, marked in yellow.

[screenshot: capture32.png — empty blob alongside the folder, marked in yellow]



Hello @braxx,

This looks strange. I could not find any files created outside the folder.

[screen recording: adb-writeparquet.gif]

In order to investigate further, could you please share the Databricks runtime version you are using, and a sample dataset to reproduce your scenario?


braxx replied to PRADEEPCHEEKATLA-MSFT:

Sure, I appreciate your help.

Steps to reproduce the issue:

  1. Save the attached JSON to blob storage:
    94349-sample-json.txt

  2. Mount the blob storage in Databricks.

  3. Run the attached script from a notebook in Databricks (adjust the input and output folders). The script parses the JSON and saves it as parquet.

94367-sample-notebook.txt


Here is what I believe is the runtime version: DBR 6.4 | Spark 2.4.5 | Scala 2.11.
But I think running this on a different cluster causes the same issue.
What is weird: when I delete the empty blob, the whole folder is deleted as well.


braxx answered

Thank you for your effort; I really appreciate it. Here is a related thread, also unsolved. Would it be possible to report this as a bug for the product team to investigate?

databricks-dbutils-creates-empty-blob-files-for-az.html




Maybe it is related to how I mounted the container?

 storagename = "AAAA"
 containername = "BBBB"
 saskey = dbutils.secrets.get(scope = "CCCCC", key = "DDDD")

 dbutils.fs.mount(
   source = "wasbs://" + containername + "@" + storagename + ".blob.core.windows.net/",
   mount_point = "/mnt/" + containername + "/",
   extra_configs = {"fs.azure.sas." + containername + "." + storagename + ".blob.core.windows.net": saskey})

