Writing to parquet creates an empty blob

braxx 426 Reputation points
2021-05-04T15:38:53.033+00:00

When writing to parquet, I am getting an extra empty file created alongside the folder with data.

[screenshot: 93590-capture222.png]

I do not need it; it only makes a mess.

Here are the commands I tried; I got this file in both cases.

output_path = "/mnt/container/folder/subfolder/sub_subfolder_" + currentdate
....
childitems.write.mode('overwrite').parquet(output_path)

or

output_path = "/mnt/container/folder/subfolder/sub_subfolder_" + currentdate
....
childitems.write.format("parquet").mode('overwrite').save(output_path)

How do I get rid of this unwanted file?

Azure Databricks

2 answers

  1. PRADEEPCHEEKATLA-MSFT 77,901 Reputation points Microsoft Employee
    2021-05-05T10:31:51.377+00:00

    Hello @braxx,

    Thanks for asking and using Microsoft Q&A.

    This is expected behaviour: running any Spark job creates these files.

    [screenshot: 93907-image.png]

    Expected output:

    [screenshot: 93908-image.png]

    When DBIO transactional commit is enabled, metadata files starting with started<id> and committed<id> will accompany data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly. Rather, use the VACUUM command to clean the files.
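
    As an illustration of that cleanup, here is a minimal sketch of running VACUUM from a notebook (assuming `spark` is the active SparkSession; the path and the retention window are illustrative placeholders, not values from this thread):

    # Clean up leftover uncommitted files under a directory.
    # The path and the 168-hour retention are placeholder values.
    spark.sql("VACUUM '/mnt/container/folder/subfolder/' RETAIN 168 HOURS")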

    A combination of the three properties below will disable writing all of these transactional files starting with "_"; a combined sketch follows the list.

    1. We can disable the transaction log of the Spark parquet write using:

       spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

       This disables the committed<TID> and started<TID> files, but the _SUCCESS, _common_metadata and _metadata files will still be generated.

    2. We can disable the _common_metadata and _metadata files using:

       parquet.enable.summary-metadata=false

    3. We can also disable the _SUCCESS file using:

       mapreduce.fileoutputcommitter.marksuccessfuljobs=false
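
    Putting the three settings together, a minimal sketch in PySpark (a Databricks notebook is assumed, with `spark` as the active SparkSession; `childitems` and `output_path` are reused from the question, and the Hadoop-level options are applied through the SparkContext's internal Hadoop configuration):

    # 1. Swap in the non-DBIO commit protocol (drops started<id>/committed<id>)
    spark.conf.set(
        "spark.sql.sources.commitProtocolClass",
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

    # 2 and 3. Hadoop-level settings: drop _common_metadata/_metadata and _SUCCESS
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("parquet.enable.summary-metadata", "false")
    hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    # Write as before; the extra "_" files should no longer appear
    childitems.write.mode('overwrite').parquet(output_path)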

    For more details, refer to "Transactional Writes to Cloud Storage with DBIO", "Stop Azure Databricks auto creating files", and "How do I prevent _success and _committed files in my write output?".

    Hope this helps. Do let us know if you have any further queries.

    ------------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.


  2. braxx 426 Reputation points
    2021-05-26T14:06:57.307+00:00

    Thank you for your effort, really appreciate it. Here is a related thread, also not solved. Would it be possible to report it as a bug for the product team to investigate?

    databricks-dbutils-creates-empty-blob-files-for-az.html

    Maybe it is related to how I mounted the container?

    # Mount the blob container via WASB, authenticating with a SAS key from a secret scope
    storagename = "AAAA"
    containername = "BBBB"
    saskey = dbutils.secrets.get(scope="CCCCC", key="DDDD")

    dbutils.fs.mount(
        source="wasbs://" + containername + "@" + storagename + ".blob.core.windows.net/",
        mount_point="/mnt/" + containername + "/",
        extra_configs={"fs.azure.sas." + containername + "." + storagename + ".blob.core.windows.net": saskey})
    