Proper cleanup of Spark _temporary directory after an exception

Pan, John 120 Reputation points
2024-04-19T21:54:08.9333333+00:00

We have a data collection app that periodically queries some data source, then writes the result to ADLS:

                df.write.mode('append').option('compression', 'gzip').parquet(rawDbPath)

Every other week, the data source performs maintenance work at midnight. During the maintenance window, the data source returned junk data for our query. That triggered an exception, which our application caught and handled. However, the exception still caused the running thread to crash with the following error:

24/04/19 00:26:47 WARN FileUtil: Failed to delete file or dir [/srv/gsdb/data/****/2024/4/19.parquet/_temporary/0/_temporary/attempt_202404190017509146436462045514343_56279_m_000001_255849]: it still exists.

The _temporary directory was created by Spark. What is the proper way to handle (or delete) it?

Azure Data Lake Storage

1 answer

  1. Vinodh247-1375 11,211 Reputation points
    2024-04-20T11:34:51.59+00:00

    Hi Pan, John,

    Thanks for reaching out to Microsoft Q&A.

    The _temporary directory is created by Spark during operations such as writing data to HDFS or other distributed file systems. It serves as a staging area for intermediate task output; once the job completes successfully, Spark moves (commits) the data from _temporary to the final output directory. If the Spark job fails or is interrupted, the directory may not be cleaned up automatically, and you may have to remove it manually.

    Use standard file system commands, for example rm -r _temporary on Unix-like systems (make sure you have the necessary permissions to delete files in that directory).
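
    If you prefer to do the cleanup from the application itself, here is a minimal sketch using the Hadoop FileSystem API through PySpark. It assumes spark is an active SparkSession, rawDbPath is the same output path as in your write call, and that no job is currently writing to that path; the _jsc/_jvm accessors are internal but commonly used for this:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Resolve the leftover _temporary directory under the output path.
        hadoop_conf = spark._jsc.hadoopConfiguration()
        tmp_path = spark._jvm.org.apache.hadoop.fs.Path(rawDbPath + "/_temporary")
        fs = tmp_path.getFileSystem(hadoop_conf)

        # Delete it recursively only if it exists.
        if fs.exists(tmp_path):
            fs.delete(tmp_path, True)  # True = recursive delete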

    When you rerun the job, Spark will recreate the _temporary directory as needed.

    Note: do not delete the _temporary directory while a Spark job is still running.

    For automated cleanup, you could also explore the following settings (a sketch showing how to set them follows this list):

    • spark.cleaner.referenceTracking.blocking

    • spark.cleaner.periodicGC.interval

    • spark.cleaner.periodicGC.checkpointInterval
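
    A minimal sketch of setting such options on the SparkSession builder; the values are illustrative only (not recommendations), and you should verify each property name against the Spark configuration docs for your version before relying on it:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("data-collector")  # hypothetical app name
            # Illustrative values only - tune for your workload.
            .config("spark.cleaner.referenceTracking.blocking", "true")
            .config("spark.cleaner.periodicGC.interval", "30min")
            .getOrCreate()
        )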

    Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

    0 comments