Proper cleanup of Spark _temporary directory after an exception

Pan, John 120 Reputation points
2024-04-19T21:54:08.9333333+00:00

We have a data collection app that periodically queries some data source, then writes the result to ADLS:

                df.write.mode('append').option('compression', 'gzip').parquet(rawDbPath)

Every other week, the data source performs maintenance work at midnight. During the maintenance window, the data source returned junk data for our query. That triggered an exception, which our application caught and handled. However, the exception still caused the running thread to crash with the following error:

24/04/19 00:26:47 WARN FileUtil: Failed to delete file or dir [/srv/gsdb/data/****/2024/4/19.parquet/_temporary/0/_temporary/attempt_202404190017509146436462045514343_56279_m_000001_255849]: it still exists.

The _temporary directory was created by Spark. What is the proper way to handle (or delete) it?

Azure Data Lake Storage

1 answer

  1. Vinodh247-1375 11,211 Reputation points
    2024-04-20T11:34:51.59+00:00

    Hi Pan, John,

    Thanks for reaching out to Microsoft Q&A.

    The _temporary directory is created by Spark during operations such as writing data to HDFS or other distributed file systems. It serves as a staging area for intermediate task output; once the job completes successfully, Spark moves (commits) the data from _temporary to the final output directory. If the Spark job fails or is interrupted, the directory may not be cleaned up automatically, and you may have to remove it manually.

    Use standard file system commands, for example rm -r _temporary on Unix-like systems (make sure you have the necessary permissions to delete files in that directory).
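
    If you prefer to do the cleanup from the application itself, here is a minimal sketch using the Hadoop FileSystem API through PySpark. It assumes spark is an active SparkSession, rawDbPath is the same output path as in your write call, and that no job is currently writing to that path; the _jsc/_jvm accessors are internal but commonly used for this:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Resolve the leftover _temporary directory under the output path.
        hadoop_conf = spark._jsc.hadoopConfiguration()
        tmp_path = spark._jvm.org.apache.hadoop.fs.Path(rawDbPath + "/_temporary")
        fs = tmp_path.getFileSystem(hadoop_conf)

        # Delete it recursively only if it exists.
        if fs.exists(tmp_path):
            fs.delete(tmp_path, True)  # True = recursive delete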

    When you rerun the job, Spark will recreate the _temporary directory as needed.

    Note: do not delete the _temporary directory while a Spark job is still running.

    For automated cleanup, you could also explore the following settings (a sketch showing how to set them follows this list):

    • spark.cleaner.referenceTracking.blocking

    • spark.cleaner.periodicGC.interval

    • spark.cleaner.periodicGC.checkpointInterval
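
    A minimal sketch of setting such options on the SparkSession builder; the values are illustrative only (not recommendations), and you should verify each property name against the Spark configuration docs for your version before relying on it:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("data-collector")  # hypothetical app name
            # Illustrative values only - tune for your workload.
            .config("spark.cleaner.referenceTracking.blocking", "true")
            .config("spark.cleaner.periodicGC.interval", "30min")
            .getOrCreate()
        )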

    Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

    0 comments