Hi Pan, John,
Thanks for reaching out to Microsoft Q&A.
The _temporary directory is created by Spark during operations such as writing data to HDFS or other distributed file systems. It serves as a staging area for intermediate data during map-reduce or shuffle tasks. Once a job completes successfully, Spark moves the data from _temporary to the final output directory. If the job fails or is interrupted, the directory may not be cleaned up automatically and has to be removed manually.
To remove it, use standard file system commands, for example rm -r _temporary on Unix-like systems (ensure you have the necessary permissions to delete files in that directory).
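A minimal sketch of the cleanup, using a hypothetical local path (/tmp/demo_output is a placeholder; substitute your job's actual output directory). For an HDFS destination, the equivalent command uses the hdfs dfs client:

```shell
# Simulate a leftover staging directory locally (illustration only).
mkdir -p /tmp/demo_output/_temporary/0

# Remove the stale staging directory on a local file system.
rm -r /tmp/demo_output/_temporary

# HDFS equivalent (hypothetical path; requires a running cluster):
#   hdfs dfs -rm -r /data/output/my_job/_temporary
```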
When you rerun the job, Spark will recreate the _temporary directory automatically as needed.
Note: do not delete the _temporary directory while the Spark job is still running.
For automated cleanup, try exploring the following Spark properties:
- spark.cleaner.referenceTracking.blocking
- spark.cleaner.periodicGC.interval
- spark.cleaner.referenceTracking.cleanCheckpoints
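These properties can be passed at submit time. A sketch, assuming a hypothetical application my_app.py and placeholder values (tune the GC interval to your workload; these settings drive Spark's periodic driver-side cleanup rather than the _temporary directory directly):

```shell
# Hypothetical submit command -- replace my_app.py and the values
# with ones appropriate for your cluster.
spark-submit \
  --conf spark.cleaner.referenceTracking.blocking=true \
  --conf spark.cleaner.periodicGC.interval=30min \
  --conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
  my_app.py
```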
Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.