Question

PingXiao-2145 asked:

Why can't I set spark.cleaner.referenceTracking.cleanCheckpoints in Databricks?

I am using dataframe.checkpoint(). I'd like to set spark.cleaner.referenceTracking.cleanCheckpoints to true, but when I call spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true'), I get the error "Cannot modify the value of a Spark config: spark.cleaner.referenceTracking.cleanCheckpoints". Why is that? And what is the default value of spark.cleaner.referenceTracking.cleanCheckpoints in Databricks?

azure-databricks

1 Answer

PRADEEPCHEEKATLA-MSFT answered:

Hello @PingXiao-2145,

Welcome to the Microsoft Q&A platform.

Note: Some Spark properties must be set at cluster startup and cannot be modified at runtime.

By default, spark.cleaner.referenceTracking.cleanCheckpoints is set to false.

If you want to set spark.cleaner.referenceTracking.cleanCheckpoints to true, you should set it in the Spark config under Advanced Options in the cluster configuration.
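For illustration, the Spark config box under Advanced Options takes one space-separated key/value pair per line; a minimal entry for this property would look like the following (this is a sketch of the config text, not a screenshot of an actual cluster):

```
spark.cleaner.referenceTracking.cleanCheckpoints true
```

The cluster must be restarted for the new value to take effect, since this property is read at startup.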

[Screenshot: Spark config under Advanced Options in the cluster configuration]

Before and after configuring the Spark configuration under Advanced options in the cluster:

[Screenshot: spark.cleaner.referenceTracking.cleanCheckpoints value before and after the change]

Hope this helps. Do let us know if you have any further queries.


Please "Accept the answer" if the information helped you. This will help us and others in the community as well.



Hello PRADEEPCHEEKATLA-MSFT, thanks a lot for your quick and detailed reply. I'm going to implement this.

A follow-up question: how do I know the checkpoint files are indeed removed? In Databricks, I tried to %sh ls the checkpoint dir I specified, but got "the dir doesn't exist" (though the checkpointing was successful). It seems Databricks doesn't allow me to access the checkpoint files. How, then, can I verify the files are cleaned up? For context, I'm checkpointing a dataframe.
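One likely explanation for the "dir doesn't exist" error (a sketch, assuming the checkpoint directory was set to a DBFS path; dbfs:/tmp/checkpoints below is a hypothetical path, not one from this thread): %sh runs against the driver node's local filesystem, so a dbfs:/ path is not visible to it directly. DBFS paths can instead be listed via dbutils.fs.ls, or through the /dbfs FUSE mount from a shell cell:

```
# Databricks notebook cells (illustrative; the path is an assumption)
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")  # hypothetical checkpoint dir

# List the checkpoint files with the DBFS utilities:
display(dbutils.fs.ls("dbfs:/tmp/checkpoints"))

# Or, from a %sh cell, go through the /dbfs mount rather than the raw path:
# %sh ls /dbfs/tmp/checkpoints
```

Listing the directory before and after the cleaner runs is one way to confirm whether checkpoint files are actually being removed.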


Hello @PingXiao-2145,

Glad to know above answer helped.

For the follow-up question, please create a new thread with the details of the command you tried and a screenshot of the error message.


Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.


Hello @PingXiao-2145,

Did you get a chance to accept it as the answer (by clicking "Accept Answer")? This can be beneficial to other community members. Thank you.

