question

1 Vote
CourtneyHaedke-0265 asked · Felix-5198 commented

Start up Synapse Spark Pool Once in ADF

Hello,

I am using Azure Synapse Analytics and have set up a single Spark pool.
I created 4 notebooks in Synapse and would like to schedule them to run in one ADF pipeline.
The problem is that each notebook takes about 4-5 minutes to run in ADF, totaling about 20 minutes.
When I run the notebooks through Synapse Studio, their run time is 30-60 seconds each.
The notebooks take so long in ADF because the Spark cluster restarts in each of the ADF Notebook activities.
Is there a way in ADF to start the Synapse Spark pool once and then run each notebook on the already-started pool?

Thanks,
Courtney

azure-synapse-analytics

Hi Courtney,

Synapse Pipelines has built-in integration for notebooks, but ADF currently does not. Can you share a little more detail on how you are triggering the notebooks in ADF?


Hi Samara,

Sorry for the delayed response.
The notebooks are being triggered within the pipeline itself.
There is a blob storage event trigger that fires whenever a new file arrives; what happens next depends on the file name:

1) First, the Blob Storage trigger detects the file.

2) Depending on the file name, a Switch activity determines which Spark notebook process to run.

[attachment: 79285-image.png]

3) The matching switch case then kicks off the notebook.

[attachment: 79272-image.png]



Thank you for clarifying.

Are all of the notebooks assigned to the same Spark pool?

I will make internal inquiries.

0 Votes
MartinJaffer-MSFT answered · Felix-5198 commented

@CourtneyHaedke-0265

If you set the "Auto termination" time to be long enough, the same cluster should be used for consecutive jobs in the same pipeline.



MartinJaffer-MSFT,
I realize this is a year later; however, I have the same observation as Courtney. If I run a notebook task in a ForEach loop, each execution of the notebook starts a Spark cluster. The idle time of the Spark pool is set to 15 minutes, so I don't really understand why the pool is not being reused by the notebook task. Any guidance on how to optimize this process would be appreciated.
I did not find any "Auto termination" setting as noted in the above answer.
Mark

Felix-5198 (replying to MarkWojciechowicz-1790):

Hi! Were you able to find a solution to this? I am experiencing the exact same issue, trying to sequentially execute the same notebook with different parameters using a Synapse pipeline. Even if I specify to use the same cluster, each execution becomes its own Spark application, requiring it to restart the cluster and ask for cores. This adds 1-3 minutes to an otherwise 30-second execution time, meaning some 80% of my execution time is just starting up clusters. This can't be the only/intended way of using notebooks in pipelines? :/

0 Votes
MarkWojciechowicz-1790 answered · Felix-5198 commented

@Felix-5198 Not really. My workaround wound up being to call the notebook once and send the parameters in as an array, then loop through that array in the notebook. You lose the parallel execution in ADF, but it's still way faster than firing up a bunch of Spark clusters.
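A minimal sketch of that pattern (the parameter name, file names, and processing step are my own assumptions, not from Mark's actual pipeline): the pipeline passes all per-file parameters as one JSON-encoded array, and the notebook parses it and loops, so only one Spark session ever starts.

```python
import json

# In Synapse this cell would be tagged as the "parameters" cell, and the
# pipeline's Notebook activity would override the value, e.g. with
# @string(pipeline().parameters.fileNames). Hypothetical default shown here.
file_names_json = '["sales_2021.csv", "inventory_2021.csv"]'

def process_file(name):
    # Placeholder for the real per-file Spark work, e.g. a
    # spark.read / transform / write sequence keyed on the file name.
    print(f"processing {name}")
    return name

# One Spark session and one loop, instead of one Notebook activity per file.
results = [process_file(name) for name in json.loads(file_names_json)]
```

The trade-off is exactly the one Mark describes: the files are handled sequentially rather than by parallel pipeline activities.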


I'm faced with the same behaviour: the playback in the "Apache Spark Applications" view shows an execution time of 48 seconds, but the total duration is around 4m29s. I've thought about the same workaround (calling the notebook once and sending the parameters in as an array), but this is different from how I intended to do it in the first place.

Do we know whether this behaviour is by design, or is it some kind of bug?

Felix-5198 (replying to KristofDeMiddelaer-0472):

@KristofDeMiddelaer-0472 Yeah, this is also a very bad deal, seeing how, from my understanding, you actually pay for the entire time you utilize the cluster, including startup time. I haven't had this verified yet, but based on my own testing and billing it seems to be the case. This means that my billing is around 80% startup time and 20% actual execution.


@MarkWojciechowicz-1790 Yeah, I've landed on a similar solution myself, where the notebook is started once and then looks up all the needed configuration in a supporting table and loops through the applicable executions. But this is really not a good solution, seeing how it means no parallel executions (which for me is a big deal).

I really hope this is solved somehow down the line. My understanding is that Databricks would be an alternative here, with its always-online, auto-scaling clusters. I might have a look at that, but I was hoping to be able to stay fully in Synapse for my project :/
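One way to win back some of the parallelism lost by the single-notebook workaround (a sketch under my own assumptions, not something proposed in this thread) is to fan the per-item work out over a thread pool inside the notebook. Spark jobs submitted from different driver threads can run concurrently on the same session, so the driver threads mostly just wait on job completion.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-item work; in the real notebook this would submit
# Spark reads/writes, which the shared session schedules concurrently.
def process_item(config):
    return f"done: {config['name']}"

# Stand-in for the configuration rows looked up from the supporting table.
configs = [{"name": "file_a"}, {"name": "file_b"}, {"name": "file_c"}]

# Plain threads suffice here: the heavy lifting happens on the Spark
# executors, not in the driver's Python process.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_item, configs))
```

This keeps the single cluster startup while letting independent items overlap, at the cost of managing concurrency (and any shared state) yourself inside the notebook.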
