Stop Slurm nodes deallocating/terminating whilst there are still jobs in the queue?

Matt Jackson 6 Reputation points
2021-03-31T08:32:57.307+00:00

Hi,

I was wondering whether I am missing some configuration parameters in CycleCloud (or Slurm) that would prevent Slurm nodes from being spun up and down for single jobs.

I have disabled autoscaling for now and set nodes to deallocate rather than terminate on stop. If I start a group of nodes before submitting any jobs, they immediately run the jobs at the top of the queue. However, those nodes then deallocate, and new nodes (of the same size) spin up to handle the next jobs in the queue.

Given the time it takes to acquire the VMs, is there a way to stop the 'Ready' nodes from deallocating so they can be used for the next jobs in the queue?
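
For reference, the behaviour seems to be governed by Slurm's power-saving settings; this is a quick way to see what the cluster is currently using (the slurm.conf values in the comments are only illustrative, not what CycleCloud sets by default):

    # Show the power-saving settings that govern how long an idle node stays up
    # before Slurm powers it down, and how long Slurm waits for a resumed node.
    scontrol show config | grep -Ei 'suspend|resume'

    # In slurm.conf, a larger SuspendTime keeps idle nodes powered on for longer
    # so queued jobs can land on them (values here are purely illustrative):
    #   SuspendTime=1800        # seconds a node may sit idle before power-down; -1 disables
    #   SuspendExcNodes=...     # nodes that should never be powered down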

Thanks

Azure CycleCloud
A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.

3 answers

  1. Matt Jackson 6 Reputation points
    2021-04-07T07:59:50.777+00:00

    I was using the default Slurm cluster configuration, set up via the CycleCloud GUI, but with Slurm v19.
    I will try a cluster using v20 to see whether I get the expected behaviour, and will provide an update.

    Thanks for the advice.
    Matt

    1 person found this answer helpful.

  2. KarishmaTiwari-MSFT 18,642 Reputation points Microsoft Employee
    2021-04-07T01:25:44.087+00:00

    @Matt Jackson Could you please share which version of Slurm this is? Is it an entirely CycleCloud-managed cluster with our
    out-of-the-box configuration, or is it a custom installation? If you've modified slurm.conf, could you share that as well?

    We depend entirely on Slurm for job allocation, and depending on the configuration and timing there are certainly scenarios where Slurm will spin up a new node rather than reuse an existing one. In general, though, we see the expected behavior (idle nodes being reused) far more often than we see Slurm refusing to reuse them.

    Also, Slurm v19 and earlier do not reuse idle nodes that are already powered on; this is known behavior for those versions. Starting with Slurm v20, idle nodes that are already powered on are reused instead of new nodes being spun up.
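
    To confirm which case you are hitting, a quick check along these lines may help (standard Slurm commands, shown only as a sketch):

        # Confirm the Slurm version the cluster is actually running
        sinfo --version

        # Watch whether powered-on idle nodes are reused: with power saving enabled,
        # plain 'idle' nodes are up, while 'idle~' nodes are powered down and would
        # have to be resumed before they can accept a job.
        sinfo
        squeue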

    Please let me know the details and I can help further. Thanks.


  3. Matt Jackson 6 Reputation points
    2021-04-09T07:53:50.723+00:00

    I did a quick test using v20. It did seem to keep some of the nodes active whilst there were still jobs in the queue.

    I am still seeing issues with the time it takes to spin nodes up and down, which seems to be causing problems for the job creation workflow I am using (Nextflow), but I think that is independent of the original question I raised.
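
    One mitigation I am considering, although I have not verified it yet, is to exclude a small pool of nodes from Slurm's power-down so they stay available between the short jobs Nextflow submits (the node names below are hypothetical):

        # In slurm.conf, keep a warm pool of nodes that Slurm never powers down
        # (hypothetical node names):
        #   SuspendExcNodes=hpc-pg0-[1-4]
        # then ask slurmctld to reread its configuration:
        scontrol reconfigure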

    Thanks again,
    Matt
