ShannonCosgrove-3000 asked vipullag-MSFT commented

Azure CycleCloud config file does not match the number of CPUs on the nodes in each cluster

Hi everyone.

I am using Slurm to run a script on Azure CycleCloud, and the script is meant to use all of the cores. When I run it on the cluster, it only uses half of the cores on the node. The cyclecloud.conf and slurm.conf files specify 16 CPUs instead of the 32 on the nodes. When I change the conf file to the correct number of CPUs (32), the script still does not run on all of them. If I change the conf file AND remove/rescale the nodes, it still does not run on all of them.

Please let me know if anyone can help. It seems like I need to change the CPUs in something that is controlling the conf files.

azure-cyclecloud

vipullag-MSFT answered ShannonCosgrove-3000 commented

@ShannonCosgrove-3000

Thanks for reaching out to Microsoft Q&A Platform.

Those machines may have hyper-threading enabled, which would explain why only half of the core count is being picked up.

You can try changing cyclecloud.conf so that CPUs matches the node (32) with ThreadsPerCore=1, and then restart slurmctld.

Note that these changes will be overwritten once you run the remove_nodes/scale command. To verify the result, run "scontrol show nodes" and check the "CPUTot" value.
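
As a rough way to script that check, the sketch below parses text in the format printed by "scontrol show nodes" and pulls out CPUTot and ThreadsPerCore. The helper name parse_scontrol_fields, the sample text, and the expected_vcpus value are illustrative assumptions, and the parser only handles the output of a single node.

```python
import re

def parse_scontrol_fields(text):
    """Collect KEY=VALUE tokens from `scontrol show nodes` style output."""
    # The fields of interest (CPUTot, ThreadsPerCore) never contain spaces,
    # so a simple KEY=non-whitespace pattern is enough for this check.
    return dict(re.findall(r"(\w+)=(\S+)", text))

# Illustrative sample for one node; in practice this text would come from
# running `scontrol show nodes` on the scheduler.
sample = """\
CPUAlloc=0 CPUTot=16 CPULoad=0.00
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
"""

fields = parse_scontrol_fields(sample)
cpu_tot = int(fields["CPUTot"])
threads_per_core = int(fields["ThreadsPerCore"])
expected_vcpus = 32  # assumption: vCPU count advertised for the VM size

if cpu_tot < expected_vcpus:
    print(f"Slurm sees {cpu_tot} CPUs but the VM should provide {expected_vcpus}")
else:
    print(f"Slurm sees all {cpu_tot} CPUs")
```

If CPUTot is still below the VM's vCPU count after restarting slurmctld, the edited configuration has probably not taken effect or has been regenerated.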

Hope this helps.
Please 'Accept as answer' if the provided information is helpful, so that it can help others in the community looking for help on similar topics.


ShannonCosgrove-3000 answered vipullag-MSFT commented

Hi! Thanks @vipullag-MSFT for your answer. I ran scontrol show nodes, and this is what it shows for this type of machine (Standard_F32s_v2):

CPUAlloc=0 CPUTot=16 CPULoad=11.72
AvailableFeatures=cloud
ActiveFeatures=cloud
Gres=(null)
NodeAddr=hpc-pg0-1 NodeHostName=hpc-pg0-1 Port=0 Version=19.05.8
OS=Linux 5.4.0-1064-azure #67~18.04.1-Ubuntu SMP Wed Nov 10 11:38:21 UTC 2021
RealMemory=62259 AllocMem=0 FreeMem=60022 Sockets=16 Boards=1
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=hpc
BootTime=2022-01-14T02:04:35 SlurmdStartTime=2022-01-14T02:10:58
CfgTRES=cpu=16,mem=62259M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


I don't see anywhere in here that there are two threads per core or two sockets, so I understand why Slurm is getting confused and only running on half of the cores. But I used multiprocessing.cpu_count(), and the multiprocessing function sees 32 cores. Any idea what I need to change in the Slurm script to accommodate this type of machine?
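
One way to see the mismatch from inside the job is to compare what Python reports for the whole machine with what the process is actually allowed to use. This is only a minimal sketch: os.sched_getaffinity is Linux-only, SLURM_CPUS_ON_NODE is an environment variable Slurm sets inside a job, and the helper name usable_cpus is made up for illustration. The affinity count only reflects Slurm's limit if CPU binding is enabled on the cluster.

```python
import multiprocessing
import os

def usable_cpus():
    """Return the number of CPUs this process may actually run on."""
    # sched_getaffinity respects any CPU mask imposed on the process;
    # it only exists on Linux, so fall back to the machine-wide count.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return multiprocessing.cpu_count()

logical = multiprocessing.cpu_count()               # all logical CPUs on the VM
allowed = usable_cpus()                             # CPUs this process can use
slurm_cpus = os.environ.get("SLURM_CPUS_ON_NODE")   # None outside a Slurm job

print(f"logical={logical} allowed={allowed} SLURM_CPUS_ON_NODE={slurm_cpus}")
```

Run from within the batch script, a result such as logical=32 with SLURM_CPUS_ON_NODE=16 would confirm that the VM exposes 32 logical CPUs while Slurm only schedules 16 of them.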


@ShannonCosgrove-3000

Can you try this:

Set MPI_OPTIONS="--use-hwthread-cpus" when submitting the MPI jobs.
I came across a similar issue, and this change helped another customer run on all 32 CPUs for the selected VM type.
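
For illustration, this is roughly where such a flag ends up on an Open MPI mpirun command line. The sketch assumes MPI_OPTIONS is simply a string of extra arguments that the job script passes through to mpirun; the helper build_mpirun_command and the program name ./my_app are hypothetical.

```python
import shlex

def build_mpirun_command(n_procs, program, mpi_options=""):
    """Assemble an mpirun argument list with optional extra flags."""
    # shlex.split turns the option string into separate arguments,
    # the same way a shell would when expanding the variable unquoted.
    return ["mpirun", *shlex.split(mpi_options), "-np", str(n_procs), program]

# Hypothetical job: 32 ranks, one per hardware thread.
cmd = build_mpirun_command(32, "./my_app", "--use-hwthread-cpus")
print(shlex.join(cmd))
```

With --use-hwthread-cpus, Open MPI treats each hardware thread as a separate CPU, so a 32-rank launch can fit on a node with 16 physical cores and two threads per core.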
