BlackBradyP-6004 asked vipullag-MSFT commented

Azure CycleCloud: Unable to scale to 16 nodes and finalize cluster build

We are attempting to spin up a 16-node HPC cluster using CycleCloud. Our CycleCloud system is configured to allow up to 100 HPC nodes and 1024 cores. When we attempt to start 16 HC44rs nodes, only about 14 come up. The remaining 2 begin provisioning but appear to hang on the Ganglia install, so they never finish the full install and Slurm cannot release the job.

Any hints or areas for troubleshooting?


@BlackBradyP-6004

I am checking with the internal team on this and will post an update here.


@BlackBradyP-6004
Can you share the version of CycleCloud you are running?
Click the “?” icon at the top right of the GUI; it opens the About page, which shows the version.

If you are running a version of CycleCloud earlier than 7.9.6, please upgrade to 7.9.6. The Ganglia recipes were updated in that version to accommodate some OS changes.
Here is the doc on how to upgrade.
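
If it helps, here is a rough sketch of the "earlier than 7.9.6" check in shell. It is not a CycleCloud tool: `version_lt` and `CURRENT_VERSION` are hypothetical names, the default value is only a placeholder for the version shown on your About page, and it assumes a `sort` that supports `-V` (GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is strictly older than $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Placeholder value; replace with the version shown on the About page.
CURRENT_VERSION="${CURRENT_VERSION:-7.9.5}"

if version_lt "$CURRENT_VERSION" "7.9.6"; then
    echo "CycleCloud $CURRENT_VERSION is older than 7.9.6: upgrade recommended"
else
    echo "CycleCloud $CURRENT_VERSION is 7.9.6 or later"
fi
```

A version sort is used rather than a plain string comparison so that a release such as 7.10.0 is treated as newer than 7.9.6.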

If you are already on 7.9.6 or later, please open the “Show Details” dialog for one of the nodes that is failing to start and share a screenshot.
To troubleshoot further, please SSH to one of the failing nodes and send us the logs:

 # ssh to failing node, then:
 sudo -i
 cd /opt/cycle/jetpack
 tar czf /tmp/logs.tgz logs
 exit
 # scp the logs.tgz back 
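
Before sending the archive, a quick look at which log files mention Ganglia may narrow things down. The following is only a sketch: `scan_logs` is a hypothetical helper, the keyword "ganglia" is an assumption based on where the install appears to hang, and the log directory is the one used in the commands above. It assumes a `grep` that supports `-r` (GNU and BSD grep both do).

```shell
#!/bin/sh
# Hypothetical helper: print the names of files under directory $1
# that contain keyword $2 (case-insensitive), sorted by name.
scan_logs() {
    dir="$1"
    keyword="$2"
    # -r: recurse, -i: ignore case, -l: print only matching file names
    grep -ril -- "$keyword" "$dir" 2>/dev/null | sort
}

# Log directory from the steps above; override LOG_DIR to scan elsewhere.
LOG_DIR="${LOG_DIR:-/opt/cycle/jetpack/logs}"

# Only scan when the directory exists, so the snippet is harmless off-node.
if [ -d "$LOG_DIR" ]; then
    scan_logs "$LOG_DIR" "ganglia"
fi
```

Run it as root on the failing node (for example after `sudo -i`), since the log files may not be readable by a regular user.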

Please email the logs to AzCommunity@microsoft.com with the subject "ATTN: Vikas" and include a link to this thread in the email body.




@BlackBradyP-6004

Any update on the issue?

Just checking in to see if you got a chance to look at the troubleshooting suggestions provided.


@BlackBradyP-6004

Any update on the issue?


0 Answers