LisettevanLeeuwen-4831 asked · prmanhas-MSFT commented

HPC stalled tasks

Hi,

At random, some of our calculations seem to be stalling; one HPC task never finishes. The corresponding HpcServiceHost.exe process on the compute node seems to continue using resources. Manually killing the process causes the job to reschedule the task, after which it finishes correctly.

We added console logging to our application. The output of the stalled task shows the start log line but never the finish log line. We then checked whether a bug in our application causes the hang, but we couldn't find anything.
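To illustrate the kind of check we did on the task output, here is a minimal Python sketch that pairs start and finish log lines and reports the tasks that never logged a finish. The `START task=<id>` / `FINISHED task=<id>` line format and the `find_unfinished` helper are assumptions made up for this example; our real log lines look different, and the application itself is not written in Python.

```python
import re

# Hypothetical log line formats, chosen only for this illustration.
START_RE = re.compile(r"^START task=(\S+)")
FINISH_RE = re.compile(r"^FINISHED task=(\S+)")


def find_unfinished(lines):
    """Return the IDs of tasks that logged a start line but no finish line.

    IDs are returned in the order in which their start lines appeared.
    """
    started = []
    finished = set()
    for raw in lines:
        line = raw.strip()
        match = START_RE.match(line)
        if match:
            started.append(match.group(1))
            continue
        match = FINISH_RE.match(line)
        if match:
            finished.add(match.group(1))
    return [task_id for task_id in started if task_id not in finished]


if __name__ == "__main__":
    sample_output = [
        "START task=1",
        "START task=2",
        "FINISHED task=1",
        "START task=3",
        "FINISHED task=3",
    ]
    # Task 2 started but never reported a finish line.
    print(find_unfinished(sample_output))
```

A check like this only shows which task is stuck, not why; in our case it always points at a single task per affected job.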

Next, we looked at the SOA traces and the log files (log level 4) on the head node and the compute node. The only thing we found is that the compute node never sends a message that the task has finished, so the head node never receives one, which is what I expected. Interestingly, the SOA traces of the compute node stop once all the other tasks (except the stalled one) have finished, while the SOA traces of the head node continue until the job is manually canceled.

Do you have any thoughts on what could be the problem or where we could look to determine the cause?

We're hosting an on-premises HPC cluster on HPC Pack 2016 Update 3, and we use the NuGet SDK to connect to the cluster.

Regards,
Lisette

azure-hpc-pack

@LisettevanLeeuwen-4831 Just following up to check if you got a chance to go through my previous response?

Do let me know in case of any queries.

Thanks


0 Answers