Hi,
Some of our calculations stall at random: one HPC task in the job never finishes. The corresponding HpcServiceHost.exe process on the compute node seems to keep using resources. Manually killing the process causes the job to reschedule the task, after which it finishes correctly.
We added some console logging to our application, and the task output shows the "started" log line but not the "finished" log line. We therefore tried to determine whether the problem is caused by a bug in our application, but we couldn't find anything.
Next, we looked at the SOA traces and the log files (log level 4) on the head node and the compute node. The only thing we found is that the compute node never sends a message that the task has finished, so the head node never receives one, which is what I expected. Interestingly, the SOA traces on the compute node stop once all the other tasks (all except the stalled one) have finished, while the SOA traces on the head node continue until the job is manually canceled.
Do you have any thoughts on what could be the problem or where we could look to determine the cause?
We're hosting an on-premises HPC cluster running HPC Pack 2016 Update 3, and we use the NuGet SDK to connect to the cluster.
Regards,
Lisette