HPC Pack 2019 - Unable to connect to the head node when running Cluster Manager on the head node server

Oscar 6 Reputation points
2021-08-04T11:36:32.713+00:00

I'm having trouble starting HPC Cluster Manager and HPC Job Manager locally on the head node; I get the following error almost every time (occasionally it works):

Unable to connect to the head node.

The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.InternalBind(EndPoint localEP)
   at System.Net.Sockets.Socket.BeginConnectEx(EndPoint remoteEP, Boolean flowContext, AsyncCallback callback, Object state)
   at System.Net.Sockets.Socket.UnsafeBeginConnect(EndPoint remoteEP, AsyncCallback callback, Object state)
   at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
   --- End of inner exception stack trace ---
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at System.Net.Http.HttpClientHandler.GetResponseCallback(IAsyncResult ar)
   --- End of inner exception stack trace ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Hpc.HttpClientExtension.<>c__DisplayClass5_0.<

3 answers

  1. Yutong Sun 261 Reputation points Microsoft Employee
    2021-08-05T15:06:31.377+00:00

    Hi SvenssonOscar,

    This issue could be caused by the depletion of max user ports for TCP connections. You may run the following commands to check and modify the max user ports.

    netsh int ipv4 show dynamicport tcp

    netsh int ipv4 set dynamicport tcp start=10000 num=55536
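
    To confirm that the dynamic ports are actually being exhausted, a rough check with the built-in netstat and find tools (the counts are only an approximation of port usage) is:

        rem Approximate number of TCP endpoints currently in use
        netstat -ano -p tcp | find /c ":"

        rem Sockets lingering in TIME_WAIT, which still occupy a dynamic port
        netstat -ano -p tcp | find /c "TIME_WAIT"

    If these numbers approach the "Number of Ports" value reported by the show command above, the dynamic range is exhausted and new outbound connections will fail with the buffer/queue error shown in the question.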

    Regards,
    Yutong Sun

    1 person found this answer helpful.

  2. Kuenne, Michael 1 Reputation point
    2021-12-20T09:09:45.553+00:00

    Hello everyone.

    We are running into the very same problem described at the beginning.

    Using Windows Server 2016 Standard (14393.4770) and HPC Pack 2016 (5.2.6291.0).

    We already changed the port range to

    netsh int ipv4 show dynamicport tcp
    
        Protocol tcp Dynamic Port Range
        ---------------------------------
        Start Port      : 1025
        Number of Ports : 40000
    

    With that change it ran longer, but eventually the same error behavior occurred again.

    It seems to me that the HPC scheduler opens a lot of Winsock-based connections until it exhausts the configured number of ports.
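
    A rough way to check which process is holding those connections, using only the built-in netstat and tasklist commands (the PID value below is just a placeholder to be taken from the netstat output), is:

        rem List established TCP connections together with the owning process ID (last column)
        netstat -ano -p tcp | findstr ESTABLISHED

        rem Resolve one of the PIDs from the last column to a process name
        tasklist /fi "PID eq 1234"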

    Is there a way to resolve this without restarting the server?

    Many thanks in advance and regards,
    Michael


  3. saszhuqing 1 Reputation point
    2022-11-25T08:15:30.907+00:00

    Hi SvenssonOscar,
    I got the same problem with Windows Server 2022 and HPC Pack 2019 Update 1, and I think it was caused by TLS 1.0 being disabled.
    I used "IISCrypto.exe" on the head node, clicked "Best Practices", and rebooted; after that everything was OK.
