HPC Pack 2019 - Unable to connect to the head node when running Cluster Manager on the head node server

Oscar 6 Reputation points
2021-08-04T11:36:32.713+00:00

I'm having trouble starting HPC Cluster Manager and HPC Job Manager locally on the head node; I get the following error almost every time (occasionally it works):

Unable to connect to the head node.

The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.InternalBind(EndPoint localEP)
   at System.Net.Sockets.Socket.BeginConnectEx(EndPoint remoteEP, Boolean flowContext, AsyncCallback callback, Object state)
   at System.Net.Sockets.Socket.UnsafeBeginConnect(EndPoint remoteEP, AsyncCallback callback, Object state)
   at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
   --- End of inner exception stack trace ---
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at System.Net.Http.HttpClientHandler.GetResponseCallback(IAsyncResult ar)
   --- End of inner exception stack trace ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Hpc.HttpClientExtension.<>c__DisplayClass5_0.<

3 answers

  1. Yutong Sun 261 Reputation points Microsoft Employee
    2021-08-05T15:06:31.377+00:00

    Hi SvenssonOscar,

    This issue could be caused by the depletion of max user ports for TCP connections. You may run the following commands to check and modify the max user ports.

    netsh int ipv4 show dynamicport tcp

    netsh int ipv4 set dynamicport tcp start=10000 num=55536
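
    To confirm that the dynamic ports are actually being exhausted, a rough check with the built-in netstat and find tools (the counts are only an approximation of port usage) is:

        rem Approximate number of TCP endpoints currently in use
        netstat -ano -p tcp | find /c ":"

        rem Sockets lingering in TIME_WAIT, which still occupy a dynamic port
        netstat -ano -p tcp | find /c "TIME_WAIT"

    If these numbers approach the "Number of Ports" value reported by the show command above, the dynamic range is exhausted and new outbound connections will fail with the buffer/queue error shown in the question.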

    Regards,
    Yutong Sun

    1 person found this answer helpful.

  2. Kuenne, Michael 1 Reputation point
    2021-12-20T09:09:45.553+00:00

    Hello everyone.

    We are running into the very same problem described at the beginning.

    Using Windows Server 2016 Standard (14393.4770) and HPC Pack 2016 (5.2.6291.0).

    We already changed the port range to

    netsh int ipv4 show dynamicport tcp
    
        Protocol tcp Dynamic Port Range
        ---------------------------------
        Start Port      : 1025
        Number of Ports : 40000
    

    With that change it ran longer, but eventually the same error behavior occurred again.

    It seems to me that the HPC scheduler opens a lot of Winsock-based connections until it exhausts the configured number of ports.
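
    A rough way to check which process is holding those connections, using only the built-in netstat and tasklist commands (the PID value below is just a placeholder to be taken from the netstat output), is:

        rem List established TCP connections together with the owning process ID (last column)
        netstat -ano -p tcp | findstr ESTABLISHED

        rem Resolve one of the PIDs from the last column to a process name
        tasklist /fi "PID eq 1234"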

    Is there a way to resolve this without restarting the server?

    Many thanks in advance and regards,
    Michael


  3. saszhuqing 1 Reputation point
    2022-11-25T08:15:30.907+00:00

    Hi SvenssonOscar,
    I got the same problem with Windows Server 2022 and HPC Pack 2019 Update 1, and I think it was caused by TLS 1.0 being disabled.
    I used "IISCrypto.exe" on the head node, clicked "Best Practices", and rebooted; after that everything was OK.
