The case of the server who couldn’t join a cluster – operation returned because the timeout period expired

Hi everyone! Today I am going to present a very interesting case which we recently handled in EMEA CSS. The customer wanted to build a Windows Server 2008 SP2 x64 Failover Cluster but wasn’t able to add any nodes to it. The “Add Node” wizard timed out and the following error was visible in the “addnode.mht” file in the %SystemRoot%\Cluster\reports directory:

The server “nodename” could not be added to the cluster. An error occurred while adding node “nodename” to cluster “clustername” . This operation returned because the timeout period expired.

First of all let me give you some details of the setup:

- Windows Server 2008 SP2 x64 Geospan Failover Cluster
- Node + Disk majority quorum model
- EMC Symmetrix storage
- Powerpath 5.3 SP1
- EMC Cluster Extender 3.1.0.20
- Mpio.sys 6.0:6002.18005 / Storport.sys 6.0:6002.22128 / Qlogic QL2300 HBA with v 9.1:8.25
- Broadcom NetXtreme II NICs v. 4.6:55.0

We initially saw some Physical Disk related errors/warnings in the validation tests:

- SCSI page 83h VPD descriptors for the same cluster disk 0 from different nodes do not match
- Successfully put PR reserve on cluster disk 0 from node “nodename” while it should have failed
- Cluster Disk 0 does not support Persistent Reservation
- Cluster Disk 1 does not support Persistent Reservation
- Failed to access cluster disk 0, or disk access latency of 0 ms from node “nodename” is more than the acceptable limit of 3000 ms status 31

validate

The following was also noticed in the cluster log file:

00000f94.00001690::10:21:48.502 CprepDiskFind: found disk with signature 0xa34c96dc
00000f94.00001690::10:21:48.504 CprepDiskPRUnRegister: Failed to unregister PR key, status 170
00000f94.00001690::10:21:48.504 CprepDiskPRUnRegister: Exit CprepDiskPRUnRegister: hr 0x800700aa

As described in the “ Microsoft Support Policy for Windows Server 2008 Failover Clusters ”, this behavior can be ignored for this particular (geographically dispersed) cluster due to the data replication model in use.

"Geographically dispersed clusters - Failover Cluster solutions that do not have a common shared disk and instead use data replication between nodes may not pass the Cluster Validation storage tests. This is a common configuration in cluster solutions where nodes are stretched across geographic regions. Cluster solutions that do not require external shared storage failover from one node to another are not required to pass the Storage tests in order to be a fully supported solution. Please contact the data replication vendor for any issues related to accessing data on failover"

From this point on, what we needed to focus on was a pure connectivity issue between the nodes. As we cross referenced several cluster related log files, we noticed this particular error from the Cluster Server, approximately at the same time the wizard gave up due to timeouts:

000012f8.00000758::2010/04/09-13:15:44.518 INFO [NETFT] Disabling IP autoconfiguration on the NetFT adapter.
000012f8.00000758::2010/04/09-13:15:44.518 INFO [NETFT] Disabling DHCP on the NetFT adapter.
000012f8.00000758::2010/04/09-13:15:44.518 WARN [NETFT] Caught exception trying to disable DHCP and autoconfiguration for the NetFT adapter. This may prevent this node from participating in a cluster if IPv6 is disabled.
000012f8.00000758::2010/04/09-13:15:44.518 WARN [NETFT] Failed to disable DHCP on the NetFT adapter (ethernet_7), error 1722.
[…]
00001460.00001038::2010/04/09-13:16:44.987 ERR [CS] Service OnStart Failed, WSAEADDRINUSE(10048)' because of '::bind(handle, localEP, localEP.AddressLength )'

In addition, we caught this warning Event ID 1555 from Failover Clustering in the system event log of the node:
event

A good description of this event can be found on TechNet:

Event ID 1555 – Cluster Network Connectivity
http://technet.microsoft.com/en-us/library/cc773407(WS.10).aspx

A failover cluster requires network connectivity among nodes and between clients and nodes. Problems with a network adapter or other network device (either physical problems or configuration problems) can interfere with connectivity. Attempting to use IPv4 for '%1' network adapter failed. This was due to a failure to disable IPv4 auto-configuration and DHCP. IPv6 will be used if enabled; else this node may not be able to participate in a failover cluster.

The following troubleshooting steps were implemented, settings were checked but unfortunately the behavior did not improve:

· IPV6 was un-checked as a protocol -> re-enabled it
· Disable Teredo as per KB969256 -> doesn’t apply to SP2
· Disable Application Layer Gateway service as per KB943710 –> done
· Stop / Disable service & uninstall Sophos Antivirus software from both nodes –> done
· Virtual MAC of the NetFT adapter was different on both nodes (they are identical when the nodes are sysprepped after installing the Failover Cluster feature)
· Disabled all physical NICs except one on both nodes & configured it as “allow cluster to use this network” as well as “allow clients to connect over this network
· Disabled TCP Chimney, RSS and NetDMA manually via netsh as per KB951037
netsh int tcp set global RSS=Disabled
netsh int tcp set global chimney=Disabled
netsh int tcp set global autotuninglevel=Disabled
netsh int tcp set global congestionprovider=None
netsh int tcp set global ecncapatibility=Disabled
netsh int ip set global taskoffload=disabled
netsh int tcp set global timestamps=Disabled
[Dword] EnableTCPA=0 in “HKLM\SYSTEM\CCS\Services\Tcpip\Parameters\”

· Rebooted several times in the process and even disabled the HBAs in Device Manager to completely cut off from the storage.

At this point we decided to bump the cluster debug log level to 5, reproduce the problem and hope we get more details than with default logging:

· Stopped the cluster service
· Rename the “%SystemRoot%\Cluster\reports” directory to “reports_old”
· Created a new reports directory
· Started back the cluster service
· Enabled log level 5 by cluster . /prop clusterloglevel=5
· Added a node by cluster . /add /node:”nodename”
· Generate the cluster log with cluster log /g after add node timed out.

What we saw was that the counter went up to 100% and remained stuck at this last phase, which was somewhat expected…important was to catch the timeout in more detailed logs:

“Waiting for notification that node “nodename” is a fully functional member of the cluster”
“This phase has failed for cluster object “nodename” with an error status of 1460 (timeout)".

ERR [IM] Unable to find adapter[….] DBG [HM] Connection attempt to node failed with WSAEADDRNOTAVAIL(10049): Failed to connect to remote endpoint

This information didn’t bring any groundbreaking clues to the case, so we decided to sit back for a minute, recap everything we did and knew so far and try to find the “missing link”:

· The customer had other Windows Server 2008 x64 SP2 clusters in production with no issues
· They used storage from the same IPV6 is enabled by default on every adapter in Windows Server 2008 but the customer decided to “unclick” it because they didn’t really have the infrastructure ready for it
· IPv4 IPs were static on both nodes, gateway and DNS servers manually assigned
· Everything looked fine and yet…the NetFT adapter seemed to have a problem.

The fact that they used static configuration coupled with the tendency to disable features that are not implemented or needed, led us to check the DHCP Client service….well, it was actually stopped and disabled! We also took a look at the TCP/IP parameters in the registry and noticed that IPv6 was also (globally) disabled.
We used the information in KB929852, removed the DisabledComponents in HKLM\SYSTEM\CCS\Services\Tcpip\Parameters\and rebooted the servers. Previously we manually started the DHCP Client service as well as changed the start type to automatic and as soon as the nodes were up and running, we initiated another node join and the other node joined successfully in less than 10 seconds!

The customer was thrilled to have his cluster ready in time for production go-live and we were happy to have played the decisive role in reaching the deadline.

As always feel free to comment/rate this article if you think this information was helpful to you as well.

Bogdan Palos
- Technical Lead / Enterprise Platforms Support (Core)