Head node reports itself as Unreachable

I find that a lot of people are starting to build small-scale demonstration clusters with some sort of minimal ‘enterprise’ network. In my case the enterprise network is my home router and the private network is a little 5-port GigE switch.

To make it a bit more realistic, and to be able to test configurations where the head node is _not_ the Active Directory domain controller, I created a domain controller (DC) for my test network. I then added the cluster head node to this domain and installed HPC Pack 2008. The setup worked like a champ.

Then, about a month later, I was trying to run some tests when I noticed that the head node reported itself as ‘Unreachable’. How can this be, I thought? All its networks were active, and it could ping itself, the router, and the other private network nodes. Finally I stumbled onto trying to ping the domain controller. Surprise! No response from the DC. It seems that the 5-year-old NIC on the domain controller had crossed the digital divide to the land of failed hardware.

Replacing the NIC and installing a valid 64-bit driver for it brought the DC back onto the network, and in next to no time the head node self-reported as reachable.

When a domain-joined system boots, it tries to contact its domain controller. If it can’t, it will still come up and allow console logins with cached credentials, but it will periodically look for a domain controller connection so that it can re-validate its domain membership.

In my case the broken NIC prevented DC access, but the cached credentials allowed console logins.
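
Looking back, a quick scripted check of every machine the head node depends on, the domain controller included, would have shortened the hunt. Here is a minimal sketch in Python, assuming you are happy to test reachability with a TCP connection rather than ping; that is a swapped-in technique, not what I did at the time. The port choice (389, LDAP, which a domain controller listens on) and the helper names can_connect, check_hosts and unreachable are my own illustrations, not part of HPC Pack.

```python
import socket


def can_connect(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_hosts(targets, probe=can_connect):
    """Probe each labelled (host, port) target and return {label: reachable}."""
    return {label: probe(host, port) for label, (host, port) in targets.items()}


def unreachable(results):
    """List the labels whose probe failed, in their original order."""
    return [label for label, ok in results.items() if not ok]


# Example call (addresses are hypothetical, substitute your own):
#   results = check_hosts({
#       "router": ("192.168.1.1", 80),
#       "domain controller (LDAP)": ("192.168.1.10", 389),
#   })
#   print(unreachable(results))
```

If the domain controller shows up in the unreachable list while everything else answers, you are probably looking at the same kind of problem I had. A failed probe only says that the port did not answer from this machine; it does not say whether the cause is a dead NIC, a firewall rule, or a stopped service.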

I have seen similar Unreachable situations when a compute node cannot reach its domain controller. If you log on to a node immediately after it has rebooted and use the Event Viewer to look at the Windows Logs -> Security events, you will see numerous Logon and Special Logon events. This is how the node establishes itself as a member of the cluster, and if these logons fail, the node will appear as Unreachable.
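
If you would rather pull those events from a command prompt than click through the Event Viewer, the built-in wevtutil tool can query the Security log. The sketch below only builds and runs such a query. The event IDs are my assumption about what sits behind those entries (4624 for a successful logon, 4625 for a failed logon, 4672 for a Special Logon), and the function names are made up for this example. Reading the Security log needs an elevated prompt.

```python
import subprocess

# Assumed event IDs, not confirmed for any particular cluster:
# 4624 = logon succeeded, 4625 = logon failed, 4672 = Special Logon.
LOGON_EVENT_IDS = (4624, 4625, 4672)


def build_query(event_ids=LOGON_EVENT_IDS):
    """Build an XPath filter that matches any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clause})]]"


def build_command(count=20, event_ids=LOGON_EVENT_IDS):
    """Build the wevtutil argument list for the newest `count` matching events."""
    return [
        "wevtutil", "qe", "Security",
        f"/q:{build_query(event_ids)}",
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # reverse direction: newest events first
        "/f:text",       # plain-text output instead of XML
    ]


def recent_logon_events(count=20):
    """Run the query and return its text output (Windows only, run elevated)."""
    done = subprocess.run(build_command(count), capture_output=True,
                          text=True, check=True)
    return done.stdout
```

Calling recent_logon_events() on a freshly rebooted node should show roughly the same entries as the Event Viewer. A run of failed logons right after boot would point back at the domain controller.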