Troubleshoot cluster issue with Event ID 1135

This article helps you diagnose and resolve Event ID 1135, which may be logged during the startup of the Cluster service in a failover clustering environment.

Applies to:   Windows Server 2022, Windows Server 2019, Windows Server 2016, Azure Stack HCI, versions 21H2 and 20H2



Event ID 1135 indicates that one or more cluster nodes were removed from the active failover cluster membership.

Running the validation and network tests is recommended as one of the initial troubleshooting steps, to ensure there are no configuration issues that might be causing the problem.

The Cluster service is the essential software component that controls all aspects of failover cluster operation and manages the cluster configuration database. If you see Event ID 1135, we recommend that you install the fixes mentioned in the following articles, reboot all the nodes of the cluster, and then observe whether the issue recurs.

Check if the Cluster service is running on all the nodes

Run the following command according to your Windows operating system to validate that the Cluster service is continuously running and available.

For Windows Server 2008 R2 cluster

From an elevated command prompt, run cluster.exe node /status.

For Windows Server 2012 and Windows Server 2012 R2 cluster

Run the following PowerShell cmdlet: Get-ClusterNode
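
For example, here's a minimal check you can run from any node (NODEA is a placeholder for one of your node names). Get-ClusterNode should report every node as Up, and the service query confirms that the Cluster service (ClusSvc) itself is running:

Get-ClusterNode | Format-Table Name, State
Get-Service -Name ClusSvc -ComputerName NODEA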

Verify that the Cluster service is continuously running and available on all the nodes before you continue.

Several scenarios of Event ID 1135

Take a closer look at the System event logs on all the nodes of your cluster. Review the Event ID 1135 entries that you're seeing on the nodes and copy all the instances of this event so that you can conveniently review them together.

Event ID 1135
Cluster node ' **NODE A** ' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. 
This could also be due to the node having lost communication with other active nodes in the failover cluster. 
Run the Validate a Configuration wizard to check your network configuration. 
If the condition persists, check for hardware or software errors related to the network adapters on this node. 
Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

There are three typical scenarios:

Scenario A

You're looking at all the events, and all the nodes in the cluster indicate that NODE A lost communication.

Diagram showing Node A, Node B, and Node C communicating successfully.

Diagram showing that Node A has lost communication with Node B and Node C.

When you look at the system logs on NODE A, you may see events for all the remaining nodes in the cluster.

Solution

This suggests that, at the time of the issue, communication to NODE A was lost, whether because of network congestion or another cause.

Review and validate the network configuration and communication issues, paying particular attention to issues pertaining to Node A.

Scenario B

You're looking at the events on the nodes, and your cluster is dispersed across two sites: NODE A, NODE B, and NODE C at Site 1, and NODE D and NODE E at Site 2.

Diagram showing that Site 1 is communicating successfully with Site 2 over a WAN Link.

On Nodes A, B, and C, you see that the events that are logged are for connectivity to Nodes D and E. Similarly, the events on Nodes D and E suggest that communication with Nodes A, B, and C was lost.

Diagram showing that Site 1 has lost the WAN Link connection with Site 2.

Solution

If you see similar activity, it indicates a communication failure over the link that connects the sites. We recommend that you review the connection across the sites; if it runs over a WAN connection, verify the connectivity with your ISP.

Scenario C

You're looking at the events on the nodes, and the names of the nodes don't follow any particular pattern. Let us say that your cluster is dispersed across two sites: NODE A, NODE B, and NODE C at Site 1, and NODE D and NODE E at Site 2.

  • On Node A: You see events for Nodes B, D, E.
  • On Node B: You see events for Nodes C, D, E.
  • On Node C: You see events for Nodes A, B, E.
  • On Node D: You see events for Nodes A, C, E.
  • On Node E: You see events for Nodes B, C, D.
  • Or any other combinations.

Diagram of Scenario C showing that your cluster is dispersed across two sites.

Solution

Such events are possible when the network channels between the nodes are choked and the cluster communication messages don't arrive in a timely manner. This makes the cluster conclude that communication between the nodes has been lost, resulting in the removal of nodes from the cluster membership.

Review Cluster Networks

We recommend that you review your cluster networks by checking the following three options one by one to continue this troubleshooting guide.

Check for Antivirus Exclusion

Exclude the following file system locations from virus scanning on a server that is running Cluster Services:

  • The path of the FileShare Witness
  • The %Systemroot%\Cluster folder

Configure the real-time scanning component within your antivirus software to exclude the following directories and files:

  • Default virtual machine configuration directory (C:\ProgramData\Microsoft\Windows\Hyper-V)

  • Custom virtual machine configuration directories

  • Default virtual hard disk drive directory (C:\Users\Public\Documents\Hyper-V\Virtual Hard Disks)

  • Custom virtual hard disk drive directories

  • Custom replication data directories, if you're using Hyper-V Replica

  • Snapshot directories

• Vmms.exe

    Note

    This file may have to be configured as a process exclusion within the antivirus software.

  • Vmwp.exe

    Note

    This file may have to be configured as a process exclusion within the antivirus software.

Additionally, when you use Live Migration together with Cluster Shared Volumes, exclude the CSV path C:\ClusterStorage and all its subdirectories.

If you're troubleshooting failover issues or general problems with Cluster services and antivirus software is installed, temporarily uninstall the antivirus software or check with the manufacturer of the software to determine whether the antivirus software works with Cluster services. Just disabling the antivirus software is insufficient in most cases; even if you disable the antivirus software, its filter driver is still loaded when you restart the computer.
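
If you're running Microsoft Defender Antivirus, these exclusions can be scripted; here's a minimal sketch using the Add-MpPreference cmdlet (the paths are the defaults listed above; add your custom directories as needed):

Add-MpPreference -ExclusionPath "C:\ProgramData\Microsoft\Windows\Hyper-V"
Add-MpPreference -ExclusionPath "C:\Users\Public\Documents\Hyper-V\Virtual Hard Disks"
Add-MpPreference -ExclusionPath "C:\ClusterStorage"
Add-MpPreference -ExclusionPath "$env:SystemRoot\Cluster"
Add-MpPreference -ExclusionProcess "Vmms.exe", "Vmwp.exe"

For third-party antivirus software, configure the equivalent exclusions through the vendor's management console.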

Check the network port configuration in the firewall

The Cluster service controls server cluster operations and manages the cluster database. A cluster is a collection of independent computers that act as a single computer. Managers, programmers, and users see the cluster as a single system. The software distributes data among the nodes of the cluster. If a node fails, other nodes provide the services and data that were formerly provided by the missing node. When a node is added or repaired, the cluster software migrates some data to that node.

System service name: ClusSvc

Application                           Protocol   Ports
Cluster Service                       UDP        3343
Cluster Service                       TCP        3343 (This port is required during a node join operation.)
RPC                                   TCP        135
Cluster Admin                         UDP        137
Kerberos                              UDP/TCP    464*
SMB                                   TCP        445
Randomly allocated high UDP ports**   UDP        Random port number between 1024 and 65535
                                                 Random port number between 49152 and 65535***

Note

Additionally, for successful validation of Windows failover clusters on Windows Server 2008 and later, allow inbound and outbound traffic for ICMPv4 and ICMPv6.

*** This is the range in Windows Server 2012, Windows 8, Windows Server 2008 R2, Windows 7, Windows Server 2008, and Windows Vista.

In addition, run the following command to check the network port configuration in the firewall. For example, this command helps determine whether port 3343, which is used by failover clustering, is available and open:

netsh advfirewall firewall show rule name="Failover Clusters (UDP-In)" verbose
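
If you prefer PowerShell, here's a rough equivalent (the rule display name matches the one in the netsh command above; NODEA is a placeholder, and note that Test-NetConnection exercises TCP only, so it can't probe the UDP listener):

Get-NetFirewallRule -DisplayName "Failover Clusters (UDP-In)" | Format-List DisplayName, Enabled, Direction, Action
Test-NetConnection -ComputerName NODEA -Port 3343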

Run the Cluster Validation report for any errors or warnings

The cluster validation tool runs a suite of tests to verify that your hardware and settings are compatible with failover clustering.

Follow these instructions:

  1. Run the Cluster Validation report for any errors or warnings.

    Screenshot of the results after running the Cluster Validation report for any errors or warnings.

  2. Verify for warnings and errors for Networks. For more information, see Understanding Cluster Validation Tests: Network.

    Screenshot of Results by Category.

    Screenshot of Validate Windows Firewall Configuration under Network.
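
You can also run just the network tests from PowerShell instead of the wizard; a minimal sketch, assuming the Failover Clustering PowerShell module is installed (node names are placeholders):

Test-Cluster -Node NODEA, NODEB -Include "Network"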

Check the List Network Binding Order

This test lists the order in which networks are bound to the adapters on each node.

The Adapters and Bindings tab lists the connections in the order in which the connections are accessed by network services. The order of these connections reflects the order in which generic TCP/IP calls/packets are sent on to the wire.

Follow these steps to change the binding order of network adapters:

  1. Select Start, select Run, type ncpa.cpl, and then select OK. You can see the available connections in the LAN and High-Speed Internet section of the Network Connections window.
  2. On the Advanced menu, select Advanced Settings, and then select the Adapters and Bindings tab.
  3. In the Connections area, select the connection that you want to move higher in the list. Use the arrow buttons to move the connection. As a general rule, the card that talks to the network (domain connectivity, routing to other networks, and so on) should be the first bound card (at the top of the list).

Cluster nodes are multi-homed systems. Network priority affects DNS Client for outbound network connectivity. Network adapters used for client communication should be at the top in the binding order. Non-routed networks can be placed at lower priority. In Windows Server 2012 and Windows Server 2012 R2, the Cluster Network Driver (NETFT.SYS) adapter is automatically placed at the bottom in the binding order list.
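
On Windows Server 2016 and later, network priority is determined primarily by interface metrics rather than the binding order; here's a quick, hedged way to review them from PowerShell (a lower metric means higher priority):

Get-NetIPInterface | Sort-Object InterfaceMetric | Format-Table ifIndex, InterfaceAlias, AddressFamily, InterfaceMetric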

Check the Validate Network Communication

Latency on your network could also cause this to happen. The packets may not be lost between the nodes, but they may not get to the nodes fast enough before the timeout period expires.

This test validates that tested servers can communicate with acceptable latency on all networks.

For example: Under Validate Network Communication, you may see the following messages for network latency issues:

Succeeded in pinging network interface node003.contoso.com IP Address 192.168.0.2 from network interface node004.contoso.com IP Address 192.168.0.3 with maximum delay 500 after 1 attempt(s).
Either address 10.0.0.96 is not reachable from 192.168.0.2 or **the ping latency is greater than the maximum allowed 2000 ms** 
This may be expected, since network interfaces node003.contoso.com - Heartbeat Network and node004.contoso.com - Production Network are on different cluster networks
Either address 192.168.0.2 is not reachable from 10.0.0.96 or **the ping latency is greater than the maximum allowed 2000 ms** 
This may be expected, since network interfaces node004.contoso.com - Production Network and node003.contoso.com - Heartbeat Network for MSCS are on different cluster networks

For a multi-site cluster, you may need to increase the time-out values. For more information, see Configure Heartbeat and DNS Settings in a Multi-Site Failover Cluster.
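
For example, here's a sketch of reviewing and raising the heartbeat settings from PowerShell (these are the standard cluster common properties; the values shown are illustrative, not recommendations):

Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
(Get-Cluster).CrossSubnetDelay = 2000      # milliseconds between heartbeats sent across sites
(Get-Cluster).CrossSubnetThreshold = 20    # missed heartbeats tolerated before a node is removed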

Check with your ISP for any WAN connectivity issues.

Check if you encounter any of the following issues.

Network packets lost between nodes
  1. Check packet loss by using Performance Monitor

    If packets are lost on the wire somewhere between the nodes, the heartbeats will fail. You can easily find out whether this is a problem by using Performance Monitor to look at the "Network Interface\Packets Received Discarded" counter. After you add this counter, look at the Average, Minimum, and Maximum values; if any of them is higher than zero, the receive buffer for the adapter needs to be adjusted up. (A PowerShell equivalent is sketched after this list.)

    Screenshot of the Add Counters window.

    If you're experiencing network packet loss on the VMware virtualization platform, see the "Cluster installed in the VMware virtualization platform" section.

  2. Upgrade the NIC drivers

    This issue can occur because of outdated NIC drivers, Integration Components (ICs), or VMware Tools, or because of faulty NIC adapters. If network packets are lost between nodes on physical machines, update your network adapter drivers and firmware; old or out-of-date network card drivers and firmware are a common cause. At times, a simple misconfiguration of the network card or switch can also cause loss of heartbeats.
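
As noted in step 1, you can also sample the discarded-packets counter from PowerShell; a minimal sketch (the interval and sample count are arbitrary; any nonzero values in the output suggest the receive buffer needs tuning):

Get-Counter -Counter "\Network Interface(*)\Packets Received Discarded" -SampleInterval 5 -MaxSamples 12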

Cluster installed in the VMware virtualization platform

In a VMware environment, verify whether VMware adapter issues are the cause.

This issue may occur if the packets are dropped during high traffic bursts. Ensure that no traffic filtering is occurring (for example, with a mail filter). After eliminating this possibility, gradually increase the number of buffers in the guest operating system and verify whether the issue is resolved.

To reduce burst traffic drops, follow these steps:

  1. Select Start, select Run, type devmgmt.msc and press Enter.
  2. Expand Network adapters, right-click vmxnet3 and select Properties.
  3. Select the Advanced tab.
  4. Select Small Rx Buffers and increase the value. The default value is 512 and the maximum is 8192.
  5. Select Rx Ring #1 Size and increase the value. The default value is 1024 and the maximum is 4096.
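
The same vmxnet3 settings can usually be read and changed through the NetAdapter cmdlets on Windows Server 2012 and later; a sketch, assuming the adapter is named "Ethernet0" and exposes these advanced properties (the display names mirror the GUI steps above):

Get-NetAdapterAdvancedProperty -Name "Ethernet0"
Set-NetAdapterAdvancedProperty -Name "Ethernet0" -DisplayName "Small Rx Buffers" -DisplayValue 8192
Set-NetAdapterAdvancedProperty -Name "Ethernet0" -DisplayName "Rx Ring #1 Size" -DisplayValue 4096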

Check the following articles to verify VMware adapter issues in a VMware environment:

Notice any network congestion

Network congestion can also cause network connectivity issues.

Verify that your network is configured according to Microsoft and vendor recommendations. For more information, see Configuring Windows Failover Cluster Networks.

Check the network configuration

If the issue persists, check whether you see a partitioned network in the cluster GUI or whether you have NIC teaming enabled on the heartbeat NIC.

If you see a partitioned network in the cluster GUI, see "Partitioned" Cluster Networks to troubleshoot the issue.

If you have NIC teaming enabled on the heartbeat NIC, check the teaming software functionality according to the teaming vendor's recommendations.
