We have two DHCP servers (both Windows Server 2012 R2 Core) in a hot standby failover configuration, on separate subnets/sites connected over MPLS. Most of the time, clients take a long time to receive an IP address. I have narrowed the cause down to something in the failover configuration.
From comparing the DHCP logs of both servers, what I can see happening is the following:
- The client sends a DHCP discover.
- The standby server sees the discover immediately and logs event ID 36, 'packet dropped because of client ID hash mismatch' (as it should, because it is not the active partner). It expects the active partner to assign an IP.
- The active partner logs nothing for this initial discover or for many subsequent ones, but at some point later it finally responds and assigns an IP. The delay can be anything from 5 minutes to 3 hours. At other times it assigns an IP at exactly the same moment the hot standby partner logs its first event ID 36, so there is no delay.
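For anyone who wants to reproduce the comparison, here is a rough sketch of how the two audit logs can be lined up per client. It is only an illustration, under these assumptions: the logs are the standard comma-separated DHCP audit logs with the columns ID, Date, Time, Description, IP Address, Host Name, MAC Address in that order; dates are in MM/DD/YY format; event 36 is the standby drop; and events 10 and 11 (new lease and renewal) are what the active partner logs when it finally answers. The function names are made up for this example.

```python
import csv
from datetime import datetime
from io import StringIO

DROP_ID = 36          # standby partner dropped the packet
LEASE_IDS = {10, 11}  # assumed: 10 = new lease, 11 = renewal


def parse_audit_log(text):
    """Yield (timestamp, event_id, mac) for each event row of an audit log."""
    for row in csv.reader(StringIO(text)):
        # Skip the preamble and header: event rows start with a numeric ID.
        if len(row) < 7 or not row[0].strip().isdigit():
            continue
        try:
            stamp = f"{row[1].strip()} {row[2].strip()}"
            ts = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # not a dated event row
        yield ts, int(row[0]), row[6].strip().upper()


def lease_delays(standby_log, active_log):
    """Per MAC: time from the standby's first drop to the active partner's next lease."""
    first_drop = {}
    for ts, event_id, mac in parse_audit_log(standby_log):
        if event_id == DROP_ID and mac not in first_drop:
            first_drop[mac] = ts

    delays = {}
    # Audit logs are appended in time order, so the first match is the earliest.
    for ts, event_id, mac in parse_audit_log(active_log):
        if (event_id in LEASE_IDS and mac in first_drop
                and mac not in delays and ts >= first_drop[mac]):
            delays[mac] = ts - first_drop[mac]
    return delays
```

Calling lease_delays() with the text of the standby's log and the active partner's log for the same day returns a dict of MAC address to delay. A MAC that shows up in the standby's drops but is missing from the result never got a lease from the active partner within that log.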
The network is configured correctly, with IP helpers pointing to both servers, and I am 99% sure that both servers always receive the DHCP discover even when the active partner ignores it. We have tried recreating the failover configuration. I don't know how to run a packet capture on Core servers to trace any network issues.
We see the same behaviour both ways - going from site 1 to site 2 and from site 2 to site 1 regardless of which lease range it is.
Servers are reconciled and replicated with no issues.
Hoping someone has some suggestions to resolve this!