Troubleshooting AD Replication

Replication is another common AD troubleshooting scenario.

AD replication issues usually turn out to be caused by one of the following:

a) Faulty, substandard or misconfigured network equipment or WAN links
b) USN rollback issues caused by using unsupported restore methods (disk imaging of DCs, P2V utilities, snapshots, etc.)
c) DNS issues
d) Lingering objects

For 'a', the classic examples are VPN accelerators, firewalls that either reject traffic or only let packets of a specific size through, stateful packet inspection on firewalls, etc. A firewall that is 'allowing all traffic through' is still a firewall and can still affect replication.
This includes personal firewalls or network filters installed locally on DCs, and can even include the Windows Firewall service or the ISA Server Firewall Client if it is running on the DC.

For troubleshooting AD replication, Repadmin is the first and best tool to use. RPCPing and PortQry are useful for isolating the problem further, and pinging between the DCs with a specific packet size (1500 bytes, for example) can give further confirmation.
Taking simultaneous network traces on both DCs involved in the replication should also tell you whether a network device in between is causing replication problems.
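As a rough illustration of the kind of reachability check described above, here is a minimal Python sketch that attempts a TCP connection to a handful of ports commonly used by DCs. It is not a replacement for PortQry or RPCPing: it only tests whether a TCP connection opens, says nothing about the dynamic RPC ports that replication negotiates, and the port list and the host name in the usage note are illustrative assumptions.

```python
import socket

# Well-known TCP ports used by domain controllers (illustrative selection,
# not an exhaustive list of what replication needs).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def check_tcp_ports(host, ports, timeout=3.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Running something like check_tcp_ports("dc2.example.com", AD_TCP_PORTS) from DC1 (the host name is hypothetical) and then the reverse from DC2 gives a quick first impression of whether a device in between is blocking traffic in one direction.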

For 'b', the most common problem scenario arises when DCs run virtualized and snapshots, rollbacks or images are used to restore a DC to a previous state.
If the DC in question replicated with other DCs between the time of the snapshot and the present, those DCs will consider themselves to be more up to date than the restored DC.
Any new change or addition you make on the restored DC will not be replicated out, at least not until its local USN climbs past the value its partners have already recorded for it.

In the past, USN rollbacks were hard to detect, since the only indicator was that changes to specific objects weren't replicating from the problem DC.
Since Windows Server 2003 SP1, we have the infamous NTDS Replication event 2095, which indicates that your DC is in USN rollback state.
This is accompanied by the DC pausing the Netlogon service to make sure that no new accounts or changes are made on it. That makes sense, because at this point you should be searching for your install media, running dcpromo /forceremoval on the DC and preparing for metadata cleanup of the DC.

Why is there a difference between 'supported' backup methods and rolling back to an earlier state using snapshots, virtualization, etc.?
The simple version is that when you restore a DC using a supported backup tool, the DC gets a new invocation ID and the old one is retired.
All other DCs will then pick this up and effectively consider the restored DC to be a new replication partner, which causes them to start from the beginning when considering its USN values.
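The difference between the two cases can be illustrated with a deliberately simplified toy model in Python. This is not how AD stores or replicates data: the ToyDC and ToyPartner classes, the invocation IDs "A" and "B" and the user names are all hypothetical. The only idea carried over is that a partner remembers the highest USN it has seen per invocation ID and asks only for changes above that value.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    """Hypothetical stand-in for a DC: an invocation ID plus USN-stamped changes."""
    invocation_id: str
    usn: int = 0
    changes: dict = field(default_factory=dict)  # usn -> description

    def write(self, description):
        self.usn += 1
        self.changes[self.usn] = description


@dataclass
class ToyPartner:
    """Remembers, per invocation ID, the highest USN already received."""
    watermark: dict = field(default_factory=dict)
    received: set = field(default_factory=set)

    def pull_from(self, source):
        seen = self.watermark.get(source.invocation_id, 0)
        for usn in sorted(source.changes):
            if usn > seen:  # only changes above the remembered USN are requested
                self.received.add(source.changes[usn])
                self.watermark[source.invocation_id] = usn


def _setup():
    dc1 = ToyDC(invocation_id="A")
    for name in ("user1", "user2", "user3"):
        dc1.write(name)
    snapshot = copy.deepcopy(dc1)  # state captured at USN 3
    dc1.write("user4")
    dc1.write("user5")
    partner = ToyPartner()
    partner.pull_from(dc1)  # partner now remembers A -> 5
    return snapshot, partner


def rollback_scenario():
    """Snapshot rollback: same invocation ID, USN counter back at 3."""
    snapshot, partner = _setup()
    rolled_back = copy.deepcopy(snapshot)
    rolled_back.write("new-user")  # stamped USN 4, below the partner's watermark of 5
    partner.pull_from(rolled_back)
    return "new-user" in partner.received


def supported_restore_scenario():
    """Supported restore: the old invocation ID is retired and a new one issued."""
    snapshot, partner = _setup()
    restored = copy.deepcopy(snapshot)
    restored.invocation_id = "B"
    restored.write("new-user")
    partner.pull_from(restored)  # no watermark for "B", so nothing is skipped
    return "new-user" in partner.received
```

In this sketch rollback_scenario() returns False, because the new change is stamped with a USN the partner believes it already has, while supported_restore_scenario() returns True, because the partner has no remembered USN for the new invocation ID.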

For 'c', make sure you're not scavenging the DNS zone that contains the _msdcs container. If you absolutely *must* run scavenging, make sure you limit it to one DC and don't set it to scavenge more often than every 2-3 days. If you're using a non-Microsoft DNS server that doesn't support dynamic registration, make sure all the correct GUID records are in place in _msdcs.

To identify USN Rollback manually, run the following for each Naming Context (the syntax below is for the ForestDnsZones NC):

Repadmin /showutdvec <FQDN of DC1> <Naming Context, e.g. DC=ForestDnsZones,DC=Domain,DC=com>

Repadmin /showutdvec <FQDN of DC2> <Naming Context, e.g. DC=ForestDnsZones,DC=Domain,DC=com>

Look at the USN value listed for each of the DCs. If DC2 has a higher USN value for DC1 than DC1 has for itself, it usually means DC1 is in USN rollback state for that specific naming context.
If DC2 has a lower USN value for DC1, it just means that it hasn't replicated the latest changes from DC1 yet.
Incidentally, restored DCs will show up in the output with a (retired) reference behind them, because they get a new invocation ID.
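The comparison described above can be sketched in Python. The parser assumes that each vector entry in the saved /showutdvec output looks roughly like "Site\DC1 @ USN 12345 @ Time 2010-01-01 10:00:00"; that layout is an assumption based on typical output and should be checked against what your version of Repadmin actually prints. The function names are hypothetical.

```python
import re

# Assumed entry layout: "<site>\<dc>   @ USN   <number> @ Time <timestamp>"
ENTRY = re.compile(r"^\s*(?P<dsa>\S.*?)\s+@ USN\s+(?P<usn>\d+)\s+@ Time")


def parse_utdvec(text):
    """Map each DSA name in saved /showutdvec output to its USN."""
    vector = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            vector[match.group("dsa")] = int(match.group("usn"))
    return vector


def compare_views(dc1_name, dc1_output, dc2_output):
    """Compare DC1's own USN with the USN that DC2 has recorded for DC1."""
    own = parse_utdvec(dc1_output)[dc1_name]
    seen_by_dc2 = parse_utdvec(dc2_output)[dc1_name]
    if seen_by_dc2 > own:
        return "possible USN rollback on DC1"
    if seen_by_dc2 < own:
        return "DC2 has not yet replicated the latest changes from DC1"
    return "DC2 is up to date with DC1"
```

The result is only a hint for one naming context. Repeat the comparison for every naming context, and treat a "possible USN rollback" result as a reason to investigate further rather than as proof.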

A key thing to remember is that all replication is inbound. If you look at two DCs, each one has its own view of the configuration partition for the forest. If replication is working fine, their views should be identical, but to confirm that you should connect to different DCs when running AD Sites & Services to get an opinion from more than one DC.

If you ever get into a situation where you need to manually add a replication connection for a specific naming context to a DC but name resolution isn't working properly, you can use as a hint for repadmin:

Example : repadmin /add DC=ForestDNSZones,DC=forestRoot,DC=com

For 'd', the default settings of 180 days for Tombstone Lifetime and Strict Replication Consistency being turned on have done a lot to improve the situation with lingering objects.
However, these defaults only apply to forests that were created from a Windows Server 2003 server. For forests upgraded from Windows 2000, the defaults of 60 days and Strict Replication Consistency set to Off still apply.

For Windows 2000-upgraded forests, the TL can easily be changed with ADSIEdit. Changing the default for Strict Replication Consistency is more difficult and involves creating the object CN=94fdebc6-8eeb-4640-80de-ec52b9ca17fa,CN=Operations,CN=ForestUpdates,CN=Configuration,DC=<domain>,DC=<com> in the configuration container (compare with the same object on a clean W2k3 forest for more details).
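To avoid typing mistakes when locating these objects in ADSIEdit, the distinguished names can be assembled from the forest root DNS name. The Python sketch below only builds strings and does not touch the directory. The function names are hypothetical, and the location of the tombstoneLifetime attribute (the Directory Service object under CN=Windows NT,CN=Services) is stated from general AD knowledge rather than from the text above.

```python
STRICT_REPLICATION_GUID = "94fdebc6-8eeb-4640-80de-ec52b9ca17fa"


def dns_to_dn(dns_name):
    """Turn 'example.com' into 'DC=example,DC=com'."""
    return ",".join("DC=" + label for label in dns_name.strip(".").split("."))


def strict_replication_marker_dn(forest_root_dns):
    """DN of the operations object named in the text above."""
    return "CN={0},CN=Operations,CN=ForestUpdates,CN=Configuration,{1}".format(
        STRICT_REPLICATION_GUID, dns_to_dn(forest_root_dns)
    )


def tombstone_lifetime_object_dn(forest_root_dns):
    """DN of the object that carries the tombstoneLifetime attribute."""
    return "CN=Directory Service,CN=Windows NT,CN=Services,CN=Configuration,{0}".format(
        dns_to_dn(forest_root_dns)
    )
```

For a forest root of example.com (a placeholder), strict_replication_marker_dn("example.com") yields the DN from the paragraph above with DC=example,DC=com in place of the <domain> and <com> placeholders.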

If you're dealing with lingering objects in your domain, Repadmin /removelingeringobjects is the best way to remove them (if they're truly lingering objects and not orphaned objects). Turning off Strict Replication Consistency to get replication going does not resolve the issue; it makes it worse by spreading the problem further.
See also for a more thorough discussion about lingering object removal.

See also: How to detect and recover from a USN rollback in Windows Server 2003

How to detect and recover from a USN rollback in Windows 2000 Server

Domain controllers do not demote gracefully when you use the Active Directory Installation Wizard to force demotion in Windows Server 2003 and in Windows 2000 Server

How to remove data in Active Directory after an unsuccessful domain controller demotion