Why does ReplDiag.exe error out with the message that the topology isn’t stable? [Part 3 of 4]

Hi, Rob here, fresh for 2011. Apologies for the late post; it’s a new year, I’m coming off vacation, and I’m getting back into the swing of things. Hey, it’s CES this week too, so things are busy and we’re all dreaming about the new gadgets! Be sure to check out some of our announcements here. Alright, let’s start off by posting the 3rd part in the 4-part series. Last month we looked at cleaning lingering objects across the entire forest. But wait! What if you didn’t get that far? What if the topology was reported to be unstable? What now? Contrary to what the error might suggest, not every topology with lingering objects in it is unstable. But unstable topologies do happen, so let’s take a look at the definition first.

Unstable vs. stable

Dictionary definition: sta·ble (there are many meanings, some of which I’ve excluded purely for comedic reasons, as in “stable administrator” :) )

–adjective, -bler, -blest.

1. not likely to fall or give way, as a structure, support, foundation, etc.; firm; steady.

2. able or likely to continue or last; firmly established; enduring or permanent: a stable government.

3. resistant to sudden change or deterioration: A stable economy is the aim of every government.

4. steadfast; not wavering or changeable, as in character or purpose; dependable.

If we were to put a percentage on the number of stable environments out there, I’d say 90% of them are stable in my experience, but what does ReplDiag actually look for?

When we talk about an environment that is stable, what we are looking for is one where replication of an object from any DC to any other DC that may host the object (including when it is resident in a read-only Global Catalog partition) can occur within the tombstone lifetime (TSL). There are several broken-replication scenarios which may cause this degraded state to occur. As we move into the discussion, keep in mind that all replication is pull-based and the topology is built on a per-Naming Context (NC) basis. So when we talk about stability here, one NC could be stable (e.g. the Europe domain) and another could be unstable (e.g. the North America domain).
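
If you want to survey that picture yourself, a couple of built-in repadmin commands give a quick (if rough) view of per-DC, per-NC replication health. This is just a sketch; the output file name is a placeholder.

    rem Summarize the largest replication deltas and failure counts for every DC in the forest.
    repadmin /replsummary

    rem Dump every inbound replication partner, per naming context, for every DC into a CSV for review.
    repadmin /showrepl * /csv > showrepl-all.csv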

As a result of these concerns, the topology has to be stabilized and time given to allow replication to converge. If this doesn’t happen, there are 3 consequences: current replication issues will continue to cause inconsistent views of the directory, new lingering objects will continue to be generated, and we can’t validate good versus bad data. Thus, cleaning existing lingering objects is an effort in futility until the replication is fixed so that notifications of deletions can converge across the DCs going forward.
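
As an illustration of why the order matters, repadmin can inventory lingering objects in advisory mode without deleting anything, but the results only mean something once every DC is replicating again. The DC name, reference DSA GUID, and naming context below are placeholders you would substitute with your own values.

    rem Advisory mode: log lingering objects on DC1 relative to a known-good reference DC, without removing anything.
    rem <reference-dc-dsa-guid> is the DSA object GUID of the reference DC (repadmin /showrepl prints it).
    repadmin /removelingeringobjects DC1.contoso.com <reference-dc-dsa-guid> "DC=contoso,DC=com" /advisory_mode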

· Scenario 1 – The DC has no inbound replication connections for a given NC. What this means is that the DC has no peers to pull its updates from, so it will receive no new objects, no updates to existing objects, and no deletions of objects. This has to be fixed by homing the administration tools (i.e. Sites and Services MSC) to the DC and adding a connection to another DC.
Note: This is probably the second most common problem scenario, behind no replication across site boundaries. It usually happens when there is one instance of a partition in a site and site connectivity isn’t set up properly.
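
A quick, hedged way to confirm this scenario from the command line (the DC name is a placeholder): list the DC’s inbound partners and its connection objects, and ask the KCC to recalculate, before falling back to a manual connection in Sites and Services.

    rem Show inbound replication partners, per naming context, for the suspect DC.
    repadmin /showrepl DC1.contoso.com

    rem Show the connection objects that exist for that DC (KCC-generated or manual).
    repadmin /showconn DC1.contoso.com

    rem Ask the KCC on that DC to recalculate the replication topology.
    repadmin /kcc DC1.contoso.com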

· Scenario 2 – There are no outbound connections for a given NC. This means that no other DC in the environment is pulling changes from said DC. This has to be fixed by homing the administration tools to ANY other DC in the environment that is replicating properly and setting up a connection to said DC. As before, this usually happens where there is one instance of the partition in a site and site connectivity isn’t set up properly.
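
Because all replication is pull-based, the easiest hedged check for this scenario is from the other direction: sweep everyone’s inbound partner lists and see whether any DC lists the suspect DC as a source. Server and file names are placeholders.

    rem Dump every DC's inbound partners to CSV, then look for the suspect DC appearing as a source.
    repadmin /showrepl * /csv > showrepl-all.csv
    findstr /i "DC1" showrepl-all.csv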

· Scenario 3 – There is no inbound replication into a site for a given NC. This is very similar to Scenario 1, with the exception that if there are multiple DCs in a site they may be replicating with each other for a given partition, but no DC in the site is pulling that data from any DC outside the site. The fix is very similar to the fix for Scenario 1, except that any DC in said site can be given a connection to a DC outside the site.
Note: This is probably the most common trouble scenario, in part because it subsumes Scenario 1 whenever there is only one DC in the site. It is entirely due to site connectivity configuration issues.
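
One hedged way to see whether a site is actually pulling data from the rest of the forest is to look at its bridgehead servers and at a DC inside the suspect site; the server name below is a placeholder.

    rem List the bridgehead servers (the DCs that perform inter-site replication) and their status.
    repadmin /bridgeheads

    rem Check a DC in the isolated site for inbound partners outside its own site.
    repadmin /showrepl DC2.contoso.com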

· Scenario 4 – There are no outbound connections from a site for a given NC. Just like Scenario 3, all DCs in a site may be replicating with each other, but none of that data is being shared with DCs outside of the site. The fix is the same as for Scenario 2.

· Scenario 5 – No writeable instances of a partition exist. This can happen when a domain or application partition is deleted but the change to the replication topology never converges, so one or more Global Catalogs are still advertising the partition and the data within it. Implicitly, this means the configuration partition does not have a stable replication topology, and the orphaned partition is a victim of that instability. To fix it, investigate Scenarios 1 through 4 for the configuration partition only and allow replication to converge.
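
A hedged follow-up check, assuming dcdiag is available in your environment: the CrossRefValidation test looks for cross-reference objects that are invalid, which is the kind of debris a deleted-but-never-converged partition leaves behind.

    rem Validate the cross-reference objects that define which partitions exist in the forest.
    dcdiag /test:CrossRefValidation /v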

· Scenario 6 – While a certain DC may have connections to peer DCs, if none of those connections have ever completed successfully, the partition may be in an inconsistent state. In this scenario, we don’t have data readily available (without digging deep into the metadata) to determine when the partition was last synchronized. Regardless of whether or not this is a new problem, it is flagged as stability-impacting because the partition is currently in a degraded state and needs to be reviewed.
Fix: This one is a little more complex, as the reason all the connections are reporting as failed needs to be investigated. Oftentimes this is related to firewall rules not being configured properly, but it could also simply be that this is a newly introduced DC that has not fully replicated yet due to database size and/or network bandwidth.
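
To see whether a partner has ever completed an inbound sync, the checks below (the DC name is a placeholder) show per-partner last-attempt and last-success information and run the replication tests remotely.

    rem Per-partner inbound status, including the time of the last successful sync for each naming context.
    repadmin /showrepl DC1.contoso.com /verbose

    rem Run dcdiag's replication tests against the DC remotely.
    dcdiag /s:DC1.contoso.com /test:Replications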

· Scenario 7 – NC never completed outbound. This is very similar to Scenario 6, except that no other DC has been able to pull data from said source DC. This is usually related to firewalls.

· Scenario 8 – Server inaccessible. If a box is down, it isn’t replicating. While that alone is not generally a problem, it has a major impact when it comes to cleaning lingering objects. Reviewing the strategy on Glenn’s blog, a comparison of all systems in the forest is necessary to ensure the infrastructure is as clean as possible. If a box is offline, it cannot be compared to its peers, and thus lingering objects may be left in the forest. For this reason, cleaning of lingering objects is blocked until all boxes can be contacted.
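
A quick test for this last scenario (the server name is a placeholder): verify the directory service on the box will even answer an RPC bind before worrying about its replication state.

    rem Attempt an RPC bind to the directory service on the suspect DC;
    rem a failure here means the box is effectively down as far as replication is concerned.
    repadmin /bind DC3.contoso.com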

So that about wraps things up… Looking forward to the final post (later this month) in the series, Part 4 of the ReplDiag breakdown, “Can I clean one partition at a time with Repldiag, and other tips…”. Also, if you feel I have missed anything, please let me know.

One big ask: I’d also like to know how you’ve used the tool, your successes with it, and any other experiences or feedback on Repldiag. Suggestions are always welcome, but those should be sent to me or directly to Ken.