Repadmin Error: While accessing the hard disk, a disk operation failed even after retries

I ran into an interesting replication issue a few days ago, and I thought it would be a good idea to post the issue along with the fix. I was running repadmin /showreps and found that a domain controller was failing to replicate with a few of its replication partners.

Issue:

Early in troubleshooting I found that DC01 was failing to pull all but 2 partitions from DC02. Even more interesting, DC01 was pulling 15 out of the 17 partitions successfully from DC02. When I ran repadmin /showreps dc01, I got the following error:

DsReplicaSync() failed with status 1127 (0x467):

While accessing the hard disk, a disk operation failed even after retries.
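
As a quick sanity check, the decimal status and the hex value in that line are the same number, and 1127 is the Win32 error code ERROR_DISK_OPERATION_FAILED. Here is a minimal Python sketch that pulls both values out of the line, assuming you have the repadmin output as text (the function name parse_status is made up for illustration):

```python
import re

# The status line exactly as repadmin printed it in this post.
STATUS_LINE = "DsReplicaSync() failed with status 1127 (0x467):"

def parse_status(line):
    """Return (decimal_code, hex_code_as_int) from a repadmin status line,
    or None if the line does not match the expected shape."""
    match = re.search(r"failed with status (\d+) \((0x[0-9a-fA-F]+)\)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16)

decimal_code, hex_code = parse_status(STATUS_LINE)
# Both fields should describe the same Win32 error code.
print(decimal_code, hex(hex_code), decimal_code == hex_code)
```

Having the numeric code on hand makes it easier to search for the error independently of the message text.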

Here is the breakdown from the repadmin /showreps output:

DC01 pulling from DC02 (intra-site partners): failing to replicate all but 2 partitions.

DC02 pulling from DC01 (intra-site partners): replicating all partitions.

DC01 pulling from DC03 (inter-site partners): failing to replicate all but 1 partition.

DC03 pulling from DC01 (inter-site partners): replicating all partitions.

Because the error pointed at the hard disk, I checked the hard drives and the array on DC01 for signs of a drive failure. The Dell OpenManage console showed no indication of one.

I checked the event log on DC01 and I found the following error:

Log Name: Directory Service

Source: Microsoft-Windows-ActiveDirectory_DomainService

Event ID: 1084

Task Category: Replication

Computer: DC01

Description:

Internal event: Active Directory could not update the following object with changes received from the following source domain controller. This is because an error occurred during the application of the changes to Active Directory on the domain controller.

Object: distinguished_name_path_of_object_that_failed_to_write_to_local_database

Object GUID: 32_character_alpha-numeric_object_GUID

Source domain controller: object_GUID_for_source_domain_controller's_NTDSDSA_object._msdcs

I did a Bing search and found the following article: http://support.microsoft.com/kb/837932. I chose option 2, which was to uncheck the Global Catalog option on DC01.

Action:

  • I unchecked the Global Catalog option in “AD Sites and Services” on DC01. Depending on the size of the environment, this can take a while because the DC has to remove all the GC partitions.
  • After an hour I checked replication and I was still getting the original error:

DsReplicaSync() failed with status 1127 (0x467):

While accessing the hard disk, a disk operation failed even after retries.

  • Sometimes the GUI gives a different error message, so I opened “AD Sites and Services”, right-clicked the connection object between DC01 and DC02, and selected Replicate Now. I got the following error:

The following error occurred during the attempt to synchronize naming context fqdn (the same partition named in the earlier event ID 1084) from Domain Controller DC01 to DC02: The naming context is in the process of being removed or is not replicated from the specified server.

I checked the event log on DC01 and found that it was unable to remove the objects in the GC partitions. I was getting this error:

Log Name: Directory Service

Source: Microsoft-Windows-ActiveDirectory_DomainService

Event ID: 1661

Task Category: Replication

Computer: DC01

Description:

Active Directory Domain Services did not remove objects of the following partition from the local Active Directory Domain Services database.

Directory partition:

DC=local,DC=domain,DC=com (the error was logged for all the GC partitions)

At this point my only option was to run an offline defrag of ntds.dit, following this article: http://technet.microsoft.com/en-us/library/cc794920(WS.10).aspx
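
For orientation, the offline defrag boils down to a short, ordered list of commands. The Python sketch below only assembles and prints those command lines for review. It does not run anything. The database folder and the scratch folder are assumptions for illustration, and the linked article remains the authoritative procedure (including taking a backup first).

```python
# Assumed paths: adjust both before using any of these commands.
NTDS_DIR = r"C:\Windows\NTDS"   # default database folder (assumption)
COMPACT_DIR = r"C:\compact"     # hypothetical scratch folder for the compacted copy

def offline_defrag_steps(ntds_dir=NTDS_DIR, compact_dir=COMPACT_DIR):
    """Return the command lines, in order, for an offline defrag of ntds.dit."""
    return [
        # Stop AD DS so the database file is no longer in use.
        "net stop ntds",
        # Write a compacted copy of the database to the scratch folder.
        f'ntdsutil "activate instance ntds" files "compact to {compact_dir}" quit quit',
        # Remove the old transaction logs, then put the compacted file in place.
        f'del "{ntds_dir}\\*.log"',
        f'copy /y "{compact_dir}\\ntds.dit" "{ntds_dir}\\ntds.dit"',
        # Check the integrity of the new file before starting the service.
        'ntdsutil "activate instance ntds" files integrity quit quit',
        "net start ntds",
    ]

for step in offline_defrag_steps():
    print(step)
```

Printing the steps first makes it easy to double-check the paths before typing anything on the domain controller.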

After I restarted the ntds service there were no errors when the database mounted.

  • To save time I manually replicated all partitions.
  • Command: repadmin /replicate <dest-dc01> <source-dc02> <full DN path of partition>
  • The DC started replicating all partitions with all its partners.
  • I verified that ActiveDirectory_DomainService event ID 1660 (“The removal of the following partition from the local Active Directory Domain Services database completed.”) was recorded for all the GC partitions.
  • I selected the Global Catalog option in “AD Sites and Services” on the NTDS Settings.
  • After an hour or so the DC was a Global Catalog again and was replicating successfully with its partners.
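
With 17 partitions, typing the manual replication command from the steps above for each one gets tedious. Here is a small Python sketch that generates one repadmin /replicate line per partition, in the same destination, source, naming-context order used above. The partition DNs are hypothetical placeholders, and the helper name build_replicate_commands is made up for illustration.

```python
def build_replicate_commands(dest_dc, source_dc, partitions):
    """Build one 'repadmin /replicate' command line per partition DN."""
    return [
        f'repadmin /replicate {dest_dc} {source_dc} "{dn}"'
        for dn in partitions
    ]

# Hypothetical partition DNs; substitute the naming contexts from your forest.
PARTITIONS = [
    "DC=domain,DC=com",
    "CN=Configuration,DC=domain,DC=com",
    "CN=Schema,CN=Configuration,DC=domain,DC=com",
]

for command in build_replicate_commands("DC01", "DC02", PARTITIONS):
    print(command)
```

The output can be pasted into a command prompt or saved as a batch file, which keeps the destination and source order consistent across every partition.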

Additional information:

I later found that there was a hard drive failure on DC01 about a month prior and the drive was replaced. The last successful replication occurred just before the hard drive crashed. It appears this issue was caused by a disk failure that corrupted a section in the ntds.dit.

Links:

Event ID 2108 and Event ID 1084 occur during inbound replication of Active Directory in Windows 2000 Server and in Windows Server 2003

http://support.microsoft.com/kb/837932

Compact the Directory Database File (Offline Defragmentation)

http://technet.microsoft.com/en-us/library/cc794920(WS.10).aspx

Repadmin Syntax

http://technet.microsoft.com/en-us/library/cc770963(WS.10).aspx

Virus scanning recommendations for Enterprise computers that are running currently supported versions of Windows

http://support.microsoft.com/kb/822158