What is causing EventID 2153 MSRepl

Question

I was reviewing our Application logs on our 3 Exchange 2016 servers and came across the following error message.

The log copier was unable to communicate with server 'Exchange1.Domain.com'. The copy of database 'MailDB03\Exchange1' is in a disconnected state. The communication error was: An error occurred while communicating with server 'Exchange1'. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. The copier will automatically retry after a short delay.

Our current setup is three Exchange 2016 CU16 servers. Exchange1 and Exchange2 are in our primary datacenter site. Exchange3 is in our backup datacenter site. All databases are active on Exchange1. I don't see the errors in the application logs on either of the two nodes in our primary datacenter, only on Exchange3.

When I run the following commands I get:

Get-MailboxDatabaseCopyStatus *

All 5 databases across the three nodes are healthy. CopyQueueLength and ReplayQueueLength are 0. Occasionally ReplayQueueLength shows 1 on either of the two passive nodes (Exchange2 and Exchange3).

Get-MailboxDatabaseCopyStatus -ConnectionStatus | FT Identity,IncomingLogCopyingNetwork

On Exchange2 this shows:

MailDB01\Exchange2 {Exchange1,MapiDagNetwork} for all five databases

On Exchange3 (DR) it shows:

MailDB01\Exchange3 {Exchange1\MapiDagNetwork, An error occurred while communicating with server 'Exchange1'. Unable to write data to the transport connection: An established connection was aborted by the software in your host machine.}

I get the above message on all 5 databases on Exchange3.

Test-ReplicationHealth passes all tests on all nodes.

I am wondering whether this is related to the fact that Exchange1 and Exchange2 are on the same network while Exchange3 is on a separate network in the backup datacenter. All servers can ping each other.

I am going to do some failover testing tonight to see if there is any impact, but if anyone has any ideas about what this is or how to correct it, please let me know.

Thanks

Answer

As a follow-up, I ended up contacting Microsoft about this issue. To resolve it, we suspended the database copy and then resumed it. That seems to have fixed it for now, but I am continuing to monitor.

The 2 errors I was getting after running:

Get-MailboxDatabaseCopyStatus * | ft Name,Status,CopyQueuelength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize

Were:
{Exchange1,MapiDagNetwork,An error occurred while communicating with server 'Exchange1'. Error: The requested address is not valid in its context}
and:
{Exchange1,MapiDagNetwork,An error occurred while communicating with server 'Exchange1'. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine.}

We ran:
Suspend-MailboxDatabaseCopy MailDB01\Exchange3
Resume-MailboxDatabaseCopy MailDB01\Exchange3
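
If every copy on the node shows the error, the same two commands can be applied to all of them in one pass. This is only a minimal sketch, assuming it is run in the Exchange Management Shell. The function name Restart-CopyReplication and the suspend comment text are made up for illustration and are not what Microsoft had us run.

```powershell
# Hypothetical helper: suspend, then resume, every passive database copy on one server.
function Restart-CopyReplication {
    param(
        [Parameter(Mandatory)]
        [string]$Server
    )

    # Get-MailboxDatabaseCopyStatus -Server returns one status object per copy on that server.
    foreach ($copy in Get-MailboxDatabaseCopyStatus -Server $Server) {
        # Skip copies that are currently active (mounted); only passive copies are suspended.
        if ($copy.Status -eq 'Mounted') { continue }

        # Name has the form 'MailDB01\Exchange3', which identifies the copy.
        Suspend-MailboxDatabaseCopy -Identity $copy.Name -SuspendComment 'Resetting log copier connection' -Confirm:$false
        Resume-MailboxDatabaseCopy -Identity $copy.Name

        # Emit the copy name so the caller can see which copies were touched.
        $copy.Name
    }
}

# Example: Restart-CopyReplication -Server 'Exchange3'
```

Afterwards, checking Get-MailboxDatabaseCopyStatus -ConnectionStatus again on that server shows whether the IncomingLogCopyingNetwork error has cleared.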

Hopefully that helps someone else, with considerably fewer headaches.

Answer

Hi @TRDx2,

Thanks for the detailed information.
Since "Get-MailboxDatabaseCopyStatus -ConnectionStatus | FT Identity,IncomingLogCopyingNetwork" on Exchange3 shows an error on all 5 databases, the problem may be caused by network issues between your PR and DR sites.
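
A successful ping does not show that the replication traffic itself gets through, so a TCP check against the replication port may be more telling. The sketch below assumes the DAG still uses the default replication port, TCP 64327. If it was changed, the configured value should appear in the ReplicationPort property of Get-DatabaseAvailabilityGroup -Status. The function name Test-ReplicationPort is hypothetical.

```powershell
# Hypothetical helper: report whether a TCP connection to the replication port succeeds.
function Test-ReplicationPort {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName,

        # 64327 is the default DAG replication port; assumed unchanged in this environment.
        [int]$Port = 64327
    )

    # Test-NetConnection attempts a TCP connection and reports the result in TcpTestSucceeded.
    $result = Test-NetConnection -ComputerName $ComputerName -Port $Port -WarningAction SilentlyContinue
    [bool]$result.TcpTestSucceeded
}

# Example, run from Exchange3: Test-ReplicationPort -ComputerName 'Exchange1.Domain.com'
```

A single successful test does not rule out intermittent drops on the link, so it may be worth repeating the check around the times the 2153 events are logged.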

Besides failover testing, please also try removing a database copy from Exchange3 and reseeding it as a test, if possible.

In addition, is Exchange3 using the same hardware as Exchange1 or Exchange2?
According to this case, a disk I/O problem could also be the cause.


Answer

I have paused replication on each database and then resumed it. There were no real issues with catching up. I created a new database and seeded it to the node in question without issue. I have activated the databases on Exchange3 and there were no noticeable issues. When I put Exchange3 in maintenance mode and restart it, the error seems to go away for a short time, about an hour or so. Then it slowly comes back, one database at a time.

To answer your question about hardware: all Exchange servers are virtualized. Exchange1 and Exchange2 are on the same hardware. I am not sure what hardware Exchange3 is on.

I have also seen the article you referenced. I would also like to know which perfmon counters the original poster used to identify the issue. When I monitor disk read and write queue length, I don't see anything that would indicate an I/O problem. Are there other counters that would be more useful for identifying a disk issue?
