sp_mergemetadataretentioncleanup not cleaning merge metadata during regular sync causing stale data in metadata tables

I recently came across an interesting situation in merge replication and thought I should write about it:

> There is a high volume replication setup with thousands of subscriber and a very volatile(lot of transactions per second) publisher database
> Started observing huge blocking when multiple merge agents starts synchronizing
> Even if merge agents are stopped, the time taken for a single merge agent is huge
> There are millions of rows in merge metadata tables (msmerge_genhistory, msmerge_contents and msmerge_tombstone) and they are not getting cleaned up
> Running update stats every hour does not help

 > The huge backlog in the merge metadata tables was causing the merge queries to take long times. These long times were resulting in blocking when multiple merge agents were sync'ing.
 > This is because the msmerge_genhistory metadata table is referred everywhere during merge sync. Any query that joins this table (and there are a lot of them) had to get blocked since some other queries had taken locks on the same tables rows.
 > The root cause of this problem was the millions of rows those metadata tables were not just getting cleaned up.

What I found out:
 > Once I found that the root cause of the issue is the backlog in the metadata tables, ran sp_mergemetadataretentioncleanup manually to force a cleanup and it returned immediately with 0 rows cleaned.
 > Ran it repeatedly and that did not help.
 > So, its obvious that the merge agents (Note: merge agents run retention based metadata cleanup as the first step during regular sync) were also not cleaning anything from those tables.
 > Further looking into the code of this proc (by running sp_helptext sp_mergemetadataretentioncleanup) found that it checks if any other agent is running sp_mergemetadataretentioncleanup (of it this SP is being manually on the server). If yes, it just skips execusting any code and returns.

-- if somebody else is already cleaning up in this database, we simply return
    set @applockname= 'MS_sp_mergemetadataretentioncleanup' + convert(nvarchar(11), db_id())
    exec @retcode= sp_getapplock @Resource= @applockname, @LockMode= 'Exclusive', @LockOwner= 'Session', @LockTimeout= 0, @DbPrincipal = @DbPrincipal
    if @@error <> 0 or @retcode < 0 return (0)

 > From DBCC OPENTRAN, found that there was one merge agent which was running sp_mergemetadataretentioncleanup and taking long time but the user might have hit cancel and it was hung while rolling back.
 > This was causing other executions of sp_mergemetadataretentioncleanup to not work and also causing the metadata to keep getting stale causing further performance (and blocking issues)
 > Killed that spid, and stopped all the merge syncs and ran sp_mergemetadataretentioncleanup manually.
 > This took long time as it always cleans the rows from metadata tables in a batch of 5000 rows. Also it checks expired subscriptions (using sp_MSmark_expired_subscriptions) and this can take some time when you have thousands of subscribers.
 > The complete execution took long time but it cleaned millions of rows from metadata tables. There were only a few hundred rows left in the metadata tables after this.
 > After this, the merge queries ran instantly and there was a almost no blocking.

If you come across a situation where you find sp_mergemetadataretentioncleanup is not cleaning up rows from metadata tables, it will be worth to: stop all merge agents, confirm that no spids that might be executing any merge metadata commands are lying around.
After this, run sp_mergemetadataretentioncleanup manually and wait till it finishes completely. Once it finishes completely, check the table count. After this, start the merge agents again.