Thursday, November 19, 2009

solving frequent replication distribution latency errors

We have many millions of commands a day passing through a 2-node peer-to-peer replication topology. At some point, replication lost its ability to keep up with distribution. After much research, we replicated the issue to Microsoft and were told that our distribution database was too big. They recommended that we set immediate_sync on the publication to false.

To me, it made no sense why this would help. We had replication distribution history latency set to 48 hours - wouldn't transactions be retained for that amount of time regardless of whether they'd been replicated or not?

Luckily, I found this MSDN blog post, which does a great job of explaining the interaction between min transaction history retention, max transaction history retention, and immediate_sync. Our plan now is to set min retention to 8 hours (we need a buffer that will allow us to initialize from backup), max retention to 36 hours, and immediate_sync to false. This will keep all transactions for 8 hours, but replicated transactions after that buffer will be deleted as soon as they're replicated to existing subscribers. Transactions will only be retained for 36 hours if replication gets seriously backed up again.
Post a Comment