Microsoft Sync Framework, Part 2: Sync Metadata

Sync metadata is the cornerstone of Microsoft Sync Framework, just as it would be of any other sync solution. We need sync metadata for two reasons: to track data changes and to detect conflicts. Change tracking (or change detection) is based on a simple comparison of the current data state against a previously recorded data state. Similarly, conflict detection is based on a comparison of the current state of the local data against the state of the data being applied to the local data store.

Sync versions

Microsoft Sync Framework associates a data state with a version. The version is essentially a tuple consisting of the Id of the sync endpoint which made a change and the logical clock value at the time the change was made. The logical clock is the replica's own clock, which is monotonically increasing, so we have a guarantee that each subsequent change made to the data on that replica gets a version with a greater logical clock value than any previous change made on that replica. For example, if we know that replica A made a change to an item I1 at logical clock time 5, we record the last update version for item I1 as A5. I'll use this notation throughout other posts in this blog. Different replicas in the community can have different logical clock values and different rules for incrementing them (in other words, a logical clock value can be incremented arbitrarily as long as the ever-increasing requirement holds true).
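To make the A5 notation concrete, here is a minimal Python sketch of a version and a replica clock. The names SyncVersion, Replica and record_change are made up for this illustration and are not the Microsoft Sync Framework API; the only rule the sketch enforces is the one stated above, that a replica's clock never goes backwards.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SyncVersion:
    """A (replica id, logical clock value) pair, written like A5 in the text."""
    replica_id: str
    tick: int

    def __str__(self) -> str:
        return f"{self.replica_id}{self.tick}"


class Replica:
    """Hypothetical replica that owns its own monotonically increasing clock."""

    def __init__(self, replica_id: str, start_tick: int = 0) -> None:
        self.replica_id = replica_id
        self._tick = start_tick
        # item id -> version of the latest change made on this replica
        self.last_update: Dict[str, SyncVersion] = {}

    def record_change(self, item_id: str, step: int = 1) -> SyncVersion:
        # Any positive step is fine; the clock only has to keep growing.
        if step < 1:
            raise ValueError("the logical clock must strictly increase")
        self._tick += step
        version = SyncVersion(self.replica_id, self._tick)
        self.last_update[item_id] = version
        return version


replica_a = Replica("A", start_tick=4)
v1 = replica_a.record_change("I1")           # clock moves from 4 to 5
v2 = replica_a.record_change("I1", step=10)  # an arbitrary jump is still valid
print(v1, v2)
```

Note that the sketch deliberately does not define an ordering between versions from different replicas: the clocks are independent, so only ticks from the same replica are comparable.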

Items can be changed on different replicas. For example, the latest change to item I1 may have been made on replica A at its logical time 5, whereas the latest change to item I2 may have been made on another replica B at its logical time 12. Therefore, when we receive changes from replica B, we should receive the change for item I2 if we don't know about it yet. After receiving that change, our local sync metadata will indicate that we know version A5 for item I1 and version B12 for item I2.

Introduction to the sync knowledge

Typical sync operations like change enumeration and conflict detection require comparing the sync version of an item against other sync versions. When we do change enumeration, we need to compare the sync versions of items on the source against the sync versions on the destination, and send the changes from the source for those items which the destination doesn't know about. With nothing but a set of sync versions, the typical solution would be to send all sync versions from the destination to the source. This approach is very inefficient: the number of versions you need to send is proportional to the number of items the destination knows about. Clearly, we need an optimization.

The optimization here is that we compress all sync versions in the sync endpoint into a single compact data structure which we call knowledge. The knowledge encompasses all changes (in other words, versions of all items) which a particular sync endpoint knows about. It's important to understand that the sync knowledge used in Microsoft Sync Framework is very compact in the normal case: its size is proportional to the number of sync endpoints participating in sync, not to the number of items synchronized. Sync knowledge is a great optimization of the sync metadata of a particular replica; however, it doesn't completely replace per-item sync metadata. We still need to know when and where a given item was changed, and the item's sync version answers both questions.
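As a rough illustration of why knowledge is so compact, the Python sketch below models it as a map from replica Id to the highest logical clock value known from that replica. This is an assumption that covers only the normal case described above; it is not the framework's actual data structure, and the helper names (knows, enumerate_changes, merge) are hypothetical.

```python
from typing import Dict, List, Tuple

Version = Tuple[str, int]    # (replica id, logical clock value), e.g. ("A", 5)
Knowledge = Dict[str, int]   # replica id -> highest clock value known from it


def knows(knowledge: Knowledge, version: Version) -> bool:
    """True if the version is covered by the knowledge."""
    replica_id, tick = version
    return knowledge.get(replica_id, 0) >= tick


def enumerate_changes(source_items: Dict[str, Version],
                      destination_knowledge: Knowledge) -> List[str]:
    """Ids of source items whose last update version the destination lacks."""
    return sorted(item_id for item_id, version in source_items.items()
                  if not knows(destination_knowledge, version))


def merge(a: Knowledge, b: Knowledge) -> Knowledge:
    """Combine two knowledge maps once all enumerated changes are applied."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


# The source holds I1 at A5 and I2 at B12; the destination has seen B only up to 10.
source_items = {"I1": ("A", 5), "I2": ("B", 12)}
source_knowledge = {"A": 5, "B": 12}
destination_knowledge = {"A": 5, "B": 10}

to_send = enumerate_changes(source_items, destination_knowledge)
merged = merge(destination_knowledge, source_knowledge)
print(to_send, merged)
```

The destination sends one small map instead of one version per item, and the source still consults its per-item versions to decide what to send, which is why knowledge complements per-item metadata rather than replacing it.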

Tombstones

Just like the name suggests, tombstones are used to track dead (or deleted) items. When an item gets deleted, the sync endpoint shouldn't simply get rid of the item; rather, it should create a tombstone for it. The tombstone for an item has its own version assigned to it and is used to propagate deletions across the sync community. Clearly, tombstones can accumulate over time, but luckily Microsoft Sync Framework supports tombstone cleanup scenarios as well (more on those in subsequent topics). To keep track of cleaned-up tombstones, Microsoft Sync Framework suggests that sync endpoints store forgotten knowledge, which records the cleaned-up tombstones (or in other words, forgotten deletes).
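The delete-becomes-tombstone rule can be sketched in a few lines of Python. The Store class and its fields are hypothetical and exist only for this illustration; tombstone cleanup and forgotten knowledge are left out.

```python
from typing import Dict, Set, Tuple

Version = Tuple[str, int]  # (replica id, logical clock value)


class Store:
    """Hypothetical local store that keeps a tombstone for every deleted item."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.tick = 0
        self.data: Dict[str, str] = {}          # live item payloads
        self.versions: Dict[str, Version] = {}  # last update version per item
        self.tombstones: Set[str] = set()       # ids of deleted items

    def _next_version(self) -> Version:
        self.tick += 1
        return (self.replica_id, self.tick)

    def put(self, item_id: str, value: str) -> None:
        self.data[item_id] = value
        self.versions[item_id] = self._next_version()
        self.tombstones.discard(item_id)

    def delete(self, item_id: str) -> None:
        # Drop the payload but keep the metadata: the delete gets its own
        # version so it can be enumerated and sent like any other change.
        del self.data[item_id]
        self.versions[item_id] = self._next_version()
        self.tombstones.add(item_id)


store = Store("C")
store.put("I2", "some payload")
store.delete("I2")
print(store.versions["I2"], "I2" in store.tombstones)
```

Because the tombstone keeps a version, a peer that has not yet seen the delete picks it up during normal change enumeration instead of mistaking the missing item for one it should send back.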

Per-item sync metadata

What sync metadata do we need to store per item? First, there is the last update version: the sync version assigned by the endpoint which made the latest change to the item that we know about. Second, there is the item Id, since we need to identify the item within Microsoft Sync Framework. Third, we need to track the item's creation version, which is particularly useful in item resurrection scenarios. And finally, we need to be able to tell whether an item is alive or not.
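Put together, the four pieces of per-item metadata fit in a small record. The following Python sketch uses field and function names of my own choosing (ItemMetadata, with_update, with_delete), not the framework's types, and assumes versions are plain (replica id, clock value) pairs.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Version = Tuple[str, int]  # (replica id, logical clock value)


@dataclass(frozen=True)
class ItemMetadata:
    item_id: str
    last_update_version: Version  # who made the latest known change, and when
    creation_version: Version     # who created the item, and when
    is_tombstone: bool = False    # True once the item has been deleted


def with_update(meta: ItemMetadata, version: Version) -> ItemMetadata:
    # An update replaces the last update version only; the creation
    # version stays fixed for the lifetime of the item.
    return replace(meta, last_update_version=version)


def with_delete(meta: ItemMetadata, version: Version) -> ItemMetadata:
    # A delete is a versioned change that also marks the item as dead.
    return replace(meta, last_update_version=version, is_tombstone=True)


created = ItemMetadata("I1", last_update_version=("B", 1), creation_version=("B", 1))
updated = with_update(created, ("A", 5))
deleted = with_delete(updated, ("C", 12))
print(updated)
print(deleted)
```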

Summary

So far, we've categorized the sync metadata which sync endpoints need to track into per-item sync metadata such as last update version and creation version, global sync endpoint metadata such as knowledge and optional forgotten knowledge, and tombstones which track deleted items. In some scenarios, tombstones can be stored along with the rest of sync metadata and just identified by a simple IsTombstone flag:

ItemId   Last update version   Creation version   IsTombstone
I1       A5                    B1                 False
I2       C12                   A5                 True

This table represents what is commonly called a sync metadata store. Microsoft Sync Framework offers built-in support for a sync metadata store based on SQL Server Compact Edition.

In the next post, I'll delve into the sync knowledge and various operations it supports. Stay tuned.