Thoughts about Multi-Master Replication of Tree-Structured Data
I've been enthusiastic about ideas around independent updates to trees of data (e.g. XML or XAML documents) for almost five years now. I'd like to share some of my thoughts... Yeah, I spend a lot of time thinking about weird stuff.
Sorry it's such a long post... I'm sure this violates some rule of blogging but what the heck... Also, sorry it took so long. I got mired up in how to conclude it.
First of all, I'm considering what it would mean to share XML documents which are, of course, basically trees. Yeah, you can use XPATH to make second-class "other" links so that a document may be a full graph but they are essentially trees. Also, from my perspective, I think of XML as an Infoset which basically means the semantics of the tree matters but not the representation (which can be an arbitrary binary format or even-if-you-must, angle-bracket laden ASCII). The key features here are that the data is structured as a tree and the document has a unique identifier within a widely shared namespace (more on this below).
Assume a mechanism for plopping an XML document in a widely shared namespace. URIs are a fine example of a widely shared identifier for a document. Now, assume multiple systems can suck replicas onto themselves. We are considering a system in which it is perfectly acceptable to make a change on a replica and those changes will eventually filter to the other replicas. Anyone may just whack on their copy without coordinating with the other replicas. This allows programmatic usage patterns which have eventual consistency. Please note that, while I am a bit nutty, I actually do know this will not work for all usage cases. Von Neumann architectures are how we base most of our current computing and, believe you me, people can make a big mess of them (non-useful crap) or they can build useful crap. So, for now, don't worry about the programmatic implications, how to use it, and the ways to abuse it... bear with me... Let's consider the mechanism and how it behaves when stimulated (we'll talk about usage below).
1) The Proposed Mechanism for Multi-Master Replication of Tree-Shaped-Documents
When Replica-R1 starts whacking on document D1 under transaction T1, it changes multiple parts of the tree and then commits. You can pop this set up and find the lowest root of the subtree of the document containing all the changes made to D1 at R1 by T1:
Furthermore, it is quite viable to have the local replica (R1) modify the document-tree to include both SPATIAL and TEMPORAL information within its bushiness. So the new representation of the replica contains the original version (VO) and the new version (T1 as created at R1 as descended from V0). A couple of comments on this:
- Version V1 refers to the version of the subtree. At this point in the discussion, there is no difference between the version of the subtree and the version of the whole document-tree. We will get to cases coming up where there IS a difference.
- Version V0 is the entire tree (assuming V0 created the entire document-tree back at the beginning of time).
- The concurrent management of multiple versions of the tree means that the program must (implicitly or explicitly) declare what version it wants to examine and then it can navigate the tree as represented at that point in time (well really, at that point in time and "replica-visibility" since there will be different opinions across different replicas).
- Note: Everything about this model assumes subjective examination of data. At all times, you are AT a replica and only know what the replica knows. Time takes on a different meaning. Each replica has seen potentially different sets of changes made at other places. The only deep-seated rule is that IF you know about a change made at a different replica, THEN you know all the versions THAT OTHER replica knew at the time it made the change... Knowledge exchange is in version-parent order.
- Note that the precise definition of the new version is "T1 as changed at R1 and descended from V0". It is essential that the new version-id represents who (which replica) made the change and which parent versions are in the ancestry of this version.
1.1) Trees of Versions over Trees-of-Data
When updates are serializable, it appears as if every transaction is preceded by a crisp set of other transactions. You can think of the state of a database as a version and you can think of each version of the database as having exactly one parent. Version V1 is followed by V2, V3, V4, ... As you attempt to rationalize what to do in looking at a database (or in our example, a tree-shaped data structure), it is very clear what to do. You look at the state of the document-tree in Version-Vn and you whack on it within a transaction. When the transaction commits, you have created Version-Vn+1.
For this reason, I consider a single-master update behavior (i.e. serializable within the document behavior) to cause a version history that is a list. Each version has exactly one parent. Each parent has exactly one child (well... the latest version has no children but you get the drift...).
Now, let's consider what happens when you allow independent updates. When this occurs, you can begin to get more than one child of a given version. If I am on Replica R-1 and you are on Replica-R2 and we independently update version V-1, we now have two sibling versions (as soon as the knowledge of the concurrent changes gets communicated). This has created a tree for the version history.
OK, so now you can see how we get version-trees.
1.2) Directed-Acyclic-Graphs of Versions (over Trees-of-Data)
Below, we will describe some mechanisms by which you can ask for two sibling (or other non-ancestor and non-descendent) versions to be replaced by a new version. For now, let's just assume that you can take two sub-tree versions emanating from the same node-in-the-data-tree and cause them to be replaced by a third sub-tree emanating from the same node-in-the-data-tree. So, Version-X and Version-Y (of the same sub-tree) get replaced by Version-Z and this is all performed by transaction T3.
So, a given subtree (in the data-space perspective) can have a complex (but ordered) history of ancestry that results from disconnected updates and reconciliation of the divergent versions. In this way, changes are made independently and then conflict cleaned up later.
I do know that this is just a sharp and dangerous mechanism that can lead to horrible and awful messes. So is a Von Neumann machine! We will discuss below some usage patterns that result in nice behavior after we finish outlining the proposed behavior of the plumbing.
1.3) Vector Clocks as Version Numbers
We have three interesting requirements for the version numbers used in this scheme:
1) They can be uniquely generated by independent Replicas
2) Given two version numbers, you can easily determine their relationship/ancestry:
a) The first is the ancestor of the second,
b) The second is the ancestor of the first, or
c) Neither is the ancestor of the other (e.g. siblings or cousins or distant cousins)
3) Independently merging two versions will predictably result in the same new version-id for the merged version
As we shall see, there are going to be mechanisms for (sometimes) declaratively specifying what the contents of a tree would look like when two independently created versions are merged together. This will mean that the contents of the merged versions can be predictably calculated by any replica (we'll see this below). We also want to ensure that this identical result is identically labeled with the same version-id.
These are solved nicely by using a Vector Clock. Version-Ids created as vector clocks comprise a set of [replica-id, version#] pairs. Whenever a replica R(x) actually performs an update to an older version of the document, a new version-id is generated in which the Version# within the pair [R(x), Version#] is larger than any other version number for R(x). Since it is that replica which is creating the new version-id, this is easy to guarantee.
Consider the following scheme:
- Case 1: The Initial Creation of a Document-Tree at Replica-J
A vector-clock-id is created with both the name of the creating replica and a "1"
- Case 2: Subsequent Modifications Are Made at Replica-K
So, after R(J) makes version#7, replica R(K) updates that version. The new version has a vector clock annotating the latest update at R(J) and the latest update at R(K).
Replica: R(J) Replica: R(K)
Version# 7 Version# 1
Consider sequential updates as, one-at-a-time, the different replicas make changes. As time goes on, the version numbers increase for each of the replicas. Next, let's consider what happens when separate updates occur which cause the versions to diverge (i.e. there is NOT an easy hand-off of the baton with sequential updates).
- Case 3: A Separate Modification is Made at Replica-L based on the original version made at R(J).
So, R(L) has seen this document shortly after it was first made at R(J) -- say version#3. Based on this, it has whacked away to make two changes at R(L). This version is not aware of the independent changes made at R(K) nor is it aware of versions 4 through 7 made at the original replica, R(J).
Replica: R(J) Replica: R(L)
Version# 3 Version# 2
- Case 4: An Explicit Merge of the Changes from Replica-K and Replica-L Is Made at Replica-M
So, now the versions from replica R(K) and replica R(L) bump into each other and are unified into a single new version incorporating the 7th change at R(J), the first change at R(K) and the 2nd change at R(L). This yields the following version-id:
Replica: R(J) Replica: R(K) Replica: R(L) Version# 7 Version# 1 Version# 2
Again, the interesting points about the use of vector clocks are that they can be independently and uniquely generated and you can determine and manage ancestry. The bummer about vector-clocks is that the size of the ID grows as the number of replicas performing changes increases. Still, they are the coolest version scheme I am aware of.
1.4) Programmatic Merging of Schismatic Versions
Let's consider what happens when a schismatic pair of versions are independently created at different replicas. So these are sibling (or other non-ancestral) versions.
OK, so one way that two schismatic versions [like Version-V(X) and Version-V(Y) above] can be replaced with a reconciled version is by explicit programmatic calls.
So, a program running on Replica-R(M) can say: "I am aware of Version-V(X) and of Version-V(Y) and I am explicitly replacing them with Version-V(Z)". This is called Programmatic Merging. In this case, the vector-clock for the version-id would have a unique version-number slapped on it for Replica-R(M). In other words, the Version-ID [Version-V(Z)] would look just like any other update, not treated specially because it is a reconciliation. Also, it would be clear from examining the version numbers that both Version-V(X) and Version-V(Y) are ancestors of the new Version-V(Z), just like in the picture above.
1.5) Trivial Merging of Schismatic Versions
When two schismatic versions are IDENTICAL (i.e. the bits in each modified subtree are the same bits), we can automatically reconcile these independently created changes. So, if Version-V(X) and Version-V(Y) are independently created to modify the same contents of the document in exactly the same way, let them merge. The data value for the tree is obvious and the version-numbers are automatically converged at any replica that sees the changes. Note that each of these versions is an ancestor of the merged version (and that is easily detected by the vector-clock). Identical updates will rapidly settle out across the collection of replicas.
Note that this property is really cool because the work on a server pool is idempotent (to the extent the results are identical). You only want to STATISTICALLY perform the work once and guarantee you perform it at least once. The mutually agreed answer will settle out (as long as it has the same data value). We will discuss some usages (and constraints) on this below.
1.6) Declarative Merging of Schismatic Versions
An additional option is to Declaratively Merge divergent versions. Here's the idea:
- Assume there is a schema for the tree contained in the document. I am not hung up on if it is an XSD schema or some other form of schema. We're just going to want to decorate some nodes in the hierarchy and have a way to store and manage that decoration.
- Optionally decorate some siblings as automatically reconcilable. For example, If Joe works on Chapter-J of a book on Replica-J and Sam works on Chapter-S of the same book (in the same document) on Replica-S, it is possible to declare separate chapters are automatically reconcilable.
- When independent changes (made at separate replicas) are made to disjoint sub-trees that are annotated as automatically reconcilable, when those changes come together, a unified version can be independently created (with its own version-id). This new version can be calculated at many different replicas with a predictable outcome.
- So, the contents of the merge can be independently calculated at a random Replica-R and the version number can be independently calculated via the vector-clock.
This means these declaratively merged versions will rapidly settle out in a network of replicas. The two updates (which are independently merge-able) will propagate independently through the set of replicas without any anomaly.
It is the ability to define schema for siblings that are automatically reconcilable that makes trees such a cool foundation for replicated data. Related to this is the notion that a platform (which understands the schema and the reconciliation mechanisms) is implemented separately from the application. The platform does these tree-and-schema-decoration-stuff. The application specifies the schema and when/where/how siblings may be declaratively merged. This declarative merging is application specific (i.e. varies across apps).
1.7) Bushiness Challenges when You Don't Use Declarative Merging
If there are no stopping points for the version merging, independent updates can rapidly rise up to encompass vast swaths of the tree. What is the root of the subtree being versioned? It is the lowest root that spans both the transaction making the change and the versions being superceded. This is needed to cope with independent updates of different granularities being coalesced.
As you can see above, the size of the subtree contained by a change may have to grow to compare it to another potentially conflicting change. In a wildly replicated environment, many different subtree-scopes may occur as different replicas see different sets of updates. It does settle out to the same answer but the tree can grow quite a bit. Furthermore, attempts to programmatically reconcile may need to incorporate larger collections of changes that have occurred independently.
This dilemma is dramatically reduced in the presence of schema-annotated stopping points across which declarative merging is performed.
1.8) Eventual Accretion of Shared Knowledge
Think of any delta to the document created by any replica as a unique factoid. If you haven't heard this new factoid, it is news to you and you add it to the document. If you have heard the news, it ain't news...
Each time replicas share knowledge, they compare notes: "I'll tell you what I know if you tell me what YOU know...". In each case, it is the set of factoids that are compared and new ones shared with the other replica.
In this fashion, the set of factoids sloshes around the set of replicas. In the absence of new change, the knowledge (i.e. the collection of factoids) shared by the set of replicas will be the same after everyone compares notes.
This has the LOVELY property of being idempotent. It never hurts to say what you've learned to someone else... If they've heard the news, no harm.
2) Usage Patterns for Applications
It seems to me that all interesting uses of multi-master replication have SOME convention by which different machines whack on different portions of the shared data. This happens with high probability (via some agreement) or the collected system lives with some inconsistency in the outcome. You see "last-writer-wins" and other similar (draconian) reconciliation policies as being accepted because the USERS of these systems do their darndest to ensure that the different replicas don't step on each other's toes.
2.1) Roles and Disjoint Subtrees
The more I've thought about this the more I believe that there is a notion of a ROLE within the application that is semi-statically assigned to a single replica. I say semi-statically because I may have the role of USER:PAT and work on my laptop. Later, I continue working on my desktop. Even later, I do more stuff on the computer at my brother-in-law's house. Assuming reasonable connectivity, there is only one updater at a time for that role. In another example, Purchase-Order-P1 is updated on Replica-R(X) who is assigned as the primary server for that purchase-order. When Replica-R(X) fails because a squirrel chews through a power cord, the backup server with Replica-R(Y) steps up to the plate and assumes that role.
It is important to recognize that this doesn't just plain work for anybody whacking on anything. I was trained in the transactional read-set/write-set thinking and this would be a complete disaster if anybody just changed anything... I thought the World Wide Web would be a disaster, too, because it wasn't transactionally consistent. Well, I was wrong (just a little...) and now I am trying to see the patterns that cause success... People make shared access to replicas work by applying conventions which partition the data into sets which are updated by roles which are semi-statically defined.
2.2) Patterns Used by the Roles
So, for each role's data, only that role will write that portion and everyone can read it. This is done on a semi-static basis. As descibed above, the replica implementing the role may change in a (highly-probably) semi-static fashion. Also, portions of the shared document may be formally transferred from one role to another in a controlled fashion.
Here's a few obvious uses:
Here's the subtree that Pat writes messages into. Here's the subtree that Joe writes messages into. Here's the subtree of messages from Pat directed exclusively to Harry.
- My Private Data
Stuff that only my role whacks on. There are two good reasons for keeping this replicated. First, it automagically supports takeover (assuming some sane strategy for selecting which replica is working on this chunk of data). Second, it can coalesce related data into a single data-structure. This is very much like the paper forms we've all seen with sections marked as "For Office Use Only" on the little square on the bottom of the page... It's really nice that the "Office Use Only" information moves around with the rest of the paper. The downside is that the information is not really private other than by convention.
- One-at-a-Time Shared Data
In this approach, the replicas explicitly coordinate the transfer of "Writeability" between the roles via messaging. Imagine a message in the Pat-Sending-Stuff-to-Harry subtree which means: "OK, Harry, you can work on the subtree containing the Purchase-Order now. I'll wait until you say it's my turn."
This is where the guarantee of the ordering of versions is essential. If Harry can read the: "OK, Harry..." message in the XML document, he can see everything Pat saw in that specific document at the moment the version containing the message was committed at Replica-R(Pat).
2.3) Homage to Our Parents -- The Impact of the "Paper-Form Model"
I am always fascinated by the fact that whenever I am confused about how to make something work WITH a bunch of computers, I look to understand how it works with a bunch of companies or departments WITHOUT a bunch of computers.
In the old days, each office worker had an IN-BOX and an OUT-BOX. Paper forms moved around the departments. You never erased anything from those forms (see Accountants Don't Use Erasers). Sometimes, the forms were copied and each copy had annotations added but when that occurred, you made sure all the copies (with their annotations) were attached together and placed into the file for the unit-of-work in process. Forms connected the departments and launched internal processes. These internal processes were cross-referenced to the incoming work that stimulated them (with reference numbers to access folders in a file cabinet).
This scheme WORKED. There was no "two-phase commit". Everyone made local decisions based on the paper sitting on their desk... No one used erasers.
Example with Paper Forms: Buying a House
Consider a house purchase (at least as I am familiar with the process in California and Washington State). The buyer and seller negotiate with offers and counter-offers each of which expires within a few days unless a signed acceptance is received. This opens a window (typically a month long) in which the house is "under contract" and the buyer is typically scurrying around trying to get financing.
Now, everybody trusts the escrow company. Consider some of the parties behavior:
- The buyer coughs up the down payment, signs away an ungodly amount of money to be paid each month, agrees to be liable if some doofus breaks an arm on their property, and (hopefully) gets to move in a few days later. When they sign up, they don't KNOW they will own the house. They believe (with good reason and high likelihood) that they will EITHER get the house and its obligations OR they will get their money back and not be in debt. Only a day or two later do they find out which... the buyer should wonder and worry until they hear the outcome.
- The seller hands over the title to the house conditioned upon the receipt of sufficient funds and a few other details. They know that EITHER they get the money and any existing loans will be paid off (and give up the property) OR they will keep the property and nothing changes (I actually had this happen to me once selling a home). The seller should lay awake at night wondering if it will really happen after signing the papers.
- The lender ships over a ton of money to the escrow company and trusts the escrow company to EITHER get the loan docs signed, ensure the other obligations on the property are paid off, and ensure the lender has its encumbrances recorded on the title OR they get their money back. The lender won't worry much (because they'll find a different way to make money if the deal pops and they get a refund) but they still don't know if the deal will happen or not (I walked away from an escrow signing once because the terms of the note were different than I had been led to believe... the loan company got their money back and nothing for their inconvenience but they were the buttheads that misrepresented the terms of the deal).
There are more complexities to this but the summary is that EVERYONE trusts the escrow company and specifies their EITHER-OR rules. The escrow company has to ensure that ALL the requirements of all the participants are met or the deal is off.
What is interesting about this example is that there are file-folders of both original documents and copies of the documents. Eventually, all the interested parties are provided with ALL the documents. The order in which they are written is fascinating, too. Some of the documents are written by the buyer and then either an original or copy signed by the seller. In some jurisdictions, an original deed-of-trust must be physically signed and notarized by both the seller and buyer. I've done a few transactions where the original (with the other party's signature) must be FedEx-ed back-and-forth to get the same original signed by parties in different states. There doesn't seem to be much confusion over concurrency control w.r.t. the file folders of documents. Everybody follows conventions that allow all them to take independent actions and send copies to the other parties (and sometimes that original is handled with special treatment and filed at the county). Everyone ends up with a complete set of copies of everything related to the sale of the home and the transfer of title. The folder operates just like the multi-master document being described herein.
Notice that each of the uses listed above will be apparent in the file folder. Messaging occurs as coordination letters are sent (these are independently written). Settlement statements are generated by the escrow company then handed to each party for initialing or signing. Other documents are supplied by buyers, sellers, and escrow without needing modification by anyone else.
You can shift escrow officers (people within the same escrow company) by handing over the file-folder as the escrow-role is semi-statically transferred to another escrow-officer. If the escrow-officers do their work in a functionally-predictable fashion, it actually doesn't matter if two escrow-officers independently generate the same document. When this is detected, the two will be compared to ensure they are identical; if they aren't an analysis is made to see what new document should take its place. Assuming predictable behaviour, the redundant work is trivially-reconciled as described above. Typically, the file-folder within the escrow-company is only modified by one escrow-officer at a time (or the allocation of work is predictably partitioned).
Note that the buyers, sellers, lenders, real estate agents, and more all have other folders in their lives. The buyer may be maintaining an investment portfolio and information about this property is transmitted (copied) into another folder summarizing this properties acquisition costs. There will be a cross-reference to the folder about the purchase. The lender will have a folder tracking the state of the loan, its payments, and its fluctuating value on the secondary markets. This would be an independent folder referencing the folder about the creation of the loan. All these related business folders would be modeled as separate documents in the scheme outlined herein.
Basically, I think our ancestors were smart... We are the ones getting confused when we tried to do distributed transactions and pretend all this doesn't need to happen. Still, the model for accumulating related changes coming from different participants (and letting everyone see the accumulated folder) seems to model the real world.
2.4) Tying to "Outside Data"
A while back, I wrote a paper called Data on Outside versus Data on the Inside. In this, document, I point out the difference in how data is treated when sent around in message ("Outside Data") versus how it is treated when completely contained within the transactional scope of a database ("Inside Data"). All these usage patterns are outside data.
2.5) A Bit about Scalability
The trick about scalability is keeping your decisions ALWAYS on a single machine. To do this under repartitioning, you need to think about the granularity of the data. In my paper on Life Beyond Distributed Transactions, an Apostate's Opinion, I posit the need to break the data into "entities" which have a unique key. These entities are the scope of a transactional boundary.
In the scheme outlined in this massive missive, the "roles" which are amorphously described above take on the job of the entity. By application convention, updates are only done to a portion of ANY of the related documents by a single role. The "order-processing" role may work on the "order-processing" portion of the purchase-order at the same time as it works on the "order-processing" portion of three shipping-requests. The purchase-order is shared with the customer. The shipping-requests are shared with the "shipping-system". Customers and shipping-systems write to different parts (because of their roles) than the order-processing system.
Actually, it is more correct to say the "role-combined-with-entity" aligns with the entity notion in the paper. The order-processing service will partition its work by order-id. The work for order-id-x will span multiple documents which are shared with different services in this complex composition of work.
We've discussed (in WAY too much detail and, also, WAY too little detail) an idea for how we can view multi-master updated tree-structured documents as the connection between computational components. Arguably, this provides real value in offline, replication, and many other perspective of connecting the computation of a distributed system.
The notion of a "document" describe herein is seminal to this working. Within the bounds of a single document, a change generated by replica-R(X) by transaction-T(X) cannot be seen by any other replica unless all of the changes within that document visible to transaction-T(X) are also visible to the other replica. This is not the same as a generalized causal ordering and is much easier to implement. Specifically, changes to different documents do not have ordering guarantees across the documents (even if the changes are created by the same transaction at the originating replica).
-- I'll write more (shorter) blog entries soon. - Pat