1.3 Overview

Remote differential compression (RDC) is a compression algorithm that is used in other protocols, but it is not a protocol itself. For example, it is used by the Microsoft Distributed File System Replication Protocol [MS-FRS2]. A file is a typed data stream. For the purposes of this document only, file does not imply storage of the data stream in any particular medium or with any particular organization, for example, in a file system (italic is used when referring to traditional files).

Traditional compression algorithms operate by finding some kind of redundancy in the file to be compressed and then representing it more compactly. For example, the Lempel-Ziv algorithm (for more information, see [UASDC]) finds repeated substrings in the source file and replaces second and successive instances of them with references to the initial string in the compressed file. By contrast, RDC tries to find blocks of data that are identical between the source file (on the source location) and a set of seed files (on the target location). This makes it possible for the protocol that is transferring the file between two locations to reduce the number of bytes transferred between the source location and target location when reconstructing the source file on the target location.

The fundamental observation that drives RDC is this: By dividing a file into chunks and determining what chunks are already present at the target location, a correct copy of a source file can be constructed at the target location. To accomplish this, it is only necessary to transfer those chunks not already present at the target location, and combine them in the proper order with those chunks that are already present at the target location. If many chunks of the source file are already present at the target location, this technique can greatly reduce the amount of data that is required to be transferred compared to transferring the entire source file from the source location.

The set of chunks that make up a file can be expressed in far fewer bytes than the entire file by means of a signature file. The signature of a chunk is its length and a collision-resistant hash of its contents. Because hashes produce small, fixed-size outputs, for even modest-size chunks, the signature will be much smaller than the chunk.
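A chunk signature can be sketched in a few lines. The hash choice below (SHA-256) is purely illustrative; the specification defines the actual hash functions used:

```python
import hashlib

def chunk_signature(chunk: bytes) -> tuple[int, bytes]:
    """Signature of a chunk: its length plus a collision-resistant hash
    of its contents (SHA-256 here, purely illustrative)."""
    return (len(chunk), hashlib.sha256(chunk).digest())

# The signature is small and fixed-size regardless of chunk length:
# a 64 KiB chunk yields a 32-byte digest plus a length field.
sig = chunk_signature(b"x" * 65536)
```

Because the digest size is constant, a signature file grows with the number of chunks, not with the size of the file's contents.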

Any method of determining chunk boundaries will suffice to produce correct behavior. However, because it is common to create a file by inserting or deleting bytes from a previously existing file, or by combining several previously existing files, it is advantageous to have a chunking method that can find equal chunks at some point after the insertion or deletion. A necessary consequence of this property is that the chunking method is required to be data dependent. It cannot simply use fixed offsets.

The RDC chunking method computes a rolling hash over the data using the H3 hash function, as described in section 3.1.5.1.1, and selects as chunk boundaries the offsets that have the largest value of H3 within a given horizon. Because this computation depends only on the bytes within the horizon (plus possibly one hash window size of bytes before the horizon), any change outside of the horizon will not affect the chunk boundaries. Thus, the chunking will produce identical chunks sufficiently far away from the change, including insertions and deletions.
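The local-maxima idea can be sketched as follows. A simple polynomial rolling hash stands in for the H3 hash function that the specification defines, and the window and horizon sizes are illustrative, not the spec's values:

```python
def rolling_hashes(data: bytes, window: int = 16) -> list[int]:
    """Hash of each window-sized span, using a simple polynomial rolling
    hash (a stand-in for the H3 hash function the specification defines)."""
    base, mod = 257, (1 << 61) - 1
    pw = pow(base, window - 1, mod)
    hashes, h = [], 0
    for i, b in enumerate(data):
        h = (h * base + b) % mod
        if i >= window:                      # drop the byte leaving the window
            h = (h - data[i - window] * pw * base) % mod
        if i >= window - 1:                  # first full window is complete
            hashes.append(h)
    return hashes

def chunk_boundaries(data: bytes, horizon: int = 64, window: int = 16) -> list[int]:
    """Offsets whose hash is the strict maximum among all hashes within
    'horizon' positions on either side. Each decision depends only on
    nearby bytes, so an edit far away leaves the boundary unchanged."""
    hs = rolling_hashes(data, window)
    cuts = []
    for i, h in enumerate(hs):
        lo, hi = max(0, i - horizon), min(len(hs), i + horizon + 1)
        if all(h > hs[j] for j in range(lo, hi) if j != i):
            cuts.append(i + window - 1)      # boundary at end of the window
    return cuts
```

Prepending or deleting bytes shifts the boundaries near the edit, but every boundary more than roughly one horizon plus one window away from the change reappears at the same (shifted) position, which is exactly the locality property described above.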

This idea was the basis of MIT's Low-Bandwidth Network File System (LBFS) (for more information, see [LBFS]), although LBFS differs from RDC in many details.

RDC can be used within a single machine or to optimize transfer between two machines. For example, RDC can be used as part of a network protocol that engages in a handshake of the form:

  1. Machine A sends Machine B a request to transfer file fB.

  2. Machine B partitions file fB into chunks and computes the signature for each of the chunks. Machine B sends its list of chunk signatures, SigB1 ... SigBn, to Machine A. This list provides the basis for Machine A being able to reconstruct file fB.

  3. Using similarity data, as described in section 3.1.5.4, Machine A selects a file fA as a seed file and partitions it into chunks. It computes a signature for each chunk, as described in section 3.1.5.2.

  4. As Machine A receives the chunk signatures from Machine B, it compares the received signatures against the set of signatures SigA1, ..., SigAm that it has computed on its seed file (that is, file fA) in step 3. As part of this comparison, Machine A records every distinct signature value it received from Machine B that does not match one of its own signatures SigAk computed on the chunks of file fA.

  5. Machine A sends a request to Machine B for all the chunks whose signatures were received in the previous step from Machine B, but that did not have a matching signature on Machine A. The chunks are requested by offset and length in file fB based on corresponding information that was sent in step 2.

  6. Machine B sends the contents of the requested chunks to Machine A.

  7. Machine A reconstructs file fB by using the chunks received in step 6 from Machine B, as well as its own chunks of file fA whose signatures matched, as determined in step 4. After this reconstruction step is complete, Machine A has completed the transfer of fB.
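The handshake above can be simulated end to end on one machine. The sketch below uses fixed-size chunks and SHA-256 signatures purely for brevity; both are stand-ins, since real RDC uses the data-dependent chunking described earlier:

```python
import hashlib

def chunks_of(data: bytes, size: int = 2048):
    """Partition into (offset, chunk) pairs. Fixed-size chunks here are
    purely for brevity; real RDC uses data-dependent boundaries."""
    return [(off, data[off:off + size]) for off in range(0, len(data), size)]

def signature(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def transfer(f_b: bytes, f_a: bytes) -> bytes:
    """Steps 2-7: reconstruct f_b on Machine A using seed file f_a."""
    # Step 2: Machine B chunks f_b and sends (offset, length, signature).
    b_sigs = [(off, len(c), signature(c)) for off, c in chunks_of(f_b)]
    # Step 3: Machine A chunks its seed file and indexes chunks by signature.
    a_index = {signature(c): c for _, c in chunks_of(f_a)}
    # Steps 4-5: Machine A requests the chunks whose signatures it lacks.
    needed = [(off, ln) for off, ln, s in b_sigs if s not in a_index]
    # Step 6: Machine B answers with chunk contents, by offset and length.
    received = {(off, ln): f_b[off:off + ln] for off, ln in needed}
    # Step 7: Machine A stitches local and received chunks in order.
    return b"".join(received.get((off, ln), a_index.get(s, b""))
                    for off, ln, s in b_sigs)
```

The sketch demonstrates correctness of the reconstruction, not the transfer savings: with fixed-size chunks an insertion misaligns everything after it, which is precisely why the data-dependent chunking above matters in practice.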

RDC takes advantage of the fact that a list of signatures can be treated as a binary large object (BLOB), and therefore can be treated as a signature file, which is amenable to RDC. Thus, a recursive application of RDC is possible by feeding the signature file obtained in step 2 to RDC. This is useful in cases in which the file to be transferred is so large that even its signature file is large relative to the set of bytes that have changed.
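A minimal illustration of this recursion, again assuming fixed-size chunks and SHA-256 purely for brevity: serializing the (length, hash) signatures yields a BLOB that can itself be chunked and signed.

```python
import hashlib

def signature_file(data: bytes, chunk_size: int = 2048) -> bytes:
    """Serialize the (length, hash) signature of each chunk into one BLOB,
    which is itself just a byte stream amenable to RDC."""
    out = bytearray()
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        out += len(chunk).to_bytes(4, "big")       # 4-byte length field
        out += hashlib.sha256(chunk).digest()      # 32-byte digest
    return bytes(out)

data = bytes(1_000_000)           # stand-in for a large file
sig1 = signature_file(data)       # first-level signatures, ~36 bytes per chunk
sig2 = signature_file(sig1)       # recursion: signatures of the signature file
# Each level shrinks the data to describe by roughly chunk_size / 36.
```

Only when a level's signature file is small enough to send outright does the recursion stop, after which the levels are compared top-down to narrow in on the changed chunks.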