3.1.5.1.1 H3 Hash

RDC uses a rolling hash function called H3 in determining its chunk boundaries. H3 computes the hash of the hash window bytes (denoted as windowSize below) up to and including a particular byte in the file. Because it is a rolling hash function, the H3 hash of a given range is defined in terms of the hash of the range starting and ending 1 byte earlier.

shift_amount is an unsigned 32-bit integer that depends only on windowSize, and is used in computing H3. The shift_amount is calculated as follows.

 shift_amount := 1
 i := 32
 while ((i > 0) AND ((windowSize MODULUS i) ≠ 0))
 BEGIN
       shift_amount := shift_amount times 2 
       i := the integer truncation (floor) of (i divided by 2)
 END
 shift_amount := shift_amount MODULUS 32

hash_valuei is an unsigned 32-bit integer that contains a calculated hash value. By definition, hash_value-1 is zero. Computing the H3 value for the windowSize byte ending at byte i depends on hash_valuei-1. For the purposes of H3, all data bytes before the beginning of the file are treated as having a value of 0.

trailingEdgeData and leadingEdgeData contain the data bytes (unsigned 8-bit integers) that are at the trailing edge and leading edge in a sequence of windowSize bytes. These data bytes become indices into the H3 lookup table for the purposes of the hash calculation.

To add leadingEdgeData to the hash value and remove trailingEdgeData (that is, to compute the H3 value of the windowSize bytes ending with the leading edge), compute the following.

Note Subscripts are expressed with parentheses (for example, "hash_valuei" would be expressed as "hash_value(i)").

  
 hash_value(i) :=
       rotate_bits_left(
             hash_value(i-1) XOR 
             H3 lookup table[trailingEdgeData] XOR 
             H3 lookupTable[leadingEdgeData], 
             shift_amount
       )

This becomes the new hash_value for moving the window 1 byte farther along.

Note that when processing near the beginning of the file (when the leading edge is less than the hash window), the trailing edge will actually be less than zero (before the beginning of the file). In this case, H3 considers this nonexistent data to be all zeros.

The following pseudocode shows how hash_value is computed for each offset i.

Note Subscripts are expressed with parentheses (for example, "hash_valuei" would be expressed as "hash_value(i)").

  
 hash_value(-1) := 0
 FOR i = 0 to min(windowSize-1, Length of the file - 1)
       hash_value(i) :=
        rotate_bits_left(
             hash_value(i-1) XOR 
             H3 lookup table[0] XOR
             H3 lookup table[File[i]], shift_amount)
  
 FOR i = windowSize to Length of File - 1
 BEGIN
       trailingEdgeData := File[i - windowSize]
       leadingEdgeData := File[i]
       hash_value(i) :=
        rotate_bits_left(
             hash_value(i-1) XOR 
             H3 lookup table[trailingEdgeData] XOR
             H3 lookup table[leadingEdgeData], 
             shift_amount)
 END

For an example of H3 hash calculation, see section 4.2.