2.1 LZ77+Huffman Compression Algorithm Details

Article
04/23/2024

The overall compression algorithm for the Huffman [IEEE-MRC] variant can handle an arbitrary amount of data. Data is processed in 64k blocks, and the encoded results are stored in-order. After the final block, the end-of-file (EOF) symbol is encoded. Each 64k block is run through three stages, which are performed in this order:

Perform LZ77 ([UASDC]) compression to generate an intermediate compressed buffer.
Construct canonical Huffman codes.
Process the intermediate LZ77 data, and re-encode it in a Huffman-based bit stream.

The algorithm cannot start Huffman encoding until it has computed the Huffman codes, and it cannot compute the Huffman codes until it knows the frequency of each symbol in the Huffman alphabet. To compute these frequencies, the algorithm first performs the LZ77 phase. For efficiency, the algorithm SHOULD store the LZ77 output so that the final phase does not have to recompute it.

The final compression format consists of two parts:

The first 256 bytes indicate the bit length of each of the 512 Huffman symbols (see prefix code).
The remainder of the data is a sequence of Huffman symbols, along with match lengths and distances.

The Huffman alphabet consists of 512 symbols, each with a numeric value in the range 0-511. The symbols 0-255 represent literal values that correspond to raw byte values as opposed to matches. The symbols 256-511 represent matches or references indicating that the next several bytes are the same as some bytes that previously occurred in the data. Each match consists of two encoded integers: a length and a distance. When the decoding method encounters a match symbol, the original data is reconstructed by copying <length> bytes from the position in its previously decompressed data of <[decompression cursor] – [match distance]>.

2.1 LZ77+Huffman Compression Algorithm Details

Additional resources