3.1.9.1.2.4 Compressed Segment Trailer

The bit stream can end with some number of unused bits (0-7) in the last byte, which MUST NOT be decoded. (Attempting to decode can overrun input and produce too many bytes of output). The value of the last byte in the compressed segment indicates the number of unused bits in the final byte (some value between 0 and 7, inclusive). The five high-order bits in the last byte of the compressed segment are reserved.

For example, if the encoding of a stream produces 217 bits, the stream is 29 bytes in length. The first 27 bytes plus the most-significant bit of the 28th byte comprise the bit stream. The 29th byte has the value 7, indicating that 7 bits (of the 28th byte) are to be ignored. The total length of a segment's bit stream is:

NumberOfBitsToDecode = ((NumberOfBytesToDecode - 1) * 8) - ValueOfLastByte.

There is no "end of block" token or other marker. The decoder MUST stop when this number of bits has been decoded.

Huffman symbols or "tokens" are defined for:

  • A literal (a single byte to be output).

  • A match, including the distance back into the history from which to copy.

  • An unencoded sequence, introducing some number of bytes to copy directly from input.

Most literals are encoded with a "0" prefix, followed by 8 bits containing the byte to output, most-significant bit first. Some selected literals MUST be represented using a shorter token varying between 5 and 8 bits, all beginning with a "11" prefix. The nine-bit encodings that would otherwise represent these literals are reserved and MUST NOT be used to encode these literal values.

A match token is followed by some number of bits indicating the number of bytes output since the needed bytes or the "distance" backward. Each token has been assigned a different base distance and number of additional value bits to be added to compute the full distance. Additional value bits are presented most-significant bit first. A match length prefix follows the token and indicates how many additional bits will be needed to get the full length (the number of bytes to be copied). Most of the match length prefixes have been defined so that a decoder can simply count the number of "1" bits until a "0" bit appears to determine how many value bits follow.

The distance is not a buffer offset, but instead indicates the number of bytes that have been output since the first of the bytes to be copied. A linear buffer is often used to record recent history, with a "cursor" indicating the buffer offset where the next byte will be placed, wrapped around to the beginning of the buffer when the end is reached (also known as a "ring buffer"). With this approach, the distance can be subtracted from the cursor offset, while compensating for any buffer wrap-around, if applicable, which might have occurred since the needed bytes were decoded.

A match distance of zero is a special case, which indicates that an unencoded run of bytes follows. The count of bytes is encoded as a 15-bit value, most significant bit first. After decoding this count, any bits remaining in the current input byte are ignored, and the unencoded run will begin on a whole-byte boundary. The ignored bits, plus 8 bits for each unencoded byte, are also considered part of the total number of bits in the input stream. If any bits remain after an unencoded run of bytes, decoding continues with the most-significant bit of the first byte following the run.

The following table contains all the defined tokens. Any token or bit sequence that is not defined in this table is reserved.

Bit Prefix

Decimal

Value Bits

Purpose

0

0

8

Literal xxxxxxxx (excluding those literals with shorter codes described in this table for which their 9-bit representation is reserved)

10001

17

5

Unencoded literal sequence (10001 00000), or               

Match distance 1...31 (10001 xxxxx)

10010

18

7

Match distance 32...159

10011

19

9

Match distance 160...671

10100

20

10

Match distance 672...1695

10101

21

12

Match distance 1696...5791

101100

44

14

Match distance 5792...22175

101101

45

15

Match distance 22176...54943

1011100

92

18

Match distance 54944...317087

1011101

93

20

Match distance 317088...1365663

10111100

188

20

Match distance 1365664...2414239

10111101

189

21

Match distance 2414240...2500000

11000

24

Literal 0x00 (000000000 is reserved)

11001

25

Literal 0x01 (000000001 is reserved)

110100

52

Literal 0x02 (000000010 is reserved)

110101

53

Literal 0x03 (000000011 is reserved)

110110

54

Literal 0xFF (011111111 is reserved)

1101110

110

Literal 0x04 (000000100 is reserved)

1101111

111

Literal 0x05 (000000101 is reserved)

1110000

112

Literal 0x06 (000000110 is reserved)

1110001

113

Literal 0x07 (000000111 is reserved)

1110010

114

Literal 0x08 (000001000 is reserved)

1110011

115

Literal 0x09 (000001001 is reserved)

1110100

116

Literal 0x0A (000001010 is reserved)

1110101

117

Literal 0x0B (000001011 is reserved)

1110110

118

Literal 0x3A (000111010 is reserved)

1110111

119

Literal 0x3B (000111011 is reserved)

1111000

120

Literal 0x3C (000111100 is reserved)

1111001

121

Literal 0x3D (000111101 is reserved)

1111010

122

Literal 0x3E (000111110 is reserved)

1111011

123

Literal 0x3F (000111111 is reserved)

1111100

124

Literal 0x40 (001000000 is reserved)

1111101

125

Literal 0x80 (010000000 is reserved)

11111100

252

Literal 0x0C (000001100 is reserved)

11111101

253

Literal 0x38 (000111000 is reserved)

11111110

254

Literal 0x39 (000111001 is reserved)

11111111

255

Literal 0x66 (001100110 is reserved)

Match tokens are followed by a length token:

Bit Prefix

Value Bits

Definition

0

Length 3

10

2

Length 4...7

110

3

Length 8...15

1110

4

Length 16...31

11110

5

Length 32...63

111110

6

Length 64...127

1111110

7

Length 128...255

11111110

8

Length 256...511

111111110

9

Length 512...1023

1111111110

10

Length 1024...2047

11111111110

11

Length 2048...4095

111111111110

12

Length 4096...8191

1111111111110

13

Length 8192...16383

11111111111110

14

Length 16384...32767

111111111111110

15

Length 32768...65535

A single compressed segment MUST NOT translate to more than 65,535 uncompressed bytes.