2.6.1.1.1 SymbolCategory Structure

Each SymbolCategory structure defines a set of symbols. Symbols are assigned consecutive values from 0 to the total number of symbols minus 1, starting with the first category and ending with the last category. Within each category, the smallest symbol value is the Base symbol value. Therefore, the Base symbol value of the first category is 0, and the Base symbol value of every other category equals the Base symbol value of the previous category plus the number of symbols in the previous category.
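The Base symbol value rule above amounts to an exclusive running sum over the per-category symbol counts. The following is a minimal sketch of that arithmetic only; the function name and the list-of-counts input are illustrative assumptions, not part of the format.

```python
from itertools import accumulate


def base_symbol_values(symbol_counts):
    """Return the Base symbol value of each category, in category order.

    symbol_counts holds the "Number of symbols" value of each category
    (hypothetical input shape chosen for this sketch).
    """
    # Exclusive prefix sum: the first category starts at 0, and every later
    # category starts where the previous one ended.
    return [0, *accumulate(symbol_counts)][:-1]
```

For three categories of 0x82 (130) symbols each, this yields Base symbol values of 0, 130, and 260.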

For every category, all symbols with values greater than or equal to the Base symbol value plus the DocIDDelta value threshold are category special symbols. The category special symbol with the smallest value is the first special symbol.
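As an illustration of the special-symbol rule, the sketch below classifies a symbol value within one category. The function and parameter names are hypothetical; the comparison itself follows the paragraph above.

```python
def first_special_symbol(base, threshold):
    """Smallest special symbol value of a category."""
    return base + threshold


def is_special(symbol, base, num_symbols, threshold):
    """Return True if symbol is a special symbol of the given category."""
    # The symbol must lie inside this category's value range first.
    if not base <= symbol < base + num_symbols:
        raise ValueError("symbol does not belong to this category")
    return symbol >= first_special_symbol(base, threshold)
```

With the required field values (0x82 symbols, threshold 0x80), each category has exactly two special symbols: Base + 128 and Base + 129.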

The Coding table array in the ExtensionCompressionTablePage structure stores the code bit sequences for all symbols in order of increasing value.

For every item containing the content index key, the DocIDDelta value is encoded in the DOCID bit stream field using the defined symbols, and the corresponding OccCount or MaxOccBucket is stored in the OccCount bit stream array in the ExtensionDataPage, as specified in section 2.6.1.2. The BitsUsed value in the SymbolCategory structure is the number of bits used to store the corresponding element in the OccCount bit stream array.
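The role of BitsUsed can be sketched as follows, combining the paragraph above with the zero-value rule given in the BitsUsed field description of this section. The read_bits callable is a hypothetical stand-in for a reader over the OccCount bit stream array; the bit order of that stream is defined elsewhere and is not modeled here. The sketch also assumes that the applicable category is the one containing the symbol decoded for the item.

```python
def read_occ_element(bits_used, read_bits, previous):
    """Return the OccCount or MaxOccBucket element for the current item.

    bits_used: BitsUsed value of the category (assumed to be the category
               of the symbol decoded for this item).
    read_bits: hypothetical callable; read_bits(n) returns the next n bits
               of the OccCount bit stream array as an unsigned integer.
    previous:  the element value of the previous document identifier.
    """
    if bits_used == 0:
        # Nothing is stored for this item; the previous value carries over.
        return previous
    return read_bits(bits_used)
```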

For every non-special symbol, the corresponding DocIDDelta value equals the symbol value minus the Base symbol value. For special symbols, the DocIDDelta value is stored after the symbol bit sequence in the DOCID bit stream field, using 16 bits for the first special symbol and 32 bits for other special symbols.
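The mapping from a decoded symbol to its DocIDDelta value might be sketched like this. It assumes the symbol has already been decoded from its code bit sequence and that its category is known; read_bits is a hypothetical callable that returns the next n bits of the DOCID bit stream field as an unsigned integer, with the bit order left to the rest of the specification.

```python
def decode_doc_id_delta(symbol, base, threshold, read_bits):
    """Return the DocIDDelta value represented by a decoded symbol.

    base and threshold are the Base symbol value and the DocIDDelta value
    threshold of the category that contains the symbol.
    """
    offset = symbol - base
    if offset < threshold:
        # Non-special symbol: the value is implied by the symbol itself.
        return offset
    if offset == threshold:
        # First special symbol: the value follows in 16 bits.
        return read_bits(16)
    # Any other special symbol: the value follows in 32 bits.
    return read_bits(32)
```

For example, with a Base symbol value of 130 and a threshold of 128, symbol 137 decodes directly to a DocIDDelta of 7, while symbol 258 causes 16 more bits to be read.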

The format of a SymbolCategory structure is as follows.


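 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Number of symbols                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  DOCIDDelta value threshold                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        BitsUsed value                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Base symbol value                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+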

Number of symbols (4 bytes): The number of symbols in this category. This value MUST be equal to 0x00000082.

DOCIDDelta value threshold (4 bytes): DocIDDelta values greater than or equal to this threshold are replaced with a special symbol. This value MUST be equal to 0x00000080.

BitsUsed value (4 bytes): The number of bits used to record the corresponding element in the OccCount bit stream array of the ExtensionDataPage. If this value is zero, the element is not stored in the array and its value is the same as the value for the previous document identifier.

Base symbol value (4 bytes): The Base symbol value of this category. This value MUST be equal to the Base symbol value of the previous category plus the Number of symbols field of the previous category, or zero for the first category.
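A reader for this 16-byte structure could look like the following sketch. The field order follows the layout above; the little-endian byte order is an assumption made for this example, since this section does not state it, and the class, function, and parameter names are hypothetical.

```python
import struct
from dataclasses import dataclass

# Four unsigned 32-bit fields. ASSUMPTION: little-endian byte order.
_LAYOUT = struct.Struct("<4I")


@dataclass(frozen=True)
class SymbolCategory:
    num_symbols: int
    doc_id_delta_threshold: int
    bits_used: int
    base_symbol: int


def parse_symbol_category(buf, offset=0, expected_base=0):
    """Parse one SymbolCategory and check the MUST constraints.

    expected_base is 0 for the first category; for later categories the
    caller passes the previous base_symbol plus the previous num_symbols.
    """
    cat = SymbolCategory(*_LAYOUT.unpack_from(buf, offset))
    if cat.num_symbols != 0x00000082:
        raise ValueError("Number of symbols MUST be 0x00000082")
    if cat.doc_id_delta_threshold != 0x00000080:
        raise ValueError("DOCIDDelta value threshold MUST be 0x00000080")
    if cat.base_symbol != expected_base:
        raise ValueError("unexpected Base symbol value")
    return cat
```

When reading consecutive categories, the caller would advance offset by 16 bytes and pass cat.base_symbol + cat.num_symbols as expected_base for the next call.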