Succinct data structures

Sorry for the long hiatus, everyone. Today I'm going to talk about succinct data structures, which, informally, are data structures that use very close to the absolute minimum possible space. This material is largely based on lectures by MIT professor Erik Demaine, as transcribed by 6.897 students Wayland Ni and Gautam Jayaraman (link). Also see additional notes from this year by students Yogo Zhou and Nick Harvey.

One big problem with classical data structures that is too often overlooked is how horribly inefficient their space usage is. For example, suppose you have a linked list of characters. Each node contains a character and a pointer; on a 32-bit platform, each pointer takes 4 bytes, so that's already 5n bytes for n elements. Additionally, with most standard allocators the 3 bytes following the character are unlikely to be used; they are normally lost to alignment padding. On top of this, unless some kind of memory pool is created for the nodes, most simple allocators will reserve 4 to 12 bytes of space per node for allocation metadata. We end up using something like 16 times as much space as the actual data itself requires. Similar problems appear in many simple pointer-based data structures, including binary search trees, adjacency lists, red-black trees, and so on, but the problem isn't limited to these: dynamic arrays of the sort implemented by Java and .NET's ArrayList classes can also waste linear space on reserved space for growing the array which may never be used.
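
(To put numbers on that, taking the middle of the allocator range above: 1 byte of character data + 3 bytes of padding + 4 bytes of pointer + roughly 8 bytes of allocation metadata comes to around 16 bytes per node to store a single byte of actual data.)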

A natural explanation for this state of affairs would be to say that all of this extra data is necessary to facilitate fast operations, such as insertion, removal, and search, but in fact this isn't true: there are data structures which can perform many of the same operations using very close to the minimum possible space. What is the minimum possible space? For a list of data, it's simply the space taken by a static array, which is just the space taken by the data itself. But what about, say, a binary tree, or a set of integers?

This is a little trickier, but the idea is to use combinatorics to count the number of possible values of size n, and then take the logarithm of this. Because we need a distinct representation for each value (otherwise we couldn't tell which one we have), this logarithm is a lower bound on the number of bits needed to represent a value in the worst case. For example, suppose we have a binary tree with n nodes; we are ignoring data for now and just assuming each node has a left and right child reference. How many bits do we need to store this tree? Well, the number of rooted, ordered, binary trees with n nodes is described by Cn, the nth Catalan number, which can be defined by this formula:

Cn = (2n)! / ((n + 1)! n!)
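
For example, C3 = 6! / (4! 3!) = 720 / 144 = 5, matching the five distinct shapes a 3-node binary tree can take, so any encoding must spend at least lg 5 ≈ 2.32 bits on a 3-node tree.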

Now, Stirling's approximation tells us that lg(n!) is n lg n − n lg e + o(n), where lg is the log base 2. When we take the logarithm of Cn, the linear terms cancel, so the logarithm of Cn is asymptotically approximated by:

lg Cn ≈ 2n lg(2n) − (n + 1) lg(n + 1) − n lg(n)
lg Cn ≈ 2n (lg 2 + lg n) − 2n lg(n)      [since (n + 1) lg(n + 1) + n lg(n) is 2n lg(n) + o(n)]
lg Cn ≈ 2n                               [since lg 2 = 1]

In other words, we have to use at least 2 bits per node - this is the best we can possibly do. Surprisingly, there is a very simple encoding that does so. We traverse the tree in preorder, outputting 1 for each node and outputting 0 whenever we hit a null. Here's some C# code which implements the conversion back and forth between this compact bit string and a full in-memory tree structure, preserving both the tree's exact shape and its data:

        // These snippets assume "using System.Collections;" for BitArray and ArrayList.

        class BinaryTreeNode {
            public object data;
            public BinaryTreeNode left, right;
        }

        // Appends the tree's preorder shape to 'bits' (1 per node, 0 per null) and
        // its payloads, in the same preorder, to 'data'.
        static void EncodeSuccinct(BinaryTreeNode node, ref BitArray bits, ref ArrayList data) {
            bits.Length++;
            bits[bits.Length - 1] = (node != null);
            if (node != null) {
                data.Add(node.data);
                EncodeSuccinct(node.left, ref bits, ref data);
                EncodeSuccinct(node.right, ref bits, ref data);
            }
        }

        // Rebuilds an in-memory tree from the bit string and data array produced
        // by EncodeSuccinct.
        static BinaryTreeNode DecodeSuccinct(BitArray bits, ArrayList data) {
            int bitsIndex = -1, dataIndex = -1;
            return DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
        }

        static BinaryTreeNode DecodeSuccinctHelper(BitArray bits, ref int bitsIndex,
                                                   ArrayList data, ref int dataIndex) {
            BinaryTreeNode result;
            bitsIndex++;
            if (bits[bitsIndex]) {
                // A 1 bit means a node is present here: consume its data, then
                // recursively decode its left and right subtrees.
                result = new BinaryTreeNode();
                dataIndex++;
                result.data  = data[dataIndex];
                result.left  = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
                result.right = DecodeSuccinctHelper(bits, ref bitsIndex, data, ref dataIndex);
            } else {
                result = null;
            }
            return result;
        }

The original representation uses about 3 times as much space as the data, while the succinct representation uses only 2n + 1 extra bits, an overhead of 6% on a 32-bit machine. There is some constant overhead, but because this is asymptotically less than n, the data structure is still considered succinct. A number of applications are evident: trees can be compactly stored and transmitted across networks in this form, embedded and portable devices can incorporate pre-built tree data structures in the application's image for fast startup, and so on.

On the other hand, it would seem that you can't really do much with trees in this form - in limited memory environments, you might not be able to expand the tree back out to its full form. An alternative encoding with the same space requirement is to generate the bit string by traversing the tree breadth-first (in level order) instead of depth-first. If we do this, the resulting bit string has some useful properties. Define the Rank and Select functions as:

Rank(i) = the number of 1 bits at positions 1 through i of the bit string
Select(i) = the position of the ith 1 bit in the bit string

The following can be proven: if we number the bits of the breadth-first bit string starting from 1 (so the root is bit 1), the children of the node represented by bit i are always represented by bits 2(Rank(i)) and 2(Rank(i)) + 1. The parent of the node at position i is always located at position Select(i/2) (integer division). Finally, the data for the node at position i is located at index Rank(i) in the data array. Using these lemmas, along with an efficient implementation of Rank and Select, we can traverse the tree and perform all the usual operations on it directly in this compact form. I won't get into implementing Rank and Select here, but suffice it to say that by adding only o(n) additional bits (again, asymptotically less than n) they can be answered in constant time; if the hardware supports a find-first-zero and/or population (bit-count) operation, the implementation can be quite efficient.
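
To make this concrete, here's a rough C# sketch of this navigation over the breadth-first bit string, reusing BitArray and ArrayList from the earlier snippets. The Rank and Select below are naive O(n) loops purely for illustration (a real succinct structure would use the constant-time versions just described), bit positions are numbered from 1 as in the lemmas above, and the helper names are just made up for this example:

        // Naive O(n) Rank and Select, for illustration only; real implementations
        // answer both queries in O(1) time using small auxiliary tables.
        // Positions are 1-based: bit 1 of the level-order string is the root.
        static int Rank(BitArray bits, int i) {
            int count = 0;
            for (int j = 1; j <= i; j++)
                if (bits[j - 1]) count++;
            return count;
        }

        static int Select(BitArray bits, int i) {
            int count = 0;
            for (int j = 1; j <= bits.Length; j++)
                if (bits[j - 1] && ++count == i) return j;
            return -1;  // there are fewer than i one bits
        }

        // Navigation using the lemmas above; a position whose bit is 0 is a null
        // (no node there). The data array is assumed to hold payloads in level order.
        static int LeftChild(BitArray bits, int i)  { return 2 * Rank(bits, i); }
        static int RightChild(BitArray bits, int i) { return 2 * Rank(bits, i) + 1; }
        static int Parent(BitArray bits, int i)     { return Select(bits, i / 2); }
        static object NodeData(BitArray bits, ArrayList data, int i) {
            return data[Rank(bits, i) - 1];  // the data ArrayList is 0-based
        }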

This same idea can be applied to all sorts of data structures: for example, there are m choose n (that is, m!/(n!(m-n)!)) different sets of n integers between 1 and m, and there are data structures which store such a set using a number of bits close to the logarithm of this quantity. The number of partitions of a set of n objects is given by the nth Bell number, and there are data structures for set partitions using close to the logarithm of these numbers in bits. A variety of papers address succinct graph representations, and there are even schemes for representing graphs with specific properties that use no more bits than the logarithm of the number of graphs having those properties.
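
To put a rough figure on the first of these: a set of n = 1,000 integers drawn from 1 to m = 1,000,000 can in principle be stored in about lg(m choose n) ≈ 11,400 bits, whereas a plain array of 1,000 32-bit integers occupies 32,000 bits.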

One of the most interesting succinct data structures, introduced by Franceschini and Grossi in their 2003 paper Optimal Worst-Case Operations for Implicit Cache-Oblivious Search Trees, is a search tree which permits all the usual insert, delete, and search operations in O(log n) time, but with only constant space overhead! It stores the data in an array, and nearly all the information it needs for searching is contained in the order of the elements alone (similar to the heap used in heapsort, but with different operations). On top of this, the structure is cache-oblivious, meaning that it maximally utilizes every level of the memory hierarchy (within a constant factor) without any machine-specific tuning (I'll talk more about cache-oblivious data structures another time). I haven't read this paper yet, but for those who believe data structures are a "solved problem", this kind of result makes it clear that there are many wonders yet to behold.

Finally, I wanted to come back to the problem with ArrayList. Although the simple array-growing scheme used by typical dynamic arrays can waste linear space (sometimes as much unused space is reserved as is used for data), there are alternative dynamic array data structures which remain efficient but waste only O(sqrt(n)) space, which is asymptotically optimal (there is a matching sqrt(n) lower bound). These are described in Brodnik et al.'s Resizable Arrays in Optimal Time and Space.
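
To illustrate just the space bound, here's a simplified C# sketch of the general blocks-of-growing-size idea; this is not the structure from the paper (which is more careful and also supports shrinking with O(1) worst-case operations), the class and member names are invented for this example, and it assumes using System and System.Collections. Elements live in blocks where block k holds k + 1 items, so both the directory of block pointers and the single partially filled block are only O(sqrt(n)) in size:

        // Simplified idea only: block k holds k + 1 elements, so b blocks hold about
        // b^2/2 elements. The directory of block pointers has O(sqrt(n)) entries, and
        // only the last block can contain unused slots, so at most O(sqrt(n)) space
        // is reserved but unused.
        class SqrtSpaceArray {
            private ArrayList blocks = new ArrayList();  // each entry is an object[]
            private int count = 0;

            public int Count { get { return count; } }

            public void Add(object item) {
                int block, offset;
                Locate(count, out block, out offset);
                if (block == blocks.Count)
                    blocks.Add(new object[block + 1]);   // allocate the next block lazily
                ((object[])blocks[block])[offset] = item;
                count++;
            }

            public object this[int i] {
                get {
                    int block, offset;
                    Locate(i, out block, out offset);
                    return ((object[])blocks[block])[offset];
                }
            }

            // Block k (0-based) starts at flat index k(k+1)/2, so the block holding
            // index i is floor((sqrt(8i + 1) - 1) / 2).
            private static void Locate(int i, out int block, out int offset) {
                block = (int)((Math.Sqrt(8.0 * i + 1) - 1) / 2);
                offset = i - block * (block + 1) / 2;
                // guard against floating-point rounding at block boundaries
                if (offset > block)  { block++; offset = i - block * (block + 1) / 2; }
                else if (offset < 0) { block--; offset = i - block * (block + 1) / 2; }
            }
        }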

Thanks for reading, everyone. Please ask any questions that you have, and don't hesitate to e-mail me if you want to.