Literate programming

Literate programming, invented in the early 1980s by the same Donald Knuth who wrote The Art of Computer Programming and the document language TeX, is a technique in which a program is written as a human-oriented document interspersing discussion and code. The code segments are arranged not according to execution order or the logical structure of the code, but in whatever order the author believes offers the best presentation to the reader. A tool can then automatically extract the code segments, put them back in the order expected by the compiler, and export them as a source file. Although literate programming never really caught on for ordinary coding, it can make code dramatically easier for a reader to understand, which eases maintenance and transfer of ownership and helps prevent future bugs.

The idea is perhaps easiest to understand if we look at it not from the perspective of a coder, but from the perspective of a researcher writing a book. They write a section about an algorithm, showing first the core algorithm, then talking about issues, improvements, and variations. In each section they show a little more code. They might show a test driver to demonstrate how the algorithm is invoked. Taken to the logical extreme, these code samples could collectively show every part of the program: all someone would need to do to run it is copy the samples into a source file in the right order and build it. If we automate this extraction process, we now have a section of a book that is simultaneously a real program that can be run.

Some people find the idea of structuring a program in this seemingly random "presentation order" strange, but really we do it all the time. Suppose you arrive on a new team and take control of a module from someone else. At some point, you sit down with the previous owner, who proceeds to describe how the module works. If they do this effectively, they don't take an alphabetical file listing and go through each file in order - instead, they first focus on core parts of core functionality and expand outward to more peripheral functionality. You might ask a question about a piece of code you see, and they say "Don't worry about that - I'll get to that later when I talk about X." By controlling the presentation order, they enhance your comprehension by keeping code out of the way until you have the proper background to understand its purpose.

But what if this previous owner is long gone? They could have left all the comments in the world, but you still won't have anything like the discussion above. Literate programming enables them to write a tutorial document that takes you through the code in the same friendly way as a sit-down with the owner, and yet never involves redundant effort or becomes out-of-date, because it is the code as well.

An example

But there's no better way to understand literate programming than to see some. So, right here, I'm going to discuss an implementation in C# of Bloom filters, the topic of my very first post (read that first if you haven't already). This implementation is based on a paper by a former professor of mine, Panagiotis Manolios, and one of his students, Peter Dillinger, called Bloom Filters in Probabilistic Verification. This paper solves a frustrating implementation issue: Bloom filters require a potentially unlimited number of independent hash functions, and there's no convenient way of coming up with these hash functions on the fly. Their idea is to use only two independent hash functions to generate all of the hash values required. The insertion method below, based on Algorithm 2 in their paper, demonstrates this:

 <<BloomFilter Insert method>>=
    public void Insert(TKey obj)
    {
        int h1 = obj.GetHashCode();
        int h2 = obj.GetHashCode2();
        unchecked
        {
            h1 = (int)(((uint)h1) % table.Count);
            h2 = (int)(((uint)h2) % table.Count);
        }
        <<perform insertion>>
    }

The syntax <<perform insertion>>, used by the noweb tool, indicates a chunk of code that I haven't talked about yet. The label at the top is the name of this chunk of code. We use unchecked arithmetic for the casts because the hash code methods can return negative values, which the cast to uint converts to large positive values. The narrowing cast back to int is safe because table.Count is an int.
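
A quick illustration of that conversion, with made-up values:

    int h = -2;
    uint u = (uint)h;            // 4294967294: same bits, now a large positive value
    int index = (int)(u % 100);  // 94, a valid table index for a 100-element table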

Above, TKey is a generic type parameter on the class specifying the type of values stored in the Bloom filter. Keys must implement the IDoubleHashable interface, which adds a second hash function GetHashCode2() independent of the standard GetHashCode() supplied by Object:

 <<IDoubleHashable definition>>=
    public interface IDoubleHashable
    {
        int GetHashCode2();
    }
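
As an aside (this is an illustration of mine, not part of the tangled program), a string key might satisfy the interface like this; the FNV-1a-style second hash is an arbitrary choice that is independent in practice of the framework's string GetHashCode:

    public class DoubleHashableString : IDoubleHashable
    {
        private readonly string value;

        public DoubleHashableString(string value) { this.value = value; }

        // First hash: defer to the framework's string hash.
        public override int GetHashCode() { return value.GetHashCode(); }

        // Second hash: FNV-1a over the characters, an illustrative choice.
        public int GetHashCode2()
        {
            unchecked
            {
                uint hash = 2166136261u;
                foreach (char c in value)
                {
                    hash = (hash ^ c) * 16777619u;
                }
                return (int)hash;
            }
        }
    }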

The class definition ties these together, constraining TKey to this interface:

 <<BloomFilter class definition>>=
public class BloomFilter<TKey> where TKey : IDoubleHashable
{
    <<BloomFilter implementation>>
}

We store the table of the BloomFilter in a standard BitArray collection, which packs the bits compactly and handles all the bit arithmetic for us:

 <<BloomFilter member variables>>+=
        private BitArray table;

This is in the System.Collections namespace, so we need a using statement:

 <<BloomFilter using statements>>+=
using System.Collections;

The noweb syntax += above adds a piece of code to an existing section. This is useful for sections we wish to describe one part at a time, or for elements like the above whose order is unimportant.

Now, we're prepared to describe the insertion. We set a number of roughly evenly spaced bits, where the first hash establishes the initial index, the second hash establishes the spacing, and the spacing is slightly increased as we go along. All arithmetic is done modulo the table size:

 <<perform insertion>>=
        for(int i=0; i<numHashes; i++) 
        {
            table[h1] = true;
            unchecked
            {
                h1 = (int)((uint)(h1 + h2) % table.Count);
                h2 = (int)((uint)(h2 + i) % table.Count);
            }
        }
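
To make the probing pattern concrete, here is a hand trace with made-up values (table.Count = 16, initial h1 = 3, h2 = 5, numHashes = 4):

    // i = 0: set bit 3;  then h1 = (3 + 5)  % 16 = 8,  h2 = (5 + 0) % 16 = 5
    // i = 1: set bit 8;  then h1 = (8 + 5)  % 16 = 13, h2 = (5 + 1) % 16 = 6
    // i = 2: set bit 13; then h1 = (13 + 6) % 16 = 3,  h2 = (6 + 2) % 16 = 8
    // i = 3: set bit 3 again (collisions like this are expected and harmless)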

The number of bits we set is determined by the fixed parameter numHashes. We'll describe how it's chosen a bit later on:

 <<BloomFilter member variables>>+=
        private int numHashes;

The lookup procedure is analogous, checking all the same bits and returning true only if they're all set:

 <<BloomFilter Contains method>>=
        public bool Contains(TKey obj)
        {
            int h1 = obj.GetHashCode();
            int h2 = obj.GetHashCode2();
            unchecked
            {
                h1 = (int)(((uint)h1) % table.Count);
                h2 = (int)(((uint)h2) % table.Count);
            }
            for (int i = 0; i < numHashes; i++) 
            {
                if (table[h1] == false)
                {
                    return false;
                }
                unchecked
                {
                    h1 = (int)((uint)(h1 + h2) % table.Count);
                    h2 = (int)((uint)(h2 + i) % table.Count);
                }
            }
            return true;
        }

The probability that this method will return true for a random value not in the set is given by:

    p = (1 − e^(−kn/m))^k

where the variables are defined as:

  • k: Called numHashes in the code, the number of table elements set by an insertion
  • m: Called table.Count in the code, the number of elements in the table
  • n: The number of distinct objects inserted so far
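
As a sanity check, we can evaluate this formula directly; this helper is mine, for illustration only, and is not part of the class:

    // Hypothetical helper evaluating the false positive formula above.
    static double FalsePositiveRate(int k, int m, int n)
    {
        return Math.Pow(1.0 - Math.Exp(-(double)k * n / m), k);
    }
    // For example, k = 7, m = 9586, n = 1000 gives p ≈ 0.01.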

To initialize the object, we use the following method, which needs to know k and m:

 <<BloomFilter initialization>>+=
        private void Initialize(int tableSize, int numHashes)
        {
            this.numHashes = numHashes;
            table = new BitArray(tableSize, false);
        }

However, a user is unlikely to know the best values for these (which is why this isn't a constructor). Instead, we address two common use cases, with a usage sketch following the list:

  1. In this case, the user knows m, the desired table size in bits, and n, an estimate of how many elements they will insert. We choose k in a way that minimizes the false positive probability p, using the value (ln 2)m/n rounded to the nearest integer:

     <<BloomFilter initialization>>+=
            public BloomFilter(int tableSize, int numElements)
            {
                Initialize(tableSize, (int)Math.Round(ln2*tableSize/numElements));
            }
    
  2. In this case, the user knows n, an estimate of how many elements they will insert, and p, the desired false positive probability. Bloom filter theory states that if k is chosen optimally (using the formula in part 1 above), we will have p = α^(m/n), where the constant α = (1/2)^(ln 2) ≈ 0.6185. This means we must have m = (log_α p)·n. I call α optimalRateBase below:

     <<BloomFilter initialization>>+=
        public BloomFilter(int numElements, double falsePositiveProb)
        {
            int tableSize = (int)Math.Ceiling(numElements * Math.Log(falsePositiveProb, optimalRateBase));
            Initialize(tableSize, (int)Math.Round(ln2 * tableSize / numElements));
        }
    

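To make the two cases concrete, here is the usage sketch promised above (DoubleHashableString is the illustrative wrapper from earlier, not part of the tangled sources, and the numbers are made up):

    // Case 1: we know the table size (9586 bits) and expect about 1000 insertions;
    // the constructor picks k = round(ln 2 * 9586 / 1000) = 7.
    var filter1 = new BloomFilter<DoubleHashableString>(9586, 1000);

    // Case 2: we expect about 1000 insertions and want p = 0.01;
    // the constructor derives m = 9586 and from it k = 7, as above.
    var filter2 = new BloomFilter<DoubleHashableString>(1000, 0.01);

    filter2.Insert(new DoubleHashableString("hello"));
    bool hit = filter2.Contains(new DoubleHashableString("hello"));  // always true
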
The constants used above are defined at the class level (we cannot use const because the Math calls are considered non-constant by the compiler):

 <<BloomFilter constants>>+=
    private readonly double ln2 = Math.Log(2.0);
    private readonly double optimalRateBase = 1.0/Math.Pow(2, Math.Log(2.0));

We need the System namespace for Math:

 <<BloomFilter using statements>>+=
using System;

Finally, we put it all together to complete the implementation of BloomFilter:

 <<BloomFilter implementation>>=
<<BloomFilter constants>>
<<BloomFilter initialization>>
<<BloomFilter member variables>>
<<BloomFilter Insert method>>
<<BloomFilter Contains method>>

We structure the program as two source files, with the interface in its own file, and both entities in the BloomFilters namespace:

 <<BloomFilter.cs>>=
<<BloomFilter using statements>>
namespace BloomFilters {
<<BloomFilter class definition>>
}

<<IDoubleHashable.cs>>=
namespace BloomFilters {
<<IDoubleHashable definition>>
}

Our implementation is now complete. My full version of this code includes double-hashable classes for integers and strings and a test driver, which I omit here for space. With a bit of added syntax, I could run this entire entry through the literate programming tool noweb, and it would produce both an HTML document much like the one you're reading now, along with two C# source files which I can then build into an assembly and use.
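
For the curious, the extraction step would look something like this (assuming the entry were saved as a noweb file named, say, bloom.nw):

    notangle -R"BloomFilter.cs" bloom.nw > BloomFilter.cs
    notangle -R"IDoubleHashable.cs" bloom.nw > IDoubleHashable.cs
    noweave -html bloom.nw > bloom.html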

More information

For a longer, fancier example of literate programming, take a look at my Select implementation, which uses the literate programming tool noweb to write an implementation of the classical worst-case linear time selection algorithm using C++ and the document language LaTeX. Here we gain the additional advantage of a full-fledged document language, complete with the ability to typeset mathematical expressions and diagrams, which is impossible in ordinary source code.

Sometimes schemes such as Perl's POD, Javadoc, or C#'s XML comments are touted as being literate programming, but this is far from the case. These are simply isolated documentation of individual logical program units. While they are valuable and have little overhead, they do not present the program; they only document parts of it in an isolated way.
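
For contrast, here is a typical C# XML comment; it documents one method thoroughly, but nothing about it narrates or orders the program as a whole:

    /// <summary>
    /// Inserts the given key into the Bloom filter.
    /// </summary>
    /// <param name="obj">The key to insert.</param>
    public void Insert(TKey obj) { /* ... */ }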

So when should you use literate programming? There is certainly overhead involved, but wherever the ability to maintain a portion of a program is a critical factor, where there is a high risk of an owner disappearing, or where code will be exposed to many external partners or customers, literate programming is a valuable tool for keeping the code accessible and easy to understand. Try rewriting a single confusing or troublesome portion of your program with it, update your build system to run the literate programming tools before compilation, and see if it makes a difference in your application.