July 2012

Volume 27 Number 07

The Working Programmer - The Science of Computers

By Ted Neward | July 2012

Ted NewardFor many years, I’ve heard colleagues describe themselves as “computer scientists.” I’ve also heard it said that, “Any field of study that has to put ‘science’ in the name isn’t really a science.” Certainly a fairly large number of people in my field have had no formal training in computer science, and another fairly large number look with disdain at the work coming out of universities and collegiate professorial offices—and with some justification. When’s the last time a slew of Greek symbols purporting to describe a type system helped you get that line-of-business app up and running?

I know a number of ridiculously smart people, and one of them is Joe Hummel, someone who can legitimately claim the title computer scientist, owing to the Ph.D. behind his name and all. I asked him to join me in writing this column to explore the science of computing. I thought it would be interesting to start with some of the simpler parts of computer science and work through how understanding the theory behind some of those parts can, in fact, make your life as a working programmer just a little bit easier.

For this installment, we tackle the classic “Big O” discussion, which seeks to give us a consistent way to describe a simple concept: How many resources (time, memory, power and so on) will the computer need to execute a given program? This determination is referred to as algorithm analysis, with the goal of placing your chosen algorithm into a particular mathematical category. Why? So you can select the best one for a situation. There are many categories, with each successive category denoting a jump in resource consumption. We’re going to focus on execution time, and just the first three categories: O(1), O(lgN) and O(N). As is typical, N denotes the number of data items processed by the algorithm. For an executive summary of where we’re going with all this, see Figure 1, which highlights the execution time you can expect for each category as N grows.

Execution Times as the Number of Items Processed (N) Grows
Figure 1 Execution Times as the Number of Items Processed (N) Grows

An order N algorithm is written O(N), pronounced “Big O of N” and also known as “linear.” Given N data items, a O(N) algorithm takes on the order of N time steps. An order log N algorithm is written O(lgN), pronounced “Big O of log N” and referred to as “logarithmic.” Given N data items, a O(lgN) algorithm takes on the order of log2(N) time steps (many fewer). Finally, an order 1 algorithm is written O(1), pronounced “Big O of 1” and referred to as “constant.” For N data items, a O(1) algorithm takes a constant number of time steps (fewer still). Take a moment to revisit Figure1. What else might this graph convey? For example, notice that when N doubles, a linear algorithm takes twice as much time.

Let’s look at an example of how this applies in real life, and where knowing just a little about algorithm analysis can make a huge impact. Suppose you’re given a log file to search for matches against a list of suspicious IP addresses. Simple enough:

List<string> suspicious = BuildList(); /* step 0 */
foreach (string line in logfile)
  string ipaddr = parse(line);         /* step 1 */
  if (suspicious.Contains(ipaddr))     /* step 2 */
    matches++;                         /* step 3 */

How do we categorize the time complexity of this approach? Let’s assume there are M lines in the log file and N suspicious IP addresses. Building an unsorted list of N elements (step 0) will incur a one-time cost of O(N). The loop will perform M iterations, so the cost of the loop is O(M * cost of one iteration). Each iteration performs three basic steps:

  1. Parses the line
  2. Searches the list
  3. Increments matches if a match is found

Getting Analytical with It

The intellectual aspect of algorithm analysis is learning what to count and what not to count. For example, while parsing a line—perhaps applying a regular expression or splitting the line into tokens—might be expensive, it takes a constant amount of time relative to the data. In other words, the time required to parse a line remains entirely unchanged, whether there are M or 2M lines in the file. Nor does it change whether there are N or 2N suspicious IP addresses. Thus, we say steps 1 and 3 take c (constant) time. However, step 2, which searches the list, does change relative to the number of suspicious IP addresses:

if (suspicious.Contains(ipaddr))

The Contains method performs a linear search, starting from the first element and searching until either a match is found or the end of the list is reached. (How could we know this? One of three ways: we tested it, we looked at the source code or the MSDN  Library documentation tells us.)

With random data, on average we’ll search half the list, requiring N/2 steps. This would hold for most cases, but in this particular case, a stronger argument is that most IP addresses in the log file are trustworthy, because our Web server isn’t a bank or a presidential candidate’s Web site. This means that rather than there being an even chance that a given client is untrustworthy, the vast majority of searches will exhaust the list without finding a match, requiring N steps.

Because constants are dropped when computing the order of things, the search is O(N) in either case, yielding a complexity of O(M * N) for the loop. In turn, this yields an overall algorithmic complexity of:

= O(build + loop)
= O(N + M * N)
= O( (1 + M) * N )
= O(M * N)

Thus, version 1 of our solution requires O(MN) time.

‘Data, Data, Data! I Cannot Make Bricks Without Clay!’

The first question to ask yourself is: “Do we need a better algorithm?” M is typically large, in the millions (possibly billions) of lines. There’s not much you can do about that, because you have to touch each line to search it. If N, the number of suspicious IP addresses, is small (say, N < 100), then your work is done. With today’s fast machines, there’s little measurable difference between M time steps and 100M time steps. However, as N grows larger (for example, N >> 100, or significantly larger than 100), it’s worthwhile to consider other search algorithms.

Unless you have a really big hardware budget, of course. (Sometimes, even then.)

A more efficient algorithm is binary search, which jumps to the middle and then searches left or right depending on whether the element you’re looking for is less than the middle element or greater. By cutting the search space in half each time, the algorithm exhibits O(lgN) behavior. What does this mean in practice? Given N=1,000 elements, at worst binary search takes on the order of 10 time steps (210 ≈ 1,000), while linear search would take 1,000. For N = 1 million, the difference is even more dramatic—on the order of 20 time steps versus 1 million. The key trade-off is that binary search requires the data to be in sorted order. Here’s the log file search program rewritten to use binary search:

List<string> suspicious = BuildList();      /* step 0 */
suspicious.Sort();                          /* step 0.5 */
foreach (string line in logfile)
  string ipaddr = parse(line);               /* step 1 */
  if (suspicious.BinarySearch(ipaddr) >= 0)  /* step 2 */
    matches++;                               /* step 3 */

Sorting the suspicious list (step 0.5) represents a one-time cost of O(NlgN), while the cost per iteration is reduced from O(N) to O(lgN) given the more efficient binary search (step 2). This yields an overall algorithmic complexity of:

= O(build + sort + loop)
= O(N + N * lgN + M * lgN)
= O(N * (1 + lgN) + M * lgN)
= O(N * lgN + M * lgN)
= O((N + M) * lgN)

In our scenario, where M >> N, the N acts more like a small constant when added to M, and so version 2 of our solution requires approximately O(MlgN) time. So? For N = 1,000 suspicious IP addresses, we just reduced the execution time from an order of 1,000M time steps to 10M, which is 100x faster. The additional sort time of roughly 10,000 time steps (O(NlgN)) is more than offset by the repeated savings in search time.

Let’s Go Hashing

When you know something about the data you’re searching, you can (and should) build a hash table. Hashing reduces the average search time to O(1)—and you can’t do any better than that. It does this by creating a mapping of a single value to the elements stored in the table, where that value is produced by a hash function. The key requirement for hashing is the existence of a good hash function that maps the search data across a range of integers, with few collisions (duplicate mappings). Our revised log file program (version 3) takes advantage of built-in support for effective string hashing in the Microsoft .NET Framework:

Dictionary<string, int> suspicious = BuildHashTable();  /* step 0 */
foreach (string line in logfile)
  string ipaddr = parse(line);                          /* step 1 */
  if (suspicious.ContainsKey(ipaddr))                   /* step 2 */
    matches++;                                          /* step 3 */

Building the hash table (step 0) incurs a one-time, up-front cost of O(N). This yields an overall algorithmic complexity of:

= O(build + loop)
= O(N + M * 1)
= O(N + M)

Assuming our scenario of M >> N, version 3 is our fastest solution, at roughly O(M) time. For N = 1,000 suspicious IP addresses, we just reduced the execution time by another factor of 10, which is 1,000x faster than our original version. And what’s remarkable about this approach is that for even larger values of N—say, 10,000, or even 100,000—the execution time is the same, on the order of M time steps.

So what’s the catch? With version 2 and binary search, it’s simple: The list of IP addresses must first be sorted. In this version with hashing, there are three issues to consider: creating an effective hash function, the up-front construction cost and a larger memory footprint. The key is the hash function. An effective hash function spreads the data across the table with a minimum of collisions, eliminating the need for additional searching after the initial hash into the table. Such a function offers an up-front construction cost of O(1) per datum, or O(N), the same as linear search. However, such effectiveness typically involves a load factor in the neighborhood of 10 percent, which means the data occupies only 10 percent of the table. Hashing thus requires a memory footprint 10x larger than linear and binary search, though formally this is still O(N) like the others.

Why Are We Doing This, Again?

After all that pseudo-math, it might strike the non-computer science graduate that we did a lot of theoretical manipulation, and while we think we found an efficient solution, we sure dropped a lot of constants along the way. And this leads us to the million-­dollar question: “Does our theoretical analysis really pay off in practice?” A fair question, especially when you consider the ultimate goal of this article—to encourage you to use algorithm analysis (if you aren’t already) to evaluate competing approaches before committing to code.

Many times I’ve sat around a table watching engineers argue for hours—even days, sometimes—over whether approach A is better than approach B. Rarely do folks pull out a napkin and perform a quick algorithmic analysis.

So, now that we’ve done the analysis, let’s run some tests and see what happens in practice. To do that, we generated random log files with M entries and then searched these files against randomly generated lists of N IP addresses. (Complete source and input files are available at bit.ly/srchlogfiles.)

To recap, assuming M >> N, we derived the algorithmic complexities shown in Figure 2.

Figure 2 Algorithmic Complexities AssumingM >> N

  Time (t)
Version 1: Using Linear Search O(M * N)
Version 2: Using Binary Search O(M * lgN)
Version 3: Using Hashing O(M)

For our tests, we set M at 1 million, and double N for each test: 128, 256, 512 and so on. The doubling of N highlights the performance characteristics of the search algorithms. If our analysis is accurate, each time N is doubled, the execution time should change, as shown in Figure 3.

Figure 3 Predicted Change in Execution Time WhenNIs Doubled

  Time for 2N
Version 1 O(M * 2 * N) => 2 * O(M * N) => 2t
Version 2 O(M * lg(2N)) => O(M * (lg2 + lgN)) => O(M * (1 + lgN) =>    O(M + M * lgN) => M + O(M * lgN) => M + t
Version 3 O(M) => t

In other words, version 1 should double in time (2t), version 2 should take another M time steps (t + M) and version 3 should take no additional time (t). Figure 4 shows the actual recorded times (in seconds).

Figure 4 Actual Execution Times (Seconds)

M N Version 1 Version 2 Version 3
1,000,000 128 3.81 2.88 2.15
  256 5.65 2.98 2.15
  512 9.39 3.11 2.17
  1,024 16.78 3.23 2.19
  2,048 31.52 3.36 2.19
  4,096 61.18 3.49 2.28
  8,192 120.63 3.68 2.39
  16,384 238.86 3.95 2.54
  32,768 473.91 4.34 2.89

As you can see in Figure 4, version 1—based on linear search—does indeed double in time as N doubles. That’s probably a bad thing for most situations.

Version 2—based on binary search—runs orders of magnitude faster than version 1 and grows much more slowly. The growth rate is harder to quantify because it depends on the time TM to execute M operations, which appears to be anywhere from 0.1 to 0.3 seconds. According to our analysis, version 2 should be growing at a constant rate of TM as N doubles, but this doesn’t appear to be the case; this could be the result of increased cache pressure (and thus cache misses) as N grows.

Finally, Figure 4 confirms that version 3 is indeed the fastest algorithm, with essentially the same execution time regardless of N (though again not quite constant).

Seek Balance

From a pragmatist’s perspective, it’s not clear that the enhancement from version 2 to version 3 is worth anything more than a trivial amount of effort—despite the savings of more than a second. The potential savings isn’t really going to make a big difference unless we start to see really big values of M or N (keeping in mind that hashing might require 10x more memory to store the IP addresses). And this raises the last point: knowing when to stop seeking to optimize. It might be exciting and compelling for a programmer to find the absolute fastest algorithm to perform some kind of operation, but unless this operation is front-and-center in the user’s face, the time spent trying to optimize that last second away often could be much better spent working on something else. Clearly the jump from version 1 to version 2 is worth the time (that’s a savings of about 10,000 percent), but as with all things, the programmer must seek balance. Sometimes theory must give way to practice, but at other times, the theory helps dictate where we should put our efforts into practice.

Happy coding!

Ted Neward is an architectural consultant with Neudesic LLC. He has written more than 100 articles and authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He’s a C# MVP and speaks at conferences around the world. He consults and mentors regularly—reach him at ted@tedneward.com or Ted.Neward@neudesic.com if you’re interested in having him come work with your team, or read his blog at blogs.tedneward.com.

Joe Hummel is a private consultant, a member of the technical staff at Pluralsight LLC, a trainer for MVP-Training.net and a visiting researcher at the Department of Computer Science at the University of California, Irvine. He earned a Ph.D. at UC Irvine in the field of high-performance computing and is interested in all things parallel. He resides in the Chicago area, and when he isn’t sailing can be reached at joe@joehummel.net.