Anatomy of a software bug, part 1 - the NT browser

No, I don't mean that the NT browser's a software bug...

Actually Raymond's post this morning about the network neighborhood got me thinking about the NT browser and its design.  I've written about the NT browser before here, but never wrote up how the silly thing worked.  While reminiscing, I remembered a memorable bug I fixed back in the early 1990s that's worth writing up because it's a great example of how strange behaviors and subtle issues can appear in peer-to-peer distributed systems (and why they're so hard to get right).

Btw, the current design of the network neighborhood is rather different from this one - I'm describing code and architecture designed for systems 12 years ago.  There have been a huge number of improvements to the system since then, and some massive architectural redesigns.  In particular, the "computer browser" service upon which all this depends is disabled in Windows XP SP2 due to attack surface reduction.  In current versions of Windows, Explorer uses a different mechanism to view the network neighborhood (at least on my machine at work).

 

The actual original design of the NT browser came from Windows for Workgroups.  Windows for Workgroups was a peer-to-peer networking solution for Windows 3.1 (and continued to be the basis of the networking code in Windows 95).  As such, all machines in a workgroup needed to be visible to all the other machines in the workgroup.  In addition, since you might have different workgroups on your LAN, it needed to be able to enumerate all the workgroups on the LAN.

One critical aspect of WfW is that it was designed for LAN environments - it was primarily based on NetBEUI, which was a LAN protocol designed by IBM back in the 1980's.  LAN protocols typically scale quite nicely to several hundred computers, after which they start to fall apart (due to collisions, etc).  For larger networks, you need a routable protocol like IPX or TCP, but at the time, it wasn't that big a deal (we're talking about 1991 here - way before the WWW existed).

As I mentioned, WfW was a peer-to-peer product.  As such, everything about WfW had to be auto-configuring.  For Lan Manager, it was ok to designate a single machine in your domain to be the "domain controller" and others as "backup domain controllers", but for WfW, all that had to be automatic.

To achieve this, the guys who designed the protocol for the WfW browser decided on a three-tier design.  Most of the machines in the workgroup would be "potential browser servers".  Some of the machines in the workgroup would be declared as "browser servers", and one of the machines in the workgroup was the "master browser server".

Clients periodically (every three minutes) sent a datagram to the master browser server, and the master browser would record this in its server list.  If the master browser hadn't heard from a client for three announcements, it assumed that the client had been turned off and removed it from the list.  Backup browser servers would periodically (every 15 minutes) retrieve the browser list from the master browser.
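To make the bookkeeping concrete, here's a minimal sketch of that server list in Python.  It is not the real browser code: the ServerList class and its method names are made up for illustration, and reading "three announcements" as "three full announcement intervals of silence" is my assumption about the exact timeout.

```python
ANNOUNCE_INTERVAL = 3 * 60   # seconds between client announcements (from the text)
MISSED_LIMIT = 3             # missed announcements before a client is dropped


class ServerList:
    """Hypothetical model of the master browser's list of known machines."""

    def __init__(self):
        self._last_heard = {}   # machine name -> time of its last announcement

    def announce(self, name, now):
        """Record an announcement datagram from `name` at time `now` (seconds)."""
        self._last_heard[name] = now

    def expire(self, now):
        """Drop machines that have been silent for more than MISSED_LIMIT intervals.

        Returns the names that were removed.
        """
        cutoff = ANNOUNCE_INTERVAL * MISSED_LIMIT
        stale = [n for n, t in self._last_heard.items() if now - t > cutoff]
        for n in stale:
            del self._last_heard[n]
        return stale

    def names(self):
        """The list a backup browser would retrieve on its 15 minute refresh."""
        return sorted(self._last_heard)
```

A machine that announces at time 0 and then goes quiet is still in the list at the nine minute mark and gets dropped by the first expire() call after that.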

When a client wanted to browse the network, the client sent a broadcast datagram to the workgroup asking who the browser servers were on the workgroup.  One of the backup or master browser servers would respond within several seconds (randomly).  The client would then ask that browser server for its list of machines, and would display that to the user.
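One plausible reading of that random delay is that every browser server that hears the request waits a random amount of time before answering, and the client just uses whichever reply shows up first.  Here's a small Python sketch of that reading.  The pick_responder name and the three second ceiling are assumptions of mine (the text only says "several seconds"), and the real protocol may well have suppressed the later replies.

```python
import random


def pick_responder(browser_servers, max_delay=3.0, rng=random):
    """Return the browser server whose reply would reach the client first.

    Each server picks an independent random delay in [0, max_delay] seconds,
    which spreads the replies out instead of having them all hit the wire at
    once.  Returns None if no browser server answered, which is the case
    where the client would go on to force an election.
    """
    delays = {name: rng.uniform(0, max_delay) for name in browser_servers}
    if not delays:
        return None
    return min(delays, key=delays.get)
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which is handy when you're testing this kind of thing.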

If none of the browser servers responded, then the client would force an "election".  When the potential browser servers received the election datagram, they each broadcast a "vote" datagram that described their "worth".  If they saw a datagram from another server that had more "worth" than they did, they silently dropped out of the election.

A server's "worth" was based on a lot of factors - the system's uptime, the version of the software running, and its current role as a browser (backup browsers were better than potential browsers, master browsers were better than backup browsers).
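The election is easier to see in code.  What follows is a rough Python sketch, not the actual algorithm: the Candidate class, the order in which the factors are compared, and the use of the machine name as a final tie-breaker are all assumptions on my part.  The text above only says which factors went into "worth", and that longer uptime counted for more is my guess too.

```python
from dataclasses import dataclass

# Higher is better: master beats backup, backup beats potential (from the text).
ROLE_RANK = {"potential": 0, "backup": 1, "master": 2}


@dataclass(frozen=True)
class Candidate:
    name: str
    role: str        # "potential", "backup" or "master"
    version: int     # version of the browser software
    uptime: int      # seconds since boot

    def worth(self):
        # Assumed comparison order: version, then role, then uptime.
        # The name is only there so that two machines never tie.
        return (self.version, ROLE_RANK[self.role], self.uptime, self.name)


def run_election(candidates):
    """Return the one candidate that never sees a vote worth more than its own."""
    still_in = [
        c for c in candidates
        if not any(other.worth() > c.worth() for other in candidates)
    ]
    return still_in[0] if still_in else None
```

Every candidate that sees a more worthy vote drops out, so as long as the names are unique exactly one machine is left standing, and that machine becomes the master browser.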

Once the master browser was elected, it nominated some number of potential browser servers to be backup browsers.

This scheme worked pretty well - browsers tended to be stable, and the system was self-healing.

Now once we started deploying the browser in NT, we started running into problems that caused us to make some important design changes.  The biggest one related to performance.  It turns out that in a corporate environment, peer-to-peer browsing is a REALLY bad idea.  There's no way of knowing what's going on on another person's machine, and if the machine is really busy (like if it's running NT stress tests), it impacts the browsing behavior for everyone in the domain.  Since NT had the concept of domains (and designated domain controllers), we modified the election algorithm to ensure that NT server machines were "more worthy" than NT workstation machines, which solved that particular problem neatly.  We also biased the election algorithm towards NT machines in general, on the theory that NT machines were likely to be more reliable than WfW machines.
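The simplest way to picture that bias is to put the kind of machine at the front of the comparison, so it outweighs everything else.  Here's a small Python sketch under that assumption.  The OS_RANK table, the tuple layout and the elect function are all hypothetical, and the real election criteria had more to them than this.

```python
# Higher is better.  The ordering is from the text: NT servers beat NT
# workstations, and NT machines in general beat WfW machines.
OS_RANK = {"wfw": 0, "nt_workstation": 1, "nt_server": 2}


def worth(os_type, role_rank, uptime):
    """Compare the OS type first, so it dominates role and uptime (an assumption)."""
    return (OS_RANK[os_type], role_rank, uptime)


def elect(machines):
    """Pick the winner from (name, os_type, role_rank, uptime) tuples."""
    return max(machines, key=lambda m: worth(*m[1:]))[0]
```

With this ordering, a freshly booted NT server wins over an NT workstation that has been the master browser for a week, which is exactly the point: the busy desktop running stress tests no longer ends up serving the browse list for the whole domain.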

There were a LOT of other details about the NT browser that I've forgotten, but that's a really brief overview, and it's enough to understand the bug.  Btw, I'm the person who coined the term "Bowser" (as in "bowser.sys") during a design review meeting with my boss (who described it as a dog) :)

Btw, Anonymous Coward's comment on Raymond's blog is remarkably accurate, and states many of the design criteria and benefits of the architecture quite nicely.  I don't know who AC is (my first guess didn't pan out), but I suspect that person has worked with this particular piece of code :)