A Dynamic Index: Reinventing the Online Help in XML


Benjamin Guralnik

April 3, 2002

Download Dynamicindex.exe.

Requirements This download requires Microsoft Internet Explorer 5.0 or later with MSXML3.0 installed.

Editor's Note   This month's installment of Extreme XML is written by guest columnist Benjamin Guralnik. Benjamin lives in Jerusalem, Israel and is an independent developer specializing in XML/XSLT. He has been a regular contributor to the online community surrounding Microsoft XML products. Benjamin has been pioneering XML-oriented information systems since 1999 when he began working on the award-winning SAS Interactive Handbook, an innovative help aide that reorganized SAS System's native help into a compact and useful interface. In his free time Benjamin enjoys reading, playing tennis, and studying classical piano.

While earlier help systems may have worked because they contained less information and fewer topics, newer help systems are becoming remarkably counter-productive despite incorporating far more information than their predecessors did. The main reason these combined resources fail is that the growing volume of information is still managed with outdated, inadequate indexing and organization tools that are as ineffective on large sets of documents as they were successful on small ones. Thus, while the information itself is helpful, it is frequently buried under hundreds of irrelevant hits returned by the search engine or lost in a bewildering Table of Contents (TOC). Can this situation be fixed? Read on to find out what role XML can play in solving the problem.

The Legacy

Today, when standards change overnight and new product versions appear twice a year, it is easy to see why online help systems are becoming as important to applications as the executables themselves. The era of the printed manual seems to be coming to an end, yet it is surprising how little difference there is between a computerized help system and its predecessor, both in structure and in the organization of material. Except for full-text search, which surpasses the human ability to scan large amounts of material, both the Table of Contents and the Index are strict reproductions of what originated in the print medium many years ago. Given today's data-processing technologies, this raises the question, "Why haven't we taken advantage of them to revolutionize indexing?"

Table of Contents

Remember the uneasy feeling of browsing a new TOC for the very first time? No matter how complete your knowledge of the field is, or how carefully the hierarchy was arranged, you sometimes can't tell where familiar categories and subjects are hiding. Imagine, then, how much harder it is for a beginner who can barely associate the basic concepts and terms to find their way through the TOC. To get an idea of what the user experience might be like, imagine you're a COBOL developer working your way through the Microsoft JScript® 5.5 online manual shown in Figure 1.

Figure 1. A sample TOC (JScript 5.5 Help)

As you can see, this listing is of little value to a person with no background in JScript. Unless you have used these methods and remember what they do, it is difficult to know which method you want. In fact, our bewildered developer was lucky to have opened the Methods section at all, because a non-object-oriented mind would intuitively look for these methods under Functions.

A quick look behind the scenes of TOC creation reveals an even gloomier picture. The main problem is that there is no way to synchronize a TOC with the documents it is meant to represent. Edited manually just like the rest of the content, a TOC is merely the product of a private, usually quite spontaneous arrangement of titles by the coordinating help author. For this reason, keeping the TOC in sync with rapidly changing material is genuinely difficult. For example, when you remove outdated topics you might forget to delete their titles from the TOC, leaving them behind as dead links. Conversely, you may remember to remove the titles but leave the orphaned pages, which then clutter search results and misinform any user who runs across them.


Index

Failing to find anything valuable in the TOC, one can hope for better luck with the index. At least you won't get lost in a complex hierarchy there, because the task is simple: sheer brainstorming on the topic. The problem, once again, is that it requires the user to be acquainted with the subject's terminology. And even when the user is, most indexes are fed such an enormous amount of terminology that there is usually little chance of finding the right topic: either the index doesn't contain the keyword the user is looking for, or it offers a dozen obscure terms the user would never think to consider.

Full-Text Search

In real life, one would rarely think of skimming ten books from cover to cover to find an occurrence of a specific word. With the migration of printed manuals to digital form, however, we have learned to think this way, and we do it quite a lot. The result? Usually nothing but hundreds of hits sorted by their relevancy (or perhaps irrelevancy, who knows?). While it is true that XML enables far more accurate and context-sensitive searches, it is not the ultimate solution by any means. Even an improved search engine cannot serve as the cornerstone of a help system, because such a system should present the material systematically in one way or another, not just abandon the user at the Search tab.

Recent Pre-XML Practices

The first model that actually allowed the user to explore the information contained in a help package was the summarizing table (a concept also borrowed from books), in which keywords on the left were accompanied by their descriptions on the right. While originally hand-typed, summarizing tables were soon given a boost by a data-driven technology that allowed topic names, locations, and descriptions to be stored in a separate, Microsoft Excel-like spreadsheet. This spreadsheet was then bound to an HTML table using a data provider called the Tabular Data Control, which also exposed basic sorting and filtering options, as seen in Figure 2.

Figure 2. Filterable table

As you can see, this prototype presented the material in a summarized, comprehensive manner, helping the user get acquainted with the material and providing the big picture before diving into the details. In addition, it implemented the enlightened idea of separating content (the Excel spreadsheet) from presentation (the HTML table). The trouble was that the separation only went halfway: the spreadsheet had to be created manually, so both the hand-typing and the synchronization issues remained unresolved.

An XML-Smart Approach

Going back to the more complicated TOC model, it is easy to see that the XML solution lies in realizing that a TOC is nothing but a way of presenting information; not that of a single document (or even a spreadsheet), but that of a whole set of documents. After all, a topic's title and a short description should already be present in every document. The only thing missing before we can automate TOC creation is some classifying information added to every document; in other words, an element that tells under which category (or categories) the topic should appear. By supplying this information in all the documents, we save hours of manual work, because it takes a program only seconds to scan the documents, retrieve a title from each, and map them all into a tree as dictated by the classifying information.

Take a look at the following recipe document, taken from an online cooking aid:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="../style/singlePage.xslt"?>
<doc>
   <keyword>Apple Pie #1</keyword>
   <description>My favorite apple pie</description>
   <classify by    = "Grandma's recipes" 
             type1 = "desserts"
             type2 = "cakes and pies"
             type3 = "apple pies" />
   ... more info ...
</doc>

Though the syntax of the classifying info, or the classifying instruction as it is called in this project, may be unfamiliar to you, it is straightforward and hopefully easy to understand. The numbered typeN= attributes (you can have as many as you want) are responsible for the actual mapping of the title, with each attribute indicating an additional subgroup that the topic should appear under. The by= attribute selects the hierarchy in which the current document should be registered. While the old TOC gave you only one hierarchy to register your topics in, I realized that, with everything automated, I could start registering my documents in several hierarchies at the same time, each presenting the material from a different perspective or suited to a different type of question. Figure 3 displays the hierarchy we get after compiling this recipe document.
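To make the mapping concrete, here is a minimal Python sketch (purely illustrative; the project itself uses JScript and XSLT) that turns one classifying instruction into a hierarchy name and a path of subgroups:

```python
# Sketch only: convert a classifying instruction (here a plain dict of its
# attributes) into the hierarchy it registers in and the subgroup path.
def classify_path(attrs):
    """Return (hierarchy_name, [subgroup, subgroup, ...]) for one instruction."""
    hierarchy = attrs["by"]
    path = []
    level = 1
    while f"type{level}" in attrs:   # typeN attributes may go as deep as needed
        path.append(attrs[f"type{level}"])
        level += 1
    return hierarchy, path

recipe = {"by": "Grandma's recipes", "type1": "desserts",
          "type2": "cakes and pies", "type3": "apple pies"}
print(classify_path(recipe))
```

Running this on the recipe's instruction yields the hierarchy "Grandma's recipes" with the path desserts → cakes and pies → apple pies, exactly the branch shown in Figure 3.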

Figure 3. Hierarchy derived from a single document

Had the cooking aid contained tens or even hundreds of documents, each registered in several hierarchies, compiling them into multiple TOCs would've been just as easy. You simply launch the wizard, specify the folder containing the documents, and voilà: your hierarchies are ready. The whole process can be visualized as shown in Figure 4.
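The scan-and-map pass behind the wizard could be sketched roughly as follows. This is an illustrative Python stand-in, not the actual compiler (which is JScript/XSLT), and it assumes each topic document carries <keyword> and <classify> elements as in the recipe above:

```python
# Illustrative sketch of the compile step: scan a folder of topic documents
# and map each title into every hierarchy its classify instructions name.
import xml.etree.ElementTree as ET
from pathlib import Path

def compile_hierarchies(folder):
    """Return {hierarchy name: nested dict of groups} built from *.xml topics."""
    hierarchies = {}
    for doc in Path(folder).glob("*.xml"):
        root = ET.parse(doc).getroot()
        title = root.findtext("keyword")
        for instr in root.iter("classify"):       # one instruction per hierarchy
            node = hierarchies.setdefault(instr.get("by"), {})
            level = 1
            while instr.get(f"type{level}") is not None:
                node = node.setdefault(instr.get(f"type{level}"), {})
                level += 1
            # "_topics" is a reserved key in this sketch for titles at a node
            node.setdefault("_topics", []).append(title)
    return hierarchies
```

Recompiling simply means running this scan again, which is why synchronization stops being an issue: whatever is in the folder is what appears in the hierarchies.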

Figure 4. The Dynamic Index concept

Let's take a moment to summarize what we have gained from adding a classifying instruction to each document:

  • It allowed us to fully automate the creation of the TOC.
  • Synchronization is no longer an issue! Recompiling the TOC after every change immediately updates the hierarchies too. A topic's title disappears from the hierarchy as soon as you physically remove the file from the project's folder; dropping the file back into that folder makes its title reappear automatically, which is also how you register new content.
  • The help writer decides where each topic appears. Although this requires good coordination among the authors, it is still better to have each author classify their own topics than to have all the topics classified at the end.

The Demo

In the example, which you can download from the link at the top of the page, I took a dozen familiar XSLT elements and classified the documents into as many hierarchies as I could think of. The results are shown in Figure 5 below.

Figure 5. XSLT Reference Demo

Another thing you might notice about the demo is that it extends the classic TOC with an idea borrowed from the summarizing tables: in my opinion, keywords and descriptions make much more sense together than titles of documents alone. This is also why I called the project a Dynamic Index rather than a Dynamic TOC. The final structure that emerged is much more versatile than that of a TOC alone.


Conclusion

The conceptual innovations of the Dynamic Index benefit help authors and end users alike. Users, first and foremost, get a truly versatile aide, one that can show them the material from several different points of view. Help authors, on the other hand, are freed from the dull task of arranging hierarchies manually. This lets them focus on the accessibility of their manuals, for example by developing specialized views for different profiles of users, questions, and so on. Moreover, updating and revising the content becomes much easier because the whole process is automated. It is these innovations that I believe will allow us to start seeing comprehensive, XML-powered help systems.

Appendix I: The Classifying Stylesheet

The classifying style sheet is not only the compiler's core but also its most interesting feature: it is where the actual mapping of titles into a tree occurs. The general usefulness of the algorithm and its many handy techniques led me to believe it would be appropriate to explain it in depth here, for novices and advanced programmers alike.

Before the style sheet can be applied, a JScript program must scan all the documents, creating an entry for every single classifying instruction it encounters. It is against these entries that the style sheet is then applied. Each entry consists of:

  1. The document's recalculated relative URL.
  2. The classifying instruction (this will tell the algorithm where to place the title).
  3. The keyword (located at the /doc/keyword section of the document).
  4. The keyword's description (/doc/purpose).

Here's a sample:

<entry url="XSLT Reference/apply-templates.xml">
   <classify by    = "conventional"
             type1 = "XSLT Reference"
             type2 = "XSLT Elements" />
   <keyword>apply-templates</keyword>
   <desc>Directs the XSL processor to apply the appropriate template, 
      based on the type and context of each selected node.</desc>
</entry>
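The scanning pass could be approximated in Python as below. The element names (keyword, purpose, entry, desc) follow the list above, while the function itself is a hypothetical stand-in for the JScript program, whose code the article does not show:

```python
# Illustrative stand-in for the JScript scanning pass: for every <classify>
# instruction found in a topic document, emit one <entry> element carrying
# the document's URL, the instruction itself, the keyword, and the description.
import xml.etree.ElementTree as ET

def make_entries(url, doc_xml):
    """Return a list of <entry> elements for one topic document."""
    root = ET.fromstring(doc_xml)
    entries = []
    for instr in root.iter("classify"):
        entry = ET.Element("entry", url=url)                  # 1. relative URL
        entry.append(instr)                                   # 2. classify instruction
        ET.SubElement(entry, "keyword").text = root.findtext("keyword")  # 3.
        ET.SubElement(entry, "desc").text = root.findtext("purpose")     # 4.
        entries.append(entry)
    return entries
```

A document registered in three hierarchies would thus contribute three entries, one per classifying instruction.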

When all the entries are ready, the compiler asks the user whether to exclude any hierarchies from the final index. Then the main classifying work begins. Because the style sheet can create only one tree at a time, we apply it separately for each hierarchy, against only the relevant entries. This is also why the hierarchy's name must be passed in through the $hiername parameter, as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" version="1.0">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
<xsl:param name="hiername"/>
<xsl:template match="/">
   <hierarchy name="{$hiername}">

First, we handle all top-level entries (entries that were registered into the hierarchy under no category at all):

   <xsl:for-each select="root/entry[not(classify/@*[starts-with(name(), 'type')])]">
      <xsl:copy-of select="."/>
   </xsl:for-each>

The remaining entries, each having to appear under one or multiple categories, are then passed to the main classifying template. This recursive template takes two arguments:

  1. Group: the entries to be subclassed at this step, sorted by the corresponding typeN=.
  2. Level: keeps track of recursion/tree depth.

As this is our first call to the recursion, we only need to pass the classifiable entries sorted by type1 of their instruction:

   <!-- Here's the main call of the recursive template -->
   <xsl:call-template name="hier">
      <xsl:with-param name="level" select="1"/>
      <xsl:with-param name="group">
         <xsl:for-each select="root/entry[classify/@type1]">
            <xsl:sort select="classify/@type1"/>
            <xsl:copy-of select="."/>
         </xsl:for-each>
      </xsl:with-param>
   </xsl:call-template>

Next comes the main algorithm.

<xsl:template name="hier"> <!-- main recursive template -->
   <xsl:param name="level"/> <!-- recursion level -->
   <xsl:param name="group"/> <!-- items to be sub-grouped -->
   <xsl:variable name="curType" select="concat('type', $level)"/>
   <xsl:variable name="nextType" select="concat('type', 
                        number($level + 1))"/>

Note that the auxiliary $curType and $nextType variables store the names (not the values) of the current and next typeN= attributes for the given pass. That is, on the first call of the recursion they evaluate to type1 and type2 respectively, on the second to type2 and type3, and so on.

Next, we must break the entries down into groups, each group containing the entries that share a common type. But how do we achieve this? The idea is simple. As you may recall, in the previous step we sorted all the entries by their type1. The only thing left to do is to filter out the duplicate values by taking the first value plus every subsequent value that differs from the one preceding it, as shown in Figure 6 below.

Figure 6. Animals' Categories (distinct values are marked in blue)
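Outside XSLT, this first-value-plus-every-change filter over a sorted list takes only a few lines of Python (a sketch of the idea, not the stylesheet itself):

```python
# Distinct values of an already-sorted list: keep the first item, then every
# item that differs from its immediate predecessor (the trick Figure 6 shows).
def distinct_in_sorted(values):
    return [v for i, v in enumerate(values) if i == 0 or v != values[i - 1]]

print(distinct_in_sorted(["birds", "birds", "fish", "mammals", "mammals"]))
# ['birds', 'fish', 'mammals']
```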

Here is the same idea translated into XSLT:

<xsl:for-each select="msxsl:node-set($group)/entry[
      position() = 1 or 
      preceding-sibling::entry[1]/classify/@*[name() = $curType] !=
      classify/@*[name() = $curType]]">
   <xsl:variable name="curTypeValue"
                 select="classify/@*[name() = $curType]"/>

As stated earlier, for every distinct typeN= value we now create a separate group. Inside that group, we first output all the leaf entries, in other words, those entries whose classification is complete at this point. Leaves are recognized by the absence of the next typeN= attribute; for example, had we been classifying by type4, the leaves would naturally lack type5.

      <group name="{$curTypeValue}">
         <!-- leaf elements go first -->
         <xsl:for-each select="msxsl:node-set($group)/entry
               [classify/@*[name() = $curType] = $curTypeValue]
               [not(classify/@*[name() = $nextType])]">
            <xsl:copy-of select="."/>
         </xsl:for-each>

Finally, we make the recursive self-call, passing on the current group of entries for further subclassing. The three things you should note here are:

  1. The increment of $level.
  2. The way in which leaf entries are filtered out so that only those that do have further classification get passed on.
  3. The essential sorting of the entries for the next step.

Here's the code:

         <xsl:call-template name="hier">
            <xsl:with-param name="level" select="number($level + 1)"/>
            <xsl:with-param name="group">
             <!-- select only nodes that have further classification -->
               <xsl:for-each select="msxsl:node-set($group)/entry
                   [classify/@*[name() = $curType] = $curTypeValue]
                   [classify/@*[name() = $nextType]]">
                   <xsl:sort select="classify/@*[name() = $nextType]"/>
                   <xsl:copy-of select="."/>
               </xsl:for-each>
            </xsl:with-param>
         </xsl:call-template>
      </group>
   </xsl:for-each>
</xsl:template>

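To see the whole recursion at a glance, here is a compact Python sketch of the same grouping algorithm. Entries are plain dicts, with a "types" list standing in for the type1..typeN attributes; this is an illustration of the technique, not a port of the stylesheet:

```python
# Sketch of the "hier" recursion: group entries by their type value at the
# current level, emit leaves (no deeper type) first, then recurse on the rest.
def hier(entries, level=0):
    """Return {group name: {"leaves": [...keywords...], "sub": {...}}}."""
    buckets = {}
    for e in sorted(entries, key=lambda e: e["types"][level]):
        group = buckets.setdefault(e["types"][level], {"leaves": [], "deeper": []})
        if len(e["types"]) == level + 1:      # no next typeN: a leaf at this depth
            group["leaves"].append(e["keyword"])
        else:
            group["deeper"].append(e)         # needs further subclassing
    return {name: {"leaves": g["leaves"],
                   "sub": hier(g["deeper"], level + 1)}
            for name, g in buckets.items()}

entries = [
    {"keyword": "apply-templates", "types": ["XSLT Reference", "XSLT Elements"]},
    {"keyword": "value-of",        "types": ["XSLT Reference", "XSLT Elements"]},
    {"keyword": "overview",        "types": ["XSLT Reference"]},
]
tree = hier(entries)
```

Sorting inside each call plays the role of the xsl:sort on the next typeN=, and the empty-list base case terminates the recursion just as the absence of further typeN= attributes does in the stylesheet.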

I would like to thank Dan Doris from Microsoft for his continuous support and encouragement, as well as my Dad for the countless discussions before and during the development of this paper.