Manage and Generate the SharePoint Thesaurus files with a SharePoint List (Part 2 of 2)

In the last post we addressed the basics of implementing the SharePoint thesaurus file using a SharePoint list.  This included creating the content type, fields and list schema we would use to make this a reality.  As part of that exercise we also created a custom action, a ribbon button, that we could use to generate the thesaurus from the list item entries.  This post will address the page that the custom action links to and the code-behind that does all the work.

As part of creating the custom action we linked to a page in the _layouts/blog folder called thesaurus.aspx and passed it a query string parameter called List, which is the ListId of the list where the custom action was clicked.  Using that list identifier we want to iterate over each of the entries in the list and create a corresponding entry in our thesaurus file.  The syntax for the thesaurus file is pretty simple and is shown below.

    <XML ID="Microsoft Search Thesaurus">
    <!--  Commented out
        <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
            <expansion>
                <sub>Internet Explorer</sub>
                <sub>IE</sub>
                <sub>IE5</sub>
            </expansion>
            <replacement>
                <pat>NT5</pat>
                <pat>W2K</pat>
                <sub>Windows 2000</sub>
            </replacement>
            <expansion>
                <sub>run</sub>
                <sub>jog</sub>
            </expansion>
        </thesaurus>
    -->
    </XML>

The code to generate the thesaurus is also fairly straightforward.  The more difficult part of generating this file is that it must be in Unicode, in our case UTF-16 LE (little-endian), and it must begin with a byte-order mark (BOM).  The image below is a partial hex representation of the file above.  The hex values are shown in the middle and the output on the right.  Notice the first two bytes, FF and FE, and the glyphs that those values represent.  I have found that if your thesaurus file doesn’t begin with this BOM, SharePoint has trouble parsing it.  Most people just edit the existing thesaurus file, so they don’t run into this.  You can’t see the BOM bytes in a text editor, but they are there.  The trick to generating your own file from scratch and having it work is to output these bytes at the start of the file and also, because it is UTF-16, to write every character as two bytes; the second byte of each pair shows up as the periods between the characters in the image.  We won’t go into Unicode or double-byte characters as part of this exercise.  There is plenty of information out there for those interested in learning more.
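To make the byte layout concrete, here is a minimal console sketch, separate from the SharePoint project, that prints the BOM and the two-bytes-per-character encoding described above.  The class name BomDemo is only for illustration.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    public static void Main()
    {
        // Encoding.Unicode is .NET's name for UTF-16 little-endian.
        byte[] bom = Encoding.Unicode.GetPreamble();
        Console.WriteLine(BitConverter.ToString(bom));   // FF-FE

        // Each ASCII character becomes two bytes, low byte first.
        byte[] text = Encoding.Unicode.GetBytes("<XML");
        Console.WriteLine(BitConverter.ToString(text));  // 3C-00-58-00-4D-00-4C-00
    }
}
```

The 00 bytes are the ones a hex viewer typically renders as periods between the visible characters.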


Back to our task at hand.  It seems simple enough, and it should be, although I went through several iterations before I got it just right.  The key, at least as I found, is the stream that is used to output the file.  The code is shown below.

    using System;
    using System.IO;
    using System.Text;
    using System.Web;
    using System.Xml;
    using Microsoft.SharePoint;
    using Microsoft.SharePoint.WebControls;

    public partial class thesaurus : LayoutsPageBase
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // Get the list reference; the indexer returns null if the parameter is missing
            string listId = Request.QueryString["List"];
            if (string.IsNullOrEmpty(listId))
            {
                return;
            }

            Guid listGuid = new Guid(listId);
            SPWeb web = SPContext.Current.Web;
            SPList list = web.Lists[listGuid];

            byte[] bytes;
            using (MemoryStream mem = new MemoryStream())
            {
                // Encoding.Unicode is UTF-16 LE; the StreamWriter writes the FF FE byte-order mark
                using (StreamWriter stream = new StreamWriter(mem, Encoding.Unicode))
                {
                    XmlWriterSettings settings = new XmlWriterSettings();
                    // Ignored when writing to a TextWriter (the StreamWriter's encoding wins); set to match
                    settings.Encoding = Encoding.Unicode;
                    settings.OmitXmlDeclaration = true;
                    settings.ConformanceLevel = ConformanceLevel.Fragment;
                    settings.Indent = true;
                    settings.IndentChars = "\t";
                    XmlWriter writer = XmlWriter.Create(stream, settings);

                    writer.WriteStartElement("XML");
                    writer.WriteAttributeString("ID", "Microsoft Search Thesaurus");
                    writer.WriteStartElement("thesaurus", "x-schema:tsSchema.xml");
                    writer.WriteElementString("diacritics_sensitive", "0");
                    writer.WriteComment("Generated on " + DateTime.Now.ToShortDateString() + " " + DateTime.Now.ToShortTimeString());

                    // Iterate the list items and write one entry per item
                    foreach (SPListItem item in list.Items)
                    {
                        string entryType = Convert.ToString(item["ThesaurusEntryType"]);
                        string word = Convert.ToString(item["ThesaurusWord"]).Trim();

                        if (entryType == "Expansion")
                        {
                            writer.WriteStartElement("expansion");
                            writer.WriteElementString("sub", word);
                        }
                        else if (entryType == "Replacement")
                        {
                            writer.WriteStartElement("replacement");
                            writer.WriteElementString("pat", word);
                        }
                        else
                        {
                            // Skip unknown entry types so the end tags below stay balanced
                            continue;
                        }

                        // Write the comma-separated substitutions
                        string substitutions = Convert.ToString(item["ThesaurusSubs"]);
                        string[] subs = substitutions.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
                        foreach (string sub in subs)
                        {
                            writer.WriteElementString("sub", sub.Trim());
                        }

                        // Close the expansion or replacement element
                        writer.WriteEndElement();
                    }

                    writer.WriteEndElement(); // close thesaurus
                    writer.WriteEndElement(); // close XML
                    writer.Flush();
                    writer.Close();
                    stream.Flush();

                    // Convert the memory stream to an array of bytes
                    bytes = mem.ToArray();
                }
            }

            // Send the XML file to the web browser for download
            Response.Clear();
            Response.Buffer = true;
            Response.ContentEncoding = Encoding.Unicode;
            Response.AppendHeader("content-disposition", "attachment; filename=tsneu.xml");
            Response.Cache.SetExpires(DateTime.UtcNow.AddMinutes(-1));
            Response.Cache.SetCacheability(HttpCacheability.NoCache);
            Response.Cache.SetNoStore();
            Response.AppendHeader("Content-Length", bytes.Length.ToString());
            Response.ContentType = "text/xml";
            Response.BinaryWrite(bytes);
            Response.End();
        }
    }

In all the attempts I made to get the output correct I used an XmlWriter to create the thesaurus file.  I tried backing the XmlWriter with a MemoryStream, a StringBuilder and a few others but had no luck.  Although I set an encoding on the XmlWriterSettings, and in the other attempts set the stream to do the same, it seems (again in my limited testing) they don’t output the file as I would expect, which may be a limited understanding on my part.  It wasn’t until I wrapped a MemoryStream in a StreamWriter created with Encoding.Unicode that I was able to get the output correct.  Note that when an XmlWriter is created over a TextWriter such as a StreamWriter, the Encoding in XmlWriterSettings is ignored and the TextWriter’s own encoding is used, so it is the StreamWriter that determines the bytes and the BOM here.  The rest is pretty straightforward: iterate the list items and write each one to the file based on the type of thesaurus entry.
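The StreamWriter and MemoryStream combination can be checked in isolation.  The sketch below is a stand-alone console program, not part of the SharePoint solution, and the class name WriterEncodingDemo is made up for illustration.  It deliberately sets a different encoding on the XmlWriterSettings to show that the StreamWriter’s encoding is the one that ends up in the bytes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class WriterEncodingDemo
{
    public static void Main()
    {
        using (MemoryStream mem = new MemoryStream())
        using (StreamWriter stream = new StreamWriter(mem, Encoding.Unicode))
        {
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Encoding = Encoding.UTF8;   // deliberately different from the StreamWriter
            settings.OmitXmlDeclaration = true;

            XmlWriter writer = XmlWriter.Create(stream, settings);
            writer.WriteStartElement("XML");
            writer.WriteEndElement();
            writer.Flush();
            stream.Flush();

            byte[] bytes = mem.ToArray();
            // The first four bytes should be the UTF-16 LE BOM followed by '<' as two bytes.
            Console.WriteLine(BitConverter.ToString(bytes, 0, 4));  // FF-FE-3C-00
        }
    }
}
```

If the first two bytes are anything other than FF-FE, the writer chain is not producing the file format described earlier.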

Once deployed, the custom action should look similar to the screenshot below.  When you click the ‘Create Thesaurus File’ button, the file is generated and you are prompted to save it locally.


There are several enhancements that could be made to the project.  These include the ability to validate the data in the list.  The SharePoint thesaurus doesn’t like dashes (-) in values or duplicate entries.  It could also be enhanced to generate thesaurus files for different countries based on a column that indicates the target country.  I am sure there are other improvements that could be made as well.
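As a rough idea of what the validation enhancement might look like, here is a minimal sketch that flags dashes and duplicates in a set of terms.  ThesaurusValidator and its Validate method are hypothetical and not part of the downloadable solution.  It works on plain strings rather than SPListItem objects, and treating duplicates as case-insensitive is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of the original solution.
public static class ThesaurusValidator
{
    // Returns a list of human-readable problems; the list is empty when no issues are found.
    public static List<string> Validate(IEnumerable<string> terms)
    {
        List<string> problems = new List<string>();
        // Assumption: duplicates are compared without regard to case.
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string raw in terms)
        {
            string term = (raw ?? string.Empty).Trim();
            if (term.Length == 0)
            {
                problems.Add("Empty term");
                continue;
            }
            if (term.Contains("-"))
            {
                problems.Add("Term contains a dash: " + term);
            }
            if (!seen.Add(term))
            {
                problems.Add("Duplicate term: " + term);
            }
        }
        return problems;
    }

    public static void Main()
    {
        // "e-mail" contains a dash and "ie" repeats "IE", so two problems are reported.
        foreach (string problem in Validate(new string[] { "IE", "e-mail", "ie" }))
        {
            Console.WriteLine(problem);
        }
    }
}
```

In the page itself, a check like this could run over the ThesaurusWord and ThesaurusSubs values before anything is written, so that a bad entry is reported instead of producing a file SharePoint will not accept.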

I placed the completed solution on MSDN Code Gallery.  You can download the completed project here.