Revamping the RSS Bandit Application

 

Dare Obasanjo
Microsoft Corporation

September 15, 2003

Summary: Dare Obasanjo revisits his RSS Bandit project, a C# application that retrieves and displays news feeds from various Web sites, and improves it using various XML features of the .NET Framework to build a rich .NET client application. (16 printed pages)

Download the RSSBandit Installer.msi sample file.

Introduction

In my previous article, I described the inner workings of the RSS Bandit application, which aggregates information from various Web sites by processing their RSS feeds. As explained there, RSS is an XML format used for syndicating news and similar content from online news sources.

An RSS feed is a regularly updated XML document that contains metadata about a news source and the content in it. Minimally, an RSS feed consists of a channel element that represents the news source and has a title, link, and description that describe it. Additionally, an RSS feed typically contains one or more item elements that represent individual news items, each of which should have a title, link, or description.
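To make that structure concrete, here is a minimal sketch that loads a small hand-written RSS 2.0 document and reads the channel title and one label per item. The feed content and the names MinimalFeedExample, SampleFeed, and Summarize are invented for illustration and are not part of RSS Bandit.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class MinimalFeedExample
{
    // A hand-written feed used only for illustration.
    public const string SampleFeed =
        "<rss version='2.0'>" +
        "<channel>" +
        "<title>Example News</title>" +
        "<link>http://www.example.com/</link>" +
        "<description>An example news source</description>" +
        "<item><title>First story</title><link>http://www.example.com/1</link></item>" +
        "<item><description>A story with no title</description></item>" +
        "</channel>" +
        "</rss>";

    // Returns the channel title followed by one label per item.
    public static List<string> Summarize(string feedXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(feedXml);

        List<string> lines = new List<string>();
        XmlNode channelTitle = doc.SelectSingleNode("/rss/channel/title");
        lines.Add(channelTitle != null ? channelTitle.InnerText : "(untitled channel)");

        foreach (XmlNode item in doc.SelectNodes("/rss/channel/item"))
        {
            // Fall back to the description when an item has no title.
            XmlNode label = item.SelectSingleNode("title") ?? item.SelectSingleNode("description");
            lines.Add(label != null ? label.InnerText : "(empty item)");
        }
        return lines;
    }
}
```

Calling MinimalFeedExample.Summarize(MinimalFeedExample.SampleFeed) yields the channel title first and then a label for each of the two items, the second taken from its description because it has no title.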

Since I wrote that article, RSS has become more widespread as a mechanism for disseminating news across the Web. Not only are RSS feeds used by online news sources such as Yahoo! News, the BBC, and Rolling Stone magazine, but they have also become popular among developer-centric information sources such as the Microsoft Developer Network (MSDN), the Oracle Technology Network (OTN), and the Sun Developer Network. With the proliferation of RSS feeds, a desktop news aggregator is a powerful tool for people interested in keeping abreast of information from a variety of news sources without having to navigate several Web sites to get their information fix.

In the past few months, the RSS Bandit Workspace on GotDotNet has been fairly active, with contributions from various members of the .NET Framework developer community, including Torsten Rendelmann, Michael Earls, Joe Feser, and a number of other workspace members. This article describes the inner workings of various additions to the RSS Bandit application over that period.

A Look at the RSS Bandit User Interface

The user interface for RSS Bandit is inspired by mail and newsreaders such as Microsoft Outlook® and Microsoft Outlook Express. An improved user interface is the most notable difference between the current version of RSS Bandit and the version in my previous article. RSS Bandit uses the Magic Library, which is a framework of user interface controls that provide richer functionality than the basic Windows Forms controls.


Figure 1. Reading news with RSS Bandit

One of the powerful features provided by the Magic Library is the ability to create tabbed panes that allow one to nest several forms in a single application. Figure 2 below shows how the tabbed panes in the Magic Library enable one to use multiple Web browser windows from within RSS Bandit.


Figure 2. Browsing the Web with RSS Bandit

Overview of the RSS Bandit Architecture

The RSS Bandit application consists of two distinct parts: the graphical user interface components and the XML and networking components. The primary GUI classes are the WinGuiMain and RssBanditApplication classes, while the primary XML and networking classes are RssHandler and RssLocater.

The RssHandler class downloads RSS feeds at specified intervals and hands them over to a CacheManager that stores them. The store used by the CacheManager is not tightly coupled to the application and, in fact, the CacheManager is an abstract class that currently has one concrete implementation, the FileCacheManager, which caches files on the local file system. This flexibility means it is possible to introduce new types of CacheManager in the future that use better and more optimized stores, such as a database management system. Similarly, the RssHandler class is not tightly coupled to the user interface and can be reused by other applications that need to process RSS feeds. Clients that utilize the RssHandler class register a callback (delegate) upon instantiating the class. The RssHandler object then invokes the registered callback when new or updated feeds are downloaded. The information about which feeds to download and other configuration data is obtained from a feed-subscription list written in XML. The RssLocater is used when attempting to discover the RSS feed for a particular Web site and it uses a well-defined set of heuristics when attempting to locate the feed.
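The decoupling described above can be sketched as an abstract cache with one file-based implementation, plus a downloader that notifies its client through a delegate. This is only an illustration under assumed names (FeedCache, FileFeedCache, FeedDownloader, FeedUpdatedCallback, Receive); it is not the RSS Bandit source, and the network fetch is replaced by a method that accepts feed content directly.

```csharp
using System;
using System.IO;

// Hypothetical names throughout; this is not the RSS Bandit source.
public abstract class FeedCache
{
    public abstract void Save(string feedUrl, string feedXml);
    public abstract string Load(string feedUrl); // returns null when nothing is cached
}

// One concrete store: one file per feed in a directory on the local file system.
public class FileFeedCache : FeedCache
{
    private readonly string directory;

    public FileFeedCache(string directory)
    {
        this.directory = directory;
        Directory.CreateDirectory(directory);
    }

    private string PathFor(string feedUrl)
    {
        // Percent-encode the URL to obtain a file-system-safe name.
        return Path.Combine(directory, Uri.EscapeDataString(feedUrl) + ".xml");
    }

    public override void Save(string feedUrl, string feedXml)
    {
        File.WriteAllText(PathFor(feedUrl), feedXml);
    }

    public override string Load(string feedUrl)
    {
        string path = PathFor(feedUrl);
        return File.Exists(path) ? File.ReadAllText(path) : null;
    }
}

// Clients register a callback of this shape when they create the downloader.
public delegate void FeedUpdatedCallback(string feedUrl);

public class FeedDownloader
{
    private readonly FeedCache cache;
    private readonly FeedUpdatedCallback onUpdated;

    public FeedDownloader(FeedCache cache, FeedUpdatedCallback onUpdated)
    {
        this.cache = cache;
        this.onUpdated = onUpdated;
    }

    // Stands in for the network fetch: stores the content and notifies the
    // client only when it differs from what is already cached.
    public void Receive(string feedUrl, string feedXml)
    {
        if (cache.Load(feedUrl) == feedXml) return;
        cache.Save(feedUrl, feedXml);
        if (onUpdated != null) onUpdated(feedUrl);
    }
}
```

Because FeedDownloader only sees the abstract FeedCache, a database-backed store could be swapped in without touching the downloader or its clients.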

The RssBanditApplication inherits from ApplicationContext and controls the WinGuiMain. This is a Windows Form that contains a tree view for displaying the list of subscribed feeds grouped by freely definable categories, a list view for displaying information about items from the currently selected feed in the threaded manner familiar from NNTP readers such as Outlook Express, and an embedded Web browser for displaying the item content. On startup, the RssBanditApplication determines whether there is already a running instance of the program. If so, it forwards any command-line arguments to that instance and terminates itself. If no other instance is running, the RssBanditApplication registers a delegate with the RssHandler, which manages downloading and processing RSS feeds. Whenever new or updated feeds are downloaded, the RssBanditApplication is updated through the delegate in a thread-safe manner using techniques described in the Safe, Simple Multithreading in Windows Forms, Part 1 article by Chris Sells.

The RssBanditApplication class also acts as a Mediator for the various user interface components (menus, toolbar buttons, context menus, and so on). As the user interacts with the application, it delegates actions to the WinGuiMain or the RssHandler, or handles them itself. Each primary user interface component that can initiate an action on the user's behalf implements the ICommand interface (the Command pattern) and the ICommandComponent interface to abstract away the implementation details of the various classes.

The user interface also enables the user to manage various aspects of the behavior of the RssHandler class. The user can add and remove feeds from the subscription list, configure how often feeds should be downloaded, organize feeds into categories, and set proxy server information. The RssItemFormatter class handles displaying the content of a news item using the XslTransform class. It takes a user-defined XSLT stylesheet and transforms the RssItem, which implements the IXPathNavigable interface, into HTML.

XML Technologies and RSS Bandit

The RSS Bandit application makes significant use of the XML technologies in the .NET Framework. RSS Bandit uses XML Serialization to convert the XML configuration files to objects and vice versa, XSLT to enable customizable views of news items, XPath to process HTML content of RSS feeds and remove potentially malicious elements, and the System.Xml.XmlWriter class to ensure that it writes out well-formed XML among other things.

Customizable Themes Using XSLT

When news items are being read in RSS Bandit they are displayed in a pane on the bottom right of the application, which is actually an embedded Web browser control. The original version of RSS Bandit did not take advantage of the flexibility gained by using an embedded Web browser to display the content of news feeds. In the current version of RSS Bandit, one can create an XSLT stylesheet that customizes how news items appear in the Web browser pane. Figure 3 is a screenshot of the configuration menu where one can choose a particular stylesheet from the templates folder of the RSS Bandit application.

Figure 3. Choosing a Custom Stylesheet

Each downloaded RSS feed is represented as a FeedInfo object that contains a list of RssItem objects. The RssItem class implements the IXPathNavigable interface, which means it is acceptable input for the Transform method of the System.Xml.Xsl.XslTransform class. The implementation of the IXPathNavigable interface exposes the RssItem as an RSS 2.0 XML feed containing a single item that represents the data within the RssItem. When displaying the news item in the Web browser pane, the currently selected XSLT stylesheet and RssItem instance are passed as input to the Transform method of the XslTransform class, which then renders the results of the transformation in the browser pane.
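The transformation step can be sketched as follows. Note two substitutions: this example uses the XslCompiledTransform class from later versions of the .NET Framework rather than the XslTransform class named above, and it uses an XmlDocument as the IXPathNavigable input in place of an RssItem. The stylesheet and the names ItemRenderer, SampleStylesheet, and Render are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class ItemRenderer
{
    // A tiny stylesheet standing in for a user-selected template.
    public const string SampleStylesheet =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "<xsl:output method='html' />" +
        "<xsl:template match='/'>" +
        "<html><body>" +
        "<h1><xsl:value-of select='/rss/channel/item/title' /></h1>" +
        "<p><xsl:value-of select='/rss/channel/item/description' /></p>" +
        "</body></html>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Applies the stylesheet to a single-item RSS document and returns HTML.
    public static string Render(string stylesheetXml, string itemXml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheetXml)))
        {
            transform.Load(styleReader);
        }

        // XmlDocument implements IXPathNavigable, so it can be passed
        // to Transform the same way an RssItem would be.
        XmlDocument item = new XmlDocument();
        item.LoadXml(itemXml);

        using (StringWriter html = new StringWriter())
        {
            transform.Transform(item, null, html);
            return html.ToString();
        }
    }
}
```

The returned string is what would be handed to the embedded browser control; choosing a different stylesheet changes the presentation without touching the item data.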

Since most RSS feeds do not use XHTML in their content, instead favoring either plain text or regular HTML, it is necessary to process such feeds using Chris Lovett's SgmlReader class, which can be used to convert HTML content to XHTML.

Importing Feed Lists with XSLT

There are a few existing XML formats for storing a list of RSS feeds to which a user is subscribed. These formats include OPML, OCS, and the format that I chose for RSS Bandit in my previous article.

Although RSS Bandit works with my feed list format internally, it can import feed lists that are in either the OPML or OCS format. If an imported feed list is not in the RSS Bandit format, the application checks whether it is in the OPML or OCS format and, if so, applies a stylesheet that converts that format to the RSS Bandit feed list format. Below is the stylesheet that converts OCS files to my feed subscription list format:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns="http://www.25hoursaday.com/2003/RSSBandit/feeds/" 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:dc="http://purl.org/metadata/dublin_core#" 
exclude-result-prefixes="dc rdf">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/">
    <feeds>
      <xsl:for-each select="/rdf:RDF/rdf:description/rdf:description">
        <feed>
          <title>
            <xsl:choose>
              <xsl:when test="dc:title">
                <xsl:value-of select="dc:title" />
              </xsl:when>
              <xsl:otherwise>
                <xsl:text>No title for RSS feed provided in imported OCS</xsl:text>
              </xsl:otherwise>
            </xsl:choose>
          </title>
          <link>
            <xsl:choose>
              <xsl:when test="rdf:description/@about">
                <xsl:value-of select="rdf:description/@about" />
              </xsl:when>
              <xsl:otherwise>
                <xsl:text>No URL for RSS feed provided in imported OCS</xsl:text>
              </xsl:otherwise>
            </xsl:choose>
          </link>
        </feed>
      </xsl:for-each>
    </feeds>
  </xsl:template>
</xsl:stylesheet>

Once the imported file is converted to the RSS Bandit feed subscription list format, it is merged with the internal representation of the feed subscription list processed at startup.
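The format check that precedes the conversion could look something like the sketch below, which inspects only the root element of the imported document. The feed list and RDF namespace URIs are taken from the stylesheet above; the names FeedListFormat, FeedListSniffer, and Detect are hypothetical, and treating any rdf:RDF root as OCS is a simplifying assumption (an RSS 1.0 feed has the same root).

```csharp
using System;
using System.Xml;

public enum FeedListFormat { RssBandit, Opml, Ocs, Unknown }

public static class FeedListSniffer
{
    const string BanditNamespace = "http://www.25hoursaday.com/2003/RSSBandit/feeds/";
    const string RdfNamespace = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Guesses the feed list format from the document's root element.
    public static FeedListFormat Detect(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlElement root = doc.DocumentElement;

        if (root.LocalName == "feeds" && root.NamespaceURI == BanditNamespace)
            return FeedListFormat.RssBandit;
        if (root.LocalName == "opml" && root.NamespaceURI == "")
            return FeedListFormat.Opml;
        // Assumption: an rdf:RDF root is treated as OCS without further checks.
        if (root.LocalName == "RDF" && root.NamespaceURI == RdfNamespace)
            return FeedListFormat.Ocs;
        return FeedListFormat.Unknown;
    }
}
```

The result would then select which conversion stylesheet, if any, to run before merging the feeds into the subscription list.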

Configuration Files and W3C XML Schema

The RSS Bandit feed list format is described using a W3C XML Schema definition (XSD) file that enables the application to utilize the XML Serialization feature of the .NET Framework to convert the XML to strongly typed objects, which provides a more natural programming model when interacting with the contents of the feed list format.

There is also an XML configuration file format for the integrated search feature of RSS Bandit. One can search the Web using a choice of one or more search engines directly from the RSS Bandit user interface. By default, the configuration file contains information about Google, Feedster, and MSN Search. The search configuration file is also processed using the XmlSerializer class to convert it to a graph of strongly typed objects to provide a more natural programming model for interacting with the configuration information. Below is the schema for the search configuration file.

<xs:schema
targetNamespace='http://www.25hoursaday.com/2003/RSSBandit/searchConfiguration/' 
 xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified' 
 xmlns:c='http://www.25hoursaday.com/2003/RSSBandit/searchConfiguration/'>
  <xs:element name='searchConfiguration'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='engine' minOccurs='0' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='title' type='xs:string' />
              <xs:element name='search-link' type='xs:anyURI'>
                <xs:annotation>
                  <xs:documentation>
       This defines the base URL of the search engine. 
       The placeholder for the search expression is '[PHRASE]' without
     the single quotes but with the brackets!
                           </xs:documentation>
                </xs:annotation>
              </xs:element>
              <xs:element name='description' type='xs:string' />
              <xs:element name='image-name' type='xs:string' />
            </xs:sequence>
            <xs:attribute name='active' type='xs:boolean' />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name='open-newtab' type='xs:boolean' use='optional' />
    </xs:complexType>      
  </xs:element>
</xs:schema>
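As a rough illustration of how such a schema maps to strongly typed objects, the following sketch declares classes by hand and deserializes a configuration document with the XmlSerializer class. The class and member names (SearchConfig, SearchEngine, Parse, BuildUrl) are assumptions made for this example; the classes RSS Bandit actually uses may differ.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot("searchConfiguration", Namespace = SearchConfig.Ns)]
[XmlType(Namespace = SearchConfig.Ns)]
public class SearchConfig
{
    public const string Ns = "http://www.25hoursaday.com/2003/RSSBandit/searchConfiguration/";

    // Each <engine> element becomes one array entry.
    [XmlElement("engine")]
    public SearchEngine[] Engines;

    [XmlAttribute("open-newtab")]
    public bool OpenNewTab;

    public static SearchConfig Parse(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(SearchConfig));
        using (StringReader reader = new StringReader(xml))
        {
            return (SearchConfig)serializer.Deserialize(reader);
        }
    }
}

[XmlType(Namespace = SearchConfig.Ns)]
public class SearchEngine
{
    [XmlElement("title")] public string Title;
    [XmlElement("search-link")] public string SearchLink;
    [XmlElement("description")] public string Description;
    [XmlElement("image-name")] public string ImageName;
    [XmlAttribute("active")] public bool Active;

    // Substitutes the user's query for the [PHRASE] placeholder
    // described in the schema documentation.
    public string BuildUrl(string phrase)
    {
        return SearchLink.Replace("[PHRASE]", Uri.EscapeDataString(phrase));
    }
}
```

With objects like these, code that builds a search URL works with properties such as Title and SearchLink instead of navigating XML nodes.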

Figure 4 shows how the information in the search configuration file is used in the RSS Bandit application.


Figure 4. Searching the Web from RSS Bandit

Generating Well-formed XML and Exporting OPML Files

The OPML format is used by a number of news aggregators to store feed list information, so it is beneficial to users of RSS Bandit for it to be able to export its internal feed list to an OPML file. In the original version of RSS Bandit, I generated OPML files with the following code that worked in a few test cases, but failed once enough users tried out the functionality.

StringBuilder sb = new StringBuilder("<opml>\n<body>\n"); 

if(_feedsTable != null){

   foreach(feedsFeed f in _feedsTable.Values){
      sb.AppendFormat("<outline title='{0}' xmlUrl='{1}' />\n", f.title, f.link);
   }
}            
sb.Append("</body>\n</opml>");

The problem with the above code is that it treats constructing XML as nothing more than concatenating text values, which, although tempting, is wrong. RSS Bandit generated ill-formed XML when people used characters considered special to XML, such as the ampersand (&) or single quote ('), in the title of an RSS feed. To fix this problem, I decided to use the .NET Framework class specifically designed for writing out XML: the XmlWriter class. Below is the same task rewritten to use the XmlWriter class.

XmlTextWriter writer = new XmlTextWriter(feedStream, System.Text.Encoding.UTF8);

writer.WriteStartElement("opml");
writer.WriteStartElement("body");            

if(_feedsTable != null) {

   foreach(feedsFeed f in _feedsTable.Values) {
      writer.WriteStartElement("outline");
      writer.WriteAttributeString("title", f.title);
      writer.WriteAttributeString("xmlUrl", f.link);
      writer.WriteEndElement();                  
   }
}            

writer.WriteEndElement(); //close <body>
writer.WriteEndElement(); //close <opml>
writer.Flush();
writer.Close();   
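To see the difference in isolation, the following sketch writes a single outline element whose title contains both an ampersand and a single quote, then lets the caller load the result back to confirm it is well formed. It uses XmlWriter.Create from later versions of the .NET Framework instead of XmlTextWriter, and the names OpmlEscapingDemo and WriteOutline are made up for the example.

```csharp
using System;
using System.IO;
using System.Xml;

public static class OpmlEscapingDemo
{
    // Writes one <outline> element and returns it as a string.
    public static string WriteOutline(string title, string xmlUrl)
    {
        StringWriter output = new StringWriter();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("outline");
            // The writer escapes special characters in attribute values.
            writer.WriteAttributeString("title", title);
            writer.WriteAttributeString("xmlUrl", xmlUrl);
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

For a title such as Tom & Jerry's Blog, the ampersand is emitted as an entity reference, and loading the returned string into an XmlDocument gives back the original title unchanged.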

XPath and RSS Bandit: Autodiscovering Feeds

One of the major difficulties in subscribing to a Web site's RSS feed is discovering where its RSS feed is located. In August of 2002, Mark Pilgrim described an algorithm for an ultra-liberal RSS locater that consisted of the following steps:

  1. Given the main address of a Web site, download the home page and look for LINK elements that point to RSS feeds. If you find any, use them.
  2. If the site doesn't support RSS autodiscovery through LINK elements, scan all the links on the page and guess intelligently about which one(s) of them points to an RSS feed. Links to addresses on the same server that end in .rss, .rdf, or .xml are prime candidates for being feeds. Download each of these and see which ones actually are RSS feeds by checking the initial content of each file.
  3. If unsuccessful, look for links to addresses on the same server that contain rss, rdf, or xml anywhere in the address. Check to see if any of them is an RSS file.
  4. If still unsuccessful, repeat the previous two steps in order, but expand the search to include addresses on external servers since many weblogs use a third-party service to provide RSS feeds for their Web site. Weed out 127.0.0.1 addresses then check to see if any remaining are RSS files.
  5. If still unsuccessful, then look on Syndic8. Syndic8 keeps track of thousands of RSS feeds for various sites and provides an XML-RPC interface for interacting with it programmatically.
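Steps 2 through 4 above amount to ranking the links found on a page. The sketch below is one possible reading of that ordering and not the RssLocater implementation: same-server links that end in a feed extension come first, then same-server links that merely contain a hint, then the same two groups for external servers, with loopback addresses dropped. All names (FeedLinkHeuristics, RankCandidates, and so on) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class FeedLinkHeuristics
{
    static readonly string[] Hints = { "rss", "rdf", "xml" };

    // Step 2: the path ends in .rss, .rdf, or .xml.
    public static bool EndsWithFeedExtension(Uri link)
    {
        string path = link.AbsolutePath.ToLowerInvariant();
        foreach (string hint in Hints)
            if (path.EndsWith("." + hint, StringComparison.Ordinal)) return true;
        return false;
    }

    // Step 3: the address contains rss, rdf, or xml anywhere.
    public static bool ContainsFeedHint(Uri link)
    {
        string address = link.AbsoluteUri.ToLowerInvariant();
        foreach (string hint in Hints)
            if (address.Contains(hint)) return true;
        return false;
    }

    // Orders candidates from most to least likely; each would still need
    // to be downloaded and checked to see whether it really is a feed.
    public static List<Uri> RankCandidates(Uri site, IEnumerable<Uri> links)
    {
        List<Uri>[] tiers = { new List<Uri>(), new List<Uri>(), new List<Uri>(), new List<Uri>() };

        foreach (Uri link in links)
        {
            if (link.IsLoopback) continue; // weed out 127.0.0.1 and similar

            bool sameServer = string.Equals(link.Host, site.Host, StringComparison.OrdinalIgnoreCase);
            int tier;
            if (EndsWithFeedExtension(link)) tier = sameServer ? 0 : 2;
            else if (ContainsFeedHint(link)) tier = sameServer ? 1 : 3;
            else continue;

            tiers[tier].Add(link);
        }

        List<Uri> ranked = new List<Uri>();
        foreach (List<Uri> bucket in tiers) ranked.AddRange(bucket);
        return ranked;
    }
}
```

Step 1 (LINK elements) and step 5 (the Syndic8 lookup) sit before and after this ranking and are not covered by the sketch.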

RSS Bandit implements the aforementioned autodiscovery process using the RssLocater class. The first step involves downloading the Web site and searching it for links. Being an XML aficionado, I wanted to use XPath to search the document for links but realized this would be difficult because most sites are not written using the XML-based XHTML markup language, but rather prior versions of HTML that are not compatible with XML. This is where Chris Lovett's SgmlReader class comes to the rescue. The SgmlReader class can read in an HTML document and present it as an XML document, which can then be processed using the traditional XML APIs in the .NET Framework. The following code fragment shows how one can obtain all the LINK elements that reference an RSS feed in an HTML document using XPath.

   //present the HTML page as XML so that it can be queried with XPath
   SgmlReader reader = new SgmlReader(); 
   reader.InputStream = new StreamReader(GetWebPage(url));
   reader.Href = url;    
   reader.DocType = "HTML";           
   XmlDocument doc = new XmlDocument(); 
   doc.XmlResolver = null; 
   doc.Load(reader);

   ArrayList list = new ArrayList(); 

   //<link rel="alternate" type="application/rss+xml" title="RSS" href="url/to/rss/file">

   foreach(XmlNode node in doc.SelectNodes("//*[local-name()='link' and " +
         "@type='application/rss+xml' and @title='RSS']/@href")){
     //use a distinct name; redeclaring url inside the loop would not compile
     string feedUrl = ConvertToAbsoluteUrl(node.Value, node.BaseURI); 
     if(LooksLikeRssFeed(feedUrl)){
       list.Add(feedUrl); 
     }
   }

Figure 5 is a screenshot of the dialog that pops up when the Autodiscover Feeds button is clicked in RSS Bandit while the site being viewed in the embedded Web browser is the MSDN home page.

Figure 5. Start Feed Autodiscovery

The screenshot in Figure 6 shows the results of a successful attempt at locating the RSS feed for a site.

Figure 6. Feed(s) located

Using XPath to Filter Potentially Malicious Content

As mentioned earlier, the HTML content in an RSS feed is converted to XHTML and then displayed in the browser pane. This can lead to security issues if care is not taken to strip potentially malicious elements such as script blocks from the HTML content within the RSS feed. Since the HTML content within an RSS feed is converted into XHTML using Chris Lovett's SgmlReader class, it is fairly straightforward to use XPath and the XmlDocument class to strip out unwanted markup. The following code fragment shows how XPath is used to filter out potentially malicious elements and attributes from the content within an RSS feed.

//remove potentially malicious tags 
      string badtagQuery = "//@style | //*[local-name()='script' or " +
         "local-name()='object' or local-name()='embed' or " +
         "local-name()='iframe' or local-name()='meta' or " +
         "local-name()='frame' or local-name()='frameset' or " +
         "local-name()='link' or local-name()='style']";

      foreach(XmlNode badtag in doc.SelectNodes(badtagQuery)){
            
         XmlAttribute badattr = badtag as XmlAttribute;

         if(badattr != null){
            badattr.OwnerElement.Attributes.Remove(badattr); 
         }else{
            badtag.ParentNode.RemoveChild(badtag); 
         }
      }
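For experimenting with the filter outside the application, here is a self-contained variant that takes an XHTML string and returns the cleaned markup. It differs from the fragment above in one respect: it copies the matching nodes into a list before removing anything, a cautious choice so that removals do not disturb the query result while it is being enumerated. The names ContentSanitizer and Sanitize are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ContentSanitizer
{
    // Same node selection as the query shown above.
    const string BadNodeQuery =
        "//@style | //*[local-name()='script' or local-name()='object' " +
        "or local-name()='embed' or local-name()='iframe' or local-name()='meta' " +
        "or local-name()='frame' or local-name()='frameset' or local-name()='link' " +
        "or local-name()='style']";

    public static string Sanitize(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Copy the matches first, then modify the document.
        List<XmlNode> matches = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes(BadNodeQuery)) matches.Add(node);

        foreach (XmlNode node in matches)
        {
            XmlAttribute attribute = node as XmlAttribute;
            if (attribute != null)
            {
                if (attribute.OwnerElement != null)
                    attribute.OwnerElement.Attributes.Remove(attribute);
            }
            else if (node.ParentNode != null)
            {
                node.ParentNode.RemoveChild(node);
            }
        }
        return doc.OuterXml;
    }
}
```

Passing in a fragment with a style attribute and a script element returns the same fragment with both removed and the remaining markup intact.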

Posting Comments from RSS Bandit

I initially created RSS Bandit as a way to track various weblogs I visited regularly. Early on I realized that one of the interesting aspects of weblogs is their conversational nature, especially the way one can watch discussions ripple across the Web. RSS Bandit has a number of features that attempt to take advantage of the conversational nature of weblogs.

If a news item from a particular RSS feed references or is referenced by other news items that are also in RSS Bandit, then such relationships are displayed in the user interface as threaded messages reminiscent of e-mail and news readers. This provides a nice visual mechanism to track discussions across the weblogs one is subscribed to because posts that reference each other are shown together. RSS Bandit also provides various means to interact with the comments posted in response to a particular news item, depending on the information provided in its RSS feed. There are a number of RSS elements that are used to provide information about the comments to a news item: the comments element provides a link to the page where one can post comments to the news item, the slash:comments element indicates the number of comments that have been posted in response to the news item, the wfw:commentRss element provides the location of the RSS feed for the comments to the news item, and the wfw:comment element provides the URI that accepts RSS items sent as replies to the news item using HTTP POST.
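Reading these four elements from an item requires binding the slash and wfw prefixes to their namespaces. The following is a minimal sketch under assumed names (CommentInfo, CommentElements, Read); the namespace URIs are the ones commonly used for the Slash module and the CommentAPI, and the sketch is not taken from the RSS Bandit source.

```csharp
using System;
using System.Xml;

public class CommentInfo
{
    public string CommentsPage;    // <comments>
    public int CommentCount = -1;  // <slash:comments>, -1 when absent
    public string CommentFeed;     // <wfw:commentRss>
    public string CommentPostUrl;  // <wfw:comment>
}

public static class CommentElements
{
    // Extracts the comment-related elements from a single RSS <item> node.
    public static CommentInfo Read(XmlNode item)
    {
        XmlNamespaceManager ns = new XmlNamespaceManager(item.OwnerDocument.NameTable);
        ns.AddNamespace("slash", "http://purl.org/rss/1.0/modules/slash/");
        ns.AddNamespace("wfw", "http://wellformedweb.org/CommentAPI/");

        CommentInfo info = new CommentInfo();
        info.CommentsPage = Text(item, "comments", ns);
        info.CommentFeed = Text(item, "wfw:commentRss", ns);
        info.CommentPostUrl = Text(item, "wfw:comment", ns);

        int count;
        if (int.TryParse(Text(item, "slash:comments", ns), out count))
            info.CommentCount = count;
        return info;
    }

    // Returns the trimmed text of the first match, or null when there is none.
    static string Text(XmlNode item, string xpath, XmlNamespaceManager ns)
    {
        XmlNode node = item.SelectSingleNode(xpath, ns);
        return node != null ? node.InnerText.Trim() : null;
    }
}
```

A user interface could then enable a "post reply" action only when CommentPostUrl is present and show a comment count only when CommentCount is not -1.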

RSS Bandit supports the aforementioned elements that provide information about the comments to a particular news item. Figure 7 below is a screenshot of RSS Bandit showing all four of the comment-related RSS elements in use.


Figure 7. Read and Post Comments from RSS Bandit

The mechanism for posting comments from RSS Bandit is the CommentAPI, which specifies how applications can send responses to news items in an RSS feed by posting an RSS item to a particular URI. The following code fragment shows how RSS Bandit uses HTTP POST to send a reply to a news item in an RSS feed.

public HttpStatusCode PostCommentViaCommentAPI(string url, RssItem item){

      HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
      request.Timeout     = 1 * 60 * 1000; //one minute timeout 
      request.UserAgent   = this.UserAgent; 
      request.Proxy       = this.Proxy;
      request.Credentials = CredentialCache.DefaultCredentials;
      request.Method      = "POST";
      request.ContentType = "text/xml";

      //serialize the item as RSS and encode it up front so that
      //ContentLength is a byte count rather than a character count
      string comment = item.ToString(true);
      byte[] body = System.Text.Encoding.UTF8.GetBytes(comment);
      request.ContentLength = body.Length; 
      Trace.WriteLine(comment);             

      Stream requestStream = null; 
      try{ 
         requestStream = request.GetRequestStream();
         requestStream.Write(body, 0, body.Length); 
      } catch(Exception e){
         throw new WebException(e.Message, e); 
      }finally{
         if(requestStream != null){
            requestStream.Close();    
         }
      }

      //release the connection once the status code has been read
      HttpWebResponse response = (HttpWebResponse) request.GetResponse(); 
      try{
         return response.StatusCode;    
      }finally{
         response.Close();
      }
}

It should be noted that support for the CommentAPI is not yet widespread. Web sites that support the CommentAPI are named in the list of CommentAPI implementations. A notable supporter of the CommentAPI is the .Text blog engine that is used by both Weblogs @ ASP.NET and Weblogs @ DotNetJunkies.com.

RSS Bandit stores all comments posted through the CommentAPI in a virtual folder, which is shown in Figure 8.


Figure 8. The Sent Items Folder

Plug-in Architecture

In a recent episode of MSDN TV called Passing XML Data Inside the CLR, Don Box describes various mechanisms for passing XML between components in the .NET Framework. There are a number of types one could choose to represent an XML document in the .NET Framework, such as String, IXPathNavigable, XmlDocument, or XmlReader, each of which has its tradeoffs. Recently Simon Fell proposed the IBlogExtension interface as a common mechanism for news aggregators built on the .NET Framework to share information with plug-ins and chose the IXPathNavigable interface as the means of passing XML (specifically RSS items) between news aggregators and plug-ins.
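To illustrate what receiving an RSS item as an IXPathNavigable looks like from the plug-in side, here is a sketch built around a hypothetical IItemPlugin interface. It is a stand-in invented for this example and does not reproduce the actual IBlogExtension definition; the plug-in simply reads the item's title and link through an XPathNavigator.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

// Hypothetical stand-in for a plug-in contract; not the actual IBlogExtension definition.
public interface IItemPlugin
{
    string DisplayName { get; }
    string Describe(IXPathNavigable rssFragment);
}

public class TitleQuotingPlugin : IItemPlugin
{
    public string DisplayName
    {
        get { return "Quote title"; }
    }

    // Expects a single-item RSS document and returns a short citation for it.
    public string Describe(IXPathNavigable rssFragment)
    {
        XPathNavigator navigator = rssFragment.CreateNavigator();
        XPathNavigator title = navigator.SelectSingleNode("/rss/channel/item/title");
        XPathNavigator link = navigator.SelectSingleNode("/rss/channel/item/link");

        return string.Format("\"{0}\" ({1})",
            title != null ? title.Value : "untitled",
            link != null ? link.Value : "no link");
    }
}
```

Because the plug-in depends only on IXPathNavigable, the aggregator is free to pass an XmlDocument, an XPathDocument, or its own item class without the plug-in needing to know which.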

RSS Bandit supports the IBlogExtension interface, thus allowing developers to build plug-ins that integrate with RSS Bandit. Figure 9 is a screenshot showing the integration of the w.bloggar plug-in available at http://www.sharpreader.net/wBloggarPlugin.zip written by Luke Hutteman, which provides the ability for users to post about a particular news item in their weblog using the popular w.bloggar weblog editor.


Figure 9. Post to your Weblog from RSS Bandit

Future Plans for RSS Bandit

RSS Bandit has grown by leaps and bounds since my last article, primarily due to the efforts of Torsten Rendelmann and me, with help from a number of people from the RSS Bandit workspace. I plan to continue development of RSS Bandit along with various members of the .NET developer community and keep using it as a way to showcase the capabilities of the .NET Framework as a platform for building rich client applications that harness the power of XML.

There are a few features I'd like to see in the next major release of RSS Bandit, such as automatic updates using the Updater Application Block, a newspaper-like view of a selection of news items, and the ability to directly edit one's weblog from RSS Bandit. If you'd like to help with adding these features or others to RSS Bandit, feel free to join the workspace and help out. We can always use the help.

Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespaces of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).

Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.