XHTML Validation with XmlReader

As a developer on the ASP.NET team, I can hardly avoid writing HTML, and for a variety of reasons it’s important that this HTML follows web standards. I was recently fixing some bugs where I was outputting non-compliant XHTML when I thought, “I should have unit tests for this.”

Fortunately, .NET’s System.Xml.XmlReader already supports validation against a DTD. My tests also need to be fast and reliable, so I can’t have the reader downloading the DTD named in the DOCTYPE on every run. Thanks to some insight from the XML team, I can easily avoid this with the XmlPreloadedResolver (in the System.Xml.Resolvers namespace), which serves well-known DTDs from copies built into the framework:

XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);
XmlReader reader = XmlReader.Create(..., settings);
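To make the snippet above concrete, here is a minimal sketch of a fully offline XHTML 1.0 Strict check. The class name, the FirstError helper and the StrictDoctype constant are hypothetical names introduced for illustration. The sketch also turns on ValidationType.DTD, which the snippet above leaves out, so that the document is actually validated rather than only parsed.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;
using System.Xml.Schema;

public static class Xhtml10OfflineCheck
{
    // Standard XHTML 1.0 Strict DOCTYPE; the preloaded resolver recognizes it.
    public const string StrictDoctype =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">";

    // Hypothetical helper: returns null when the document is valid,
    // otherwise the message of the first error encountered.
    public static string FirstError(string document)
    {
        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.ValidationType = ValidationType.DTD;
        // Serve the XHTML 1.0 DTDs from the copies embedded in the framework.
        // No fallback resolver is supplied, so nothing is fetched over the network.
        settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);

        try
        {
            using (var sr = new StringReader(document))
            using (var reader = XmlReader.Create(sr, settings))
            {
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException e) { return e.Message; }        // not well-formed
        catch (XmlSchemaException e) { return e.Message; }  // violates the DTD
    }
}
```

Calling Xhtml10OfflineCheck.FirstError(Xhtml10OfflineCheck.StrictDoctype + page) should return null for a valid page and an error message for, say, a div nested inside a p.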

However, I wanted to validate against XHTML 1.1, which the preloaded resolver doesn’t include (XmlKnownDtds covers only XHTML 1.0 and RSS 0.91). So I downloaded the flat DTD instead, deployed it with my tests, and referenced it from the DOCTYPE:

 <!DOCTYPE {0} PUBLIC "-//W3C//DTD XHTML 1.1//EN" "xhtml11-flat.dtd">

Notice the “{0}” placeholder: the string is kept in a DOCTYPE constant and passed through String.Format so that each test can supply its own root element. Now, the actual validation:

// requires: System.IO, System.Text.RegularExpressions, System.Xml, System.Xml.Schema
public void ValidateXhtml(string html) {
    // determine root element (must be single root); the character class
    // also accepts "/" so that a self-closing root such as <br/> matches
    Match match = Regex.Match(html, @"<(\w+)[\s/>]");
    Assert.IsTrue(match.Success, "No root element found.");
    string root = match.Groups[1].Value;

    // prepend DOCTYPE to specify root and local DTD
    html = String.Format(DOCTYPE, root) + html;

    // enable DTD validation
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Parse;
    settings.ValidationType = ValidationType.DTD;

    // validate! (no ValidationEventHandler is attached, so the first
    // validation error is thrown as an XmlSchemaException)
    using (StringReader sr = new StringReader(html))
    using (XmlReader xr = XmlReader.Create(sr, settings)) {
        try {
            while (xr.Read()) { }
        }
        catch (XmlException e) {
            // not well-formed XML
            Assert.Fail(e.Message);
        }
        catch (XmlSchemaException e) {
            // well-formed, but violates the DTD
            Assert.Fail(e.Message);
        }
    }
}

I followed up with a little investigation to see how this compares to the W3C validator. Unfortunately, once the XmlReader throws, it doesn’t seem to support skipping ahead to the next error or node, so I couldn’t find a way to get a complete list of XHTML errors. So instead, I did a manual comparison:

  1. Copied page source from www.twitter.com (uses XHTML 1.0 strict)
  2. Validated with the XmlReader (recorded and manually fixed each error until it passed)
  3. Validated original source with the W3C validator and compared against the XmlReader errors
  4. Validated updated source with the W3C validator and verified that it passed

Good news: the XmlReader detected all of the same errors, and my updated source passed W3C’s validation. The main difference was in how the errors were reported. For this page at least, the XmlReader agrees with the W3C validator, so it will work for my unit tests!
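On the question of getting a complete list of errors, one avenue that may be worth trying is the ValidationEventHandler event on XmlReaderSettings. When a handler is attached, DTD validation errors are passed to it instead of being thrown, so the reader keeps going. Well-formedness errors still surface as an XmlException and end the read. The sketch below is an illustration rather than something I compared against the W3C validator; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DtdErrorCollector
{
    // Hypothetical helper: reads the whole document and returns every
    // DTD validation message reported along the way.
    public static List<string> Collect(string document)
    {
        var errors = new List<string>();

        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.ValidationType = ValidationType.DTD;
        // With a handler attached, validation problems are reported here
        // instead of being thrown as XmlSchemaException.
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        using (var sr = new StringReader(document))
        using (var reader = XmlReader.Create(sr, settings))
        {
            // A well-formedness error still throws XmlException from Read().
            while (reader.Read()) { }
        }
        return errors;
    }
}
```

A tiny inline DTD is enough to try it out without any external files: with "<!DOCTYPE r [<!ELEMENT r (a, a)><!ELEMENT a (#PCDATA)>]>" in front of "<r><a><x/></a><a><y/></a></r>", Collect should return several messages from a single pass, where the exception-based version stops at the first.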