Handling White Space with XmlTextReader

Note

In the .NET Framework version 2.0, the recommended practice is to create XmlReader instances using the XmlReaderSettings class and the Create method. This allows you to take full advantage of all the new features introduced in the .NET Framework 2.0. For more information, see Creating XML Readers.

White space falls into two categories: significant and insignificant. Significant white space is white space that must be preserved from the original document to the output document. It is any white space inside a mixed content model defined by the document type definition (DTD), or any white space inside the scope of the special attribute xml:space when that attribute is set to "preserve". Insignificant white space is white space that does not need to be preserved between reading the document and writing the output document. White space can be any of the following characters:

  • Space (ASCII space, 0x20)

  • Carriage return (CR, 0x0D)

  • Line feed (LF, 0x0A)

  • Horizontal tab (0x09)

The World Wide Web Consortium (W3C) standards dictate that white space be handled differently depending on where in the document it occurs and on the setting of the xml:space attribute. If the characters occur within mixed element content or inside the scope of xml:space="preserve", they must be preserved and passed to the application without modification. Any other white space does not need to be preserved.

The XmlTextReader treats white space as significant only if it occurs within an xml:space="preserve" context. Because the XmlTextReader does not use the content models declared in a DTD, it cannot tell that an element has mixed content, so it does not recognize the significant white space inside such an element. If you need to recognize significant white space in a mixed content element, use the XmlValidatingReader, which parses the DTD and recognizes mixed content elements.

To see what type of white space is in the current node, use the NodeType property. Significant white space is reported with the XmlNodeType value SignificantWhitespace, whereas insignificant white space is reported with the value Whitespace.
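
The following C# sketch illustrates how the NodeType property distinguishes the two kinds of white space. It is a minimal example and not taken from the original documentation: the class name WhitespaceNodeDemo, the method name CollectWhitespaceNodeTypes, and the sample XML string are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class WhitespaceNodeDemo
{
    // Hypothetical helper: returns the node type of every white space
    // node that the XmlTextReader reports for the given XML text.
    public static List<XmlNodeType> CollectWhitespaceNodeTypes(string xml)
    {
        var types = new List<XmlNodeType>();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            // Report every white space node so that both kinds are visible.
            reader.WhitespaceHandling = WhitespaceHandling.All;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    types.Add(reader.NodeType);
                }
            }
        }
        return types;
    }

    public static void Main()
    {
        // Illustrative input: the inner item element switches on xml:space="preserve".
        string xml =
            "<test>\n" +
            "  <item xml:space=\"preserve\">\n" +
            "    <item/>\n" +
            "  </item>\n" +
            "</test>";

        foreach (XmlNodeType type in CollectWhitespaceNodeTypes(xml))
        {
            Console.WriteLine(type);
        }
    }
}
```

White space that falls inside the xml:space="preserve" scope should appear as SignificantWhitespace, and white space outside that scope as Whitespace.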

The WhitespaceHandling property uses an enumeration to determine how white space is returned by the reader. For more information on the enumeration values, see WhitespaceHandling. For more information on the W3C standards, see Section 2.10 of the Extensible Markup Language (XML) 1.0 recommendation at www.w3.org/XML/Group/2000/07/REC-xml-2e-review#sec-white-space.
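
As a rough illustration of the WhitespaceHandling property, the C# sketch below reads the same XML text under each of the three enumeration values (All, Significant, and None) and counts the white space nodes the reader returns. The class name WhitespaceHandlingDemo, the method name CountWhitespaceNodes, and the sample XML are assumptions made for this example only.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceHandlingDemo
{
    // Hypothetical helper: counts the white space nodes (significant or not)
    // that the reader returns under the given WhitespaceHandling setting.
    public static int CountWhitespaceNodes(string xml, WhitespaceHandling handling)
    {
        int count = 0;
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.WhitespaceHandling = handling;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    count++;
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        // Illustrative input with white space both inside and outside
        // an xml:space="preserve" scope.
        string xml =
            "<test>\n" +
            "  <item xml:space=\"preserve\">\n" +
            "    <item/>\n" +
            "  </item>\n" +
            "</test>";

        WhitespaceHandling[] modes =
        {
            WhitespaceHandling.All,
            WhitespaceHandling.Significant,
            WhitespaceHandling.None
        };

        foreach (WhitespaceHandling mode in modes)
        {
            Console.WriteLine("{0}: {1} white space node(s)",
                mode, CountWhitespaceNodes(xml, mode));
        }
    }
}
```

With WhitespaceHandling.Significant only the nodes inside the xml:space="preserve" scope should be returned, and with WhitespaceHandling.None the reader should return no white space nodes at all.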

Here is an example of XML that contains white space and has the xml:space attribute set to "preserve". The newline character is illustrated as a special white space character at the end of the lines in this example.

<!DOCTYPE test [

<!ELEMENT test (item | book)*> <-- element content model -->

<!ELEMENT item (item*)> <-- element content model -->

<!ATTLIST item xml:space (default | preserve) #IMPLIED>

<!ELEMENT book (#PCDATA | b | i)*> <-- mixed content model -->

<!ELEMENT b (#PCDATA)> <-- mixed content model -->

<!ELEMENT i (#PCDATA)> <-- mixed content model -->

]>•

<test>•

••••<item>•

••••••••<item xml:space="preserve">º

ºººººººººººº<item/>º

ºººººººº</item>•

••••</item>•

••••<book>º

ºººººººº<b>This</b>º

ºººººººº<i>is</i>º

ºººººººº<b>a test</b>º

ºººº</book>•

</test>•

The white space shown as (•) is insignificant white space. The white space shown as (º) is significant white space.

Note

The scope of the xml:space attribute changes what would normally be considered insignificant white space to be significant white space.

Similarly, the book element is defined with a mixed content model in the DTD, meaning that it can contain character data interleaved with b and i elements. In a mixed content model, the white space within the book element is considered to be significant white space. The XmlTextReader does not recognize the mixed content model because it does not use the information provided in the DTD. Use the XmlValidatingReader to get significant white space nodes in mixed content models.
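
The C# sketch below shows one way to wrap an XmlTextReader in an XmlValidatingReader so that the DTD is used when white space nodes are classified. It is a minimal example under stated assumptions: the class and method names are hypothetical, the sample document is a reduced version of the one above, and the explicit DtdProcessing setting assumes the .NET Framework 4 or later. XmlValidatingReader is marked obsolete in later versions of the .NET Framework, so the obsolete warning is suppressed here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class MixedContentWhitespaceDemo
{
    // Hypothetical helper: reads the XML with DTD validation turned on and
    // returns the node type of every white space node that is reported.
    public static List<XmlNodeType> CollectWhitespaceNodeTypes(string xml)
    {
        var types = new List<XmlNodeType>();
        var textReader = new XmlTextReader(new StringReader(xml));

        // Assumption: .NET Framework 4 or later, where this property exists.
        textReader.DtdProcessing = DtdProcessing.Parse;

#pragma warning disable 618 // XmlValidatingReader is obsolete in later versions.
        using (var reader = new XmlValidatingReader(textReader))
        {
            reader.ValidationType = ValidationType.DTD;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    types.Add(reader.NodeType);
                }
            }
        }
#pragma warning restore 618
        return types;
    }

    public static void Main()
    {
        // Illustrative document: book is declared with a mixed content model.
        string xml =
            "<!DOCTYPE book [\n" +
            "<!ELEMENT book (#PCDATA | b | i)*>\n" +
            "<!ELEMENT b (#PCDATA)>\n" +
            "<!ELEMENT i (#PCDATA)>\n" +
            "]>\n" +
            "<book>\n" +
            "    <b>This</b>\n" +
            "    <i>is</i>\n" +
            "    <b>a test</b>\n" +
            "</book>";

        foreach (XmlNodeType type in CollectWhitespaceNodeTypes(xml))
        {
            Console.WriteLine(type);
        }
    }
}
```

Because no ValidationEventHandler is attached in this sketch, a document that is not valid against its DTD causes the reader to throw an exception instead of reporting the error through an event.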

See Also

Concepts

Full Content Reads Using Character Streams

Document Type Declaration Information

Attribute Value Normalization

Exception Handling Using XmlException in XmlTextReader

Reading XML with the XmlReader