Mastering Text in Open XML WordprocessingML Documents

Article
07/24/2014

Summary: Understand how to reliably retrieve text from Open XML WordprocessingML documents.

In this article
Introduction
Understanding Text Content in WordprocessingML
Best Practice: Accept Revisions before Processing
Understanding WordprocessingML Abstractions
How Varying Levels of Hierarchy Increase Complexity in Processing
Introducing the LogicalChildrenContent Axis Method
Implementing the DescendantsTrimmed Axis Method
Defining Logical Children
Using the LogicalChildrenContent Axis Method
ExamineDocumentContent Example
Retrieving the Text of Paragraphs
Two Useful Overloads of the LogicalChildrenContent Axis Method
Identity of XML Elements Returned by the LogicalChildrenContent Method
Searching Documents for Text
Conclusion
Additional Resources

Published: May 2010

Provided by: Eric White, Microsoft Corporation

Introduction

Processing text in Open XML word-processing documents seems very simple at first: you have the body of the document, paragraphs and tables in the body, and rows and cells in tables, exactly like HTML, right? Then it seems very hard: you see the markup for revision tracking, numbered and bulleted lists, and content controls, as well as markup that does not affect text, such as bookmarks and comments. Styles seem like they do not affect text, but if there are numbered and bulleted lists, they do. The truth is somewhere in the middle. There is a lot to track, but each one of these features, taken by itself, is not very difficult.

That said, there are some basic ideas and abstractions that can simplify how you think about word-processing markup. These abstractions are relevant regardless of whether you are working with word-processing markup by using the Open XML SDK 2.0 strongly-typed object model, by using the Open XML SDK 2.0 for Microsoft Office with LINQ to XML, or by using some other platform, such as Java or PHP. We can write code that addresses these abstractions. The code can expose exactly those elements that you are interested in, in an organized, predictable manner. In this article, I present Microsoft Visual C# code written both with LINQ to XML and with the Open XML SDK 2.0 strongly-typed object model. Because the semantics of some useful methods are defined carefully, they are easy to implement in whatever language and platform you are using.

Understanding Text Content in WordprocessingML

In the main document part, all text is contained in paragraphs. We find paragraphs in three locations: as a child of the body element (w:body), as a child of a cell in a table (w:tc), and as a child of text box content (w:txbxContent). A cell can itself contain a table. There are other instances of text in the main document part: pictures can contain alternative text, and SmartArt graphics contain text. However, those pieces of text are more isolated, and the issues around assembling multiple strings into a single string do not apply to them.

One of the interesting dynamics of text content is that a paragraph can contain a run, the run can contain a drawing, and the drawing can contain a text box, which can then contain paragraphs. This is the one and only circumstance in Open XML WordprocessingML markup where you find a paragraph element as a descendant of another paragraph element. More on this, and the challenges it presents, later.

Best Practice: Accept Revisions before Processing

The first and most important point about simplifying how you process WordprocessingML content is that you should first accept all tracked revisions. For more information about the semantics of tracked revisions, see Accepting Revisions in Open XML Word-Processing Documents. Also see the Microsoft Visual C# 3.0 code sample for accepting tracked revisions at the PowerTools for Open XML project on CodePlex. Click the Downloads tab, and then download RevisionAccepter.zip.

The great thing about accepting tracked revisions first is that after you do this, you can safely ignore more than 40 elements that complicate how you process content. Many of those elements have complicated semantics. Therefore, it is much better to process them first, and then process the contents of the document. Until I wrote that MSDN article, and wrote the code to accept revisions, I did not appreciate all of the cases in which more simplistic approaches result in retrieving the wrong text for a paragraph.

In many circumstances, you want to query a document without modifying it. You can use a simple technique of reading the document into a byte array, creating a resizable memory stream from the byte array, and then opening the document from the memory stream. For more information about how to do this, see the blog post Simplifying Open XML WordprocessingML Queries by First Accepting Revisions. This example lets you accept revisions and query the document without touching the actual document on disk.
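The word "resizable" in the technique above matters. As a minimal sketch that uses only System.IO (the class name, the method names, and the placeholder byte values are assumptions made for illustration, not part of the article's download), a MemoryStream constructed directly over a byte array cannot grow, whereas an empty MemoryStream into which the bytes are written can:

```csharp
using System;
using System.IO;

public static class ResizableStreamSketch
{
    // Copies the bytes into an expandable MemoryStream and rewinds it.
    public static MemoryStream ToResizableStream(byte[] bytes)
    {
        MemoryStream stream = new MemoryStream();   // parameterless constructor: expandable
        stream.Write(bytes, 0, bytes.Length);
        stream.Position = 0;
        return stream;
    }

    // Tries to append one byte at the end of the stream.
    // Side effect: on success, the stream is one byte longer.
    public static bool CanGrow(MemoryStream stream)
    {
        try
        {
            stream.Seek(0, SeekOrigin.End);
            stream.WriteByte(0);
            return true;
        }
        catch (NotSupportedException)
        {
            return false;   // a stream created over a fixed byte array cannot expand
        }
    }

    public static void Main()
    {
        byte[] bytes = { 1, 2, 3, 4 };   // placeholder for File.ReadAllBytes("Test.docx")
        using (MemoryStream fixedSize = new MemoryStream(bytes))
        using (MemoryStream resizable = ToResizableStream(bytes))
        {
            Console.WriteLine("Fixed-size stream can grow: {0}", CanGrow(fixedSize));
            Console.WriteLine("Resizable stream can grow:  {0}", CanGrow(resizable));
        }
    }
}
```

Because accepting revisions can change the size of the package, the resizable form is the one to hand to WordprocessingDocument.Open when the document is opened for editing in memory.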

Understanding WordprocessingML Abstractions

To help understand WordprocessingML markup, let's define some abstractions:

Block-level content container
Block-level content
Run-level content container
Run-level content
Sub-run-level content

After you accept tracked revisions, and decide to ignore some elements that are only applicable in advanced scenarios, you are left with the following list of elements to process.

Block-Level Content Containers

Block-level content containers are those WordprocessingML elements that contain block-level content such as paragraphs or tables. There are only three block-level content container elements that occur in the main document part:

Block-Level Content Container Elements

Element	Element Name	Open XML SDK 2.0 Class Name Namespace: DocumentFormat.OpenXml.Wordprocessing
Body	w:body	Body
Table Cell	w:tc	TableCell
Text Box Content	w:txbxContent	TextBoxContent

There are other block-level content containers in WordprocessingML that contain paragraphs, such as the w:comment element in the comments part and the w:hdr element in the header part. However, they are not in the main document part. Therefore, they do not present the same processing challenges.

Block-Level Content

Block-level content elements are the WordprocessingML elements that occupy all the width of the layout surface. They are bounded on the top and bottom, and occupy the available width from left to right of the available space. As an example, in the usual course of layout of a document, you do not see two paragraphs on the same physical line, or see a paragraph and a table side-by-side.

There appear to be exceptions to this rule, but they are not really exceptions. One example in which you see paragraphs side-by-side is a multi-column page layout. In this case, the available width for layout of the paragraph or table is the column, not the complete page. Another example is when there is a text box on the page, but in this case, the available width for layout of the block-level content does not include the space reserved for the text box. Further, the text box itself has its own layout surface.

After accepting revisions, there are only two block-level content elements.

Block-Level Content Elements

Element	Element Name	Open XML SDK 2.0 Class Name Namespace: DocumentFormat.OpenXml.Wordprocessing
Paragraph	w:p	Paragraph
Table	w:tbl	Table

Two additional block-level content elements that I do not address in this article are those for math formulas. Processing the text content of formulas is not a common need. There is little demand for collecting the text of a formula and aggregating it into a single string (as you do with a paragraph). Instead, text in a formula must be taken in the context of the formula.

Run-Level Content Containers

After accepting revisions, there is exactly one element that is a run-level content container, which is the paragraph (w:p) element. The run-level content container defines the space in which run-level content is laid out from left-to-right, or in the appropriate cases, from right-to-left. As an example, multiple text runs in a paragraph are laid out horizontally with wrapping, in their respective fonts, as appropriate. Note that a paragraph is both a block-level content element and a run-level content container element, whereas a table is only a block-level content element and not a run-level content container element.

Run-Level Content Container Elements

Element	Element Name	Open XML SDK 2.0 Class Name Namespace: DocumentFormat.OpenXml.Wordprocessing
Paragraph	w:p	Paragraph

Run-Level Content

Run-level content is that content inside a paragraph that has formatting specific to a subsection of the paragraph. For example, a run is in a specific font. After accepting revisions, there are only three run-level content elements.

Run-Level Content Elements

Element	Element Name	Open XML SDK 2.0 Class Name Namespace: DocumentFormat.OpenXml.Wordprocessing
Text Run	w:r	Run
VML Drawing	w:pict	Picture
DrawingML Object	w:drawing	Drawing

One non-intuitive aspect of this list of elements is that a Vector Markup Language (VML) drawing object or a DrawingML object can be either run-level content or sub-run-level content. Both can also contain, as a descendant, the w:txbxContent element, which is a block-level content container.

Sub-Run-Level Content

Sub-run-level content consists of those WordprocessingML elements that are part of a run. As an example, a run can contain multiple text elements (w:t).

Sub-Run-Level Content Elements

Element	Element Name	Open XML SDK 2.0 Class Name Namespace: DocumentFormat.OpenXml.Wordprocessing
Break	w:br	Break
Carriage Return	w:cr	CarriageReturn
Date Block – Long Day Format	w:dayLong	DayLong
Date Block – Short Day Format	w:dayShort	DayShort
DrawingML Object	w:drawing	Drawing
Date Block – Long Month Format	w:monthLong	MonthLong
Date Block – Short Month Format	w:monthShort	MonthShort
Non-breaking Hyphen Character	w:noBreakHyphen	NoBreakHyphen
Page Number Block	w:pgNum	PageNumber
VML Drawing	w:pict	Picture
Absolute Position Tab Character	w:pTab	PositionalTab
Optional Hyphen Character	w:softHyphen	SoftHyphen
Symbol Character	w:sym	SymbolChar
Text	w:t	Text
Tab Character	w:tab	TabChar
Date Block – Long Year Format	w:yearLong	YearLong
Date Block – Short Year Format	w:yearShort	YearShort

This list also contains the VML drawing and DrawingML objects, which can contain a w:txbxContent element (a block-level content container) as a descendant.

How Varying Levels of Hierarchy Increase Complexity in Processing

A simple example can demonstrate the problem that we are trying to solve. The following document has a content control and a text box in the first paragraph:

Figure 1. Document with a content control and text box

The following code example shows the markup for this paragraph. For more information about this markup, see ISO/IEC 29500-1:2008 or Standard ECMA-376 Office Open XML File Formats, Second Edition (ECMA-376 2nd edition).

Note

Extraneous markup is omitted to better illustrate the problem.

<w:p>
  <w:pPr>
    <w:ind w:right="3600"/>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:noProof/>
    </w:rPr>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <w:drawing>
          <!-- . . . -->
          <wps:txbx>
            <w:txbxContent>
              <w:p>
                <w:r>
                  <w:t>Text in text box</w:t>
                </w:r>
              </w:p>
            </w:txbxContent>
          </wps:txbx>
          <!-- . . . -->
        </w:drawing>
      </mc:Choice>
      <mc:Fallback>
        <w:pict>
          <!-- . . . -->
          <v:textbox>
            <w:txbxContent>
              <w:p>
                <w:r>
                  <w:t>Text in text box</w:t>
                </w:r>
              </w:p>
            </w:txbxContent>
          </v:textbox>
          <w10:wrap type="square"/>
          <!-- . . . -->
        </w:pict>
      </mc:Fallback>
    </mc:AlternateContent>
  </w:r>
  <w:sdt>
    <w:sdtContent>
      <w:r>
        <w:t>Text in content control.</w:t>
      </w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r>
    <w:t xml:space="preserve"> Text following the content control.</w:t>
  </w:r>
</w:p>

In this example, the text for the text box is in the same paragraph as the text that is inside the content control. It is also in the same paragraph as text that is outside the content control. The content control causes the text elements to occur at different levels of hierarchy. You must write code that addresses this difference in hierarchy. This is only one example that demonstrates the problem. There are several WordprocessingML abstractions that can cause text content to occur at different levels of hierarchy. Therefore, we want to develop a generalized solution to this problem.

Note

It is not correct to retrieve the text of the paragraph (w:p) element using the Value property.

using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", false))
{
    XElement root = doc.MainDocumentPart.GetXDocument().Root;
    XElement paragraph = root.Descendants(W.p).First();
    Console.WriteLine(paragraph.Value);
}

The returned text is incorrect.

Figure 2. Incorrect results of using the paragraph value

The problem is not that we see the contents of the text box two times. The problem is that we see it at all. The text of the text box is not really part of the paragraph. It stands on its own.

You cannot iterate through the child runs of the paragraph because the content control causes the text runs to occur at different levels in the hierarchy of the markup.

using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", false))
{
    XElement root = doc.MainDocumentPart.GetXDocument().Root;
    XElement paragraph = root.Descendants(W.p).First();
    StringBuilder sb = new StringBuilder();
    foreach (XElement t in paragraph.Elements(W.r).Elements(W.t))
        sb.Append((string)t);
    Console.WriteLine(sb.ToString());
}

This does not include the text in the content control.

Figure 3. Incorrect results of concatenating child runs of a paragraph

You could write code that handles this as a special case. However, this does not return the correct results for any of the other constructs that cause text content to occur at different levels of hierarchy. Instead, we need generalized abstractions that facilitate processing text content of documents.
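Both incorrect approaches can be reproduced without a document on disk. The following is a minimal sketch that parses a simplified paragraph with LINQ to XML. The class and method names are hypothetical, and the markup is reduced for illustration: a w:pict element wraps the text box content directly, and most real-world markup is omitted.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NaiveTextRetrievalSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified paragraph: a text box in the first run, a content control, then a plain run.
    public static XElement BuildParagraph()
    {
        return XElement.Parse(
            @"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:r>
                  <w:pict>
                    <w:txbxContent>
                      <w:p><w:r><w:t>Text in text box</w:t></w:r></w:p>
                    </w:txbxContent>
                  </w:pict>
                </w:r>
                <w:sdt>
                  <w:sdtContent>
                    <w:r><w:t>Text in content control.</w:t></w:r>
                  </w:sdtContent>
                </w:sdt>
                <w:r>
                  <w:t xml:space='preserve'> Text following the content control.</w:t>
                </w:r>
              </w:p>");
    }

    // Approach 1: the Value property concatenates every descendant text node,
    // so the text box text leaks into the paragraph text.
    public static string ValueOfParagraph(XElement paragraph)
    {
        return paragraph.Value;
    }

    // Approach 2: only w:t elements of runs that are direct children of the paragraph,
    // so the run inside the content control is missed.
    public static string TextOfChildRuns(XElement paragraph)
    {
        return string.Concat(
            paragraph.Elements(w + "r").Elements(w + "t").Select(t => (string)t));
    }

    public static void Main()
    {
        XElement paragraph = BuildParagraph();
        Console.WriteLine(">{0}<", ValueOfParagraph(paragraph));
        Console.WriteLine(">{0}<", TextOfChildRuns(paragraph));
    }
}
```

For this simplified markup, the first line printed includes "Text in text box" ahead of the paragraph's own text, and the second line contains only " Text following the content control."; neither is the text that a reader of the document would attribute to the paragraph.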

Standard ECMA-376: Office Open XML File Formats, First Edition (ECMA-376) has the same issues associated with where content is in the XML hierarchy. The element that contains the text box as a descendant is a sibling to other sub-run-level content in the paragraph. The abstractions described in this article are equally applicable to ECMA-376 markup.

<w:p>
  <w:pPr>
    <w:ind w:right="3600"/>
  </w:pPr>
  <w:r>
    <w:pict>
      <v:shape . . .>
        <v:textbox>
          <w:txbxContent>
            <w:p>
              <w:r>
                <w:t>Text in text box</w:t>
              </w:r>
            </w:p>
          </w:txbxContent>
        </v:textbox>
        <w10:wrap type="square"/>
      </v:shape>
    </w:pict>
  </w:r>
  <w:sdt>
    <w:sdtContent>
      <w:r w:rsidR="00C578DC">
        <w:t>Text in content control.</w:t>
      </w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r>
    <w:t xml:space="preserve"> Text following the content control.</w:t>
  </w:r>
</w:p>

Introducing the LogicalChildrenContent Axis Method

To address this problem, I wrote an axis method that returns the logical children content for an element. The logical children include content that is contained in other elements that increase the level of hierarchy of the content, such as a content control. Therefore, this logical children content axis differs from the LINQ to XML (or XPath) children axis. The actual elements that increase the level of hierarchy (w:sdt, w:fldSimple, w:hyperlink) are not included in the returned collection. We want the actual content, not those other elements that contain content.

Tip

I borrow the term, axis method, from LINQ to XML. Axis, in the context of XML documents, is the notion that for any given element, there is a specific set of related elements, and an axis method returns a collection of those related elements. For example, for a given XML element, there is a specific set of child elements, a specific set of descendants, and a specific set of ancestors. Descendants, child elements, and ancestors are the basis for some LINQ to XML axis methods.
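As a small illustration of the built-in axes that the tip refers to, the following sketch calls the LINQ to XML Elements, Descendants, and Ancestors methods on an un-namespaced stand-in for document markup. The class name and the Names helper are hypothetical and exist only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AxisMethodsSketch
{
    // Formats the local names of a collection of elements as a comma-separated list.
    public static string Names(IEnumerable<XElement> elements)
    {
        return string.Join(", ", elements.Select(e => e.Name.LocalName));
    }

    public static void Main()
    {
        XElement body = XElement.Parse(
            "<body><p><r><t>One</t></r><r><t>Two</t></r></p></body>");
        XElement firstText = body.Descendants("t").First();

        // Children axis: only the direct child elements.
        Console.WriteLine("Children of body:    {0}", Names(body.Elements()));
        // Descendants axis: every element below body, in document order.
        Console.WriteLine("Descendants of body: {0}", Names(body.Descendants()));
        // Ancestors axis: from the parent outward to the root.
        Console.WriteLine("Ancestors of t:      {0}", Names(firstText.Ancestors()));
    }
}
```

The children axis yields only p, the descendants axis yields p, r, t, r, t, and the ancestors axis for the first t yields r, p, body. The logical children content axis described in this article is a custom axis in the same spirit.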

The following listing highlights the elements that are in the returned collection if you retrieve the logical children content of the body element. The paragraph element inside the text box is not included in the logical children of the body, because that paragraph is a logical child of the text box content element (w:txbxContent) that contains it. The text box content element is, in turn, a logical child of the drawing object that contains it (w:drawing, or w:pict in VML markup), which is a logical descendant of the run that contains it.

<w:body>
  <w:sdt>
    <w:sdtPr>
      <w:id w:val="172579038"/>
      <w:placeholder>
        <w:docPart w:val="DefaultPlaceholder_22675703"/>
      </w:placeholder>
    </w:sdtPr>
    <w:sdtEndPr/>
    <w:sdtContent>
      <w:p> <!-- logical child of w:body -->
        <w:r>
          <w:t>Paragraph in content control.</w:t>
        </w:r>
      </w:p>
    </w:sdtContent>
  </w:sdt>
  <w:p> <!-- logical child of w:body -->
    <w:pPr>
      <w:ind w:right="3600"/>
    </w:pPr>
    <w:r>
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <mc:AlternateContent>
        <mc:Choice Requires="wps">
          <w:drawing>
            . . .
            <wps:txbx>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </wps:txbx>
            . . .
          </w:drawing>
        </mc:Choice>
        <mc:Fallback>
          <w:pict>
            . . .
            <v:textbox>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </v:textbox>
            <w10:wrap type="square"/>
            . . .
          </w:pict>
        </mc:Fallback>
      </mc:AlternateContent>
    </w:r>
    <w:sdt>
      <w:sdtContent>
        <w:r>
          <w:t>Text in content control.</w:t>
        </w:r>
      </w:sdtContent>
    </w:sdt>
    <w:r>
      <w:t xml:space="preserve"> Text following the content control.</w:t>
    </w:r>
  </w:p>
  <w:p> <!-- logical child of w:body -->
    <w:r>
      <w:t>Text in a following paragraph.</w:t>
    </w:r>
  </w:p>
</w:body>

The following listing highlights the logical children content of the second paragraph. None of the descendants of the first run are included in the logical children.

  . . .
  <w:p>
    <w:pPr>
      <w:ind w:right="3600"/>
    </w:pPr>
    <w:r> <!-- logical child of the paragraph -->
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <mc:AlternateContent>
        <mc:Choice Requires="wps">
          <w:drawing>
            . . .
            <wps:txbx>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </wps:txbx>
            . . .
          </w:drawing>
        </mc:Choice>
        <mc:Fallback>
          <w:pict>
            . . .
            <v:textbox>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </v:textbox>
            <w10:wrap type="square"/>
            . . .
          </w:pict>
        </mc:Fallback>
      </mc:AlternateContent>
    </w:r>
    <w:sdt>
      <w:sdtContent>
        <w:r> <!-- logical child of the paragraph -->
          <w:t>Text in content control.</w:t>
        </w:r>
      </w:sdtContent>
    </w:sdt>
    <w:r> <!-- logical child of the paragraph -->
      <w:t xml:space="preserve"> Text following the content control.</w:t>
    </w:r>
  </w:p>

The logical child element of the first run in this paragraph is the mc:AlternateContent element.

    <w:r>
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <mc:AlternateContent> <!-- logical child of the run -->
        <mc:Choice Requires="wps">
          <w:drawing>
            . . .
            <wps:txbx>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </wps:txbx>
            . . .
          </w:drawing>
        </mc:Choice>
        <mc:Fallback>
          <w:pict>
            . . .
            <v:textbox>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </v:textbox>
            <w10:wrap type="square"/>
            . . .
          </w:pict>
        </mc:Fallback>
      </mc:AlternateContent>
    </w:r>

It is helpful to have mc:AlternateContent as one of the logical children content elements, because it can contain information about alternative approaches to processing the content. The logical child of the mc:AlternateContent element is its contained drawing:

    <w:r>
      <w:rPr>
        <w:noProof/>
      </w:rPr>
      <mc:AlternateContent>
        <mc:Choice Requires="wps">
          <w:drawing> <!-- logical child of mc:AlternateContent -->
            . . .
            <wps:txbx>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </wps:txbx>
            . . .
          </w:drawing>
        </mc:Choice>
        <mc:Fallback>
          <w:pict>
            . . .
            <v:textbox>
              <w:txbxContent>
                <w:p>
                  <w:r>
                    <w:t>Text in text box</w:t>
                  </w:r>
                </w:p>
              </w:txbxContent>
            </v:textbox>
            <w10:wrap type="square"/>
            . . .
          </w:pict>
        </mc:Fallback>
      </mc:AlternateContent>
    </w:r>

The logical child of the DrawingML object is the text box content element (w:txbxContent), whose logical child is the enclosed paragraph. By defining the logical children axis in this manner, it is easy to assemble the text accurately for any paragraph.

Implementing the DescendantsTrimmed Axis Method

The first step to implement the logical children axis method is to implement a method that returns a collection of the descendant elements where the descendants are trimmed. Any element that is a descendant of a specified tag is not included in the returned collection. Another overload of the DescendantsTrimmed method takes a delegate as an argument. It lets you specify a lambda expression as a predicate so that you can trim based on several tags. I define the semantics of this method in such a way that the trimmed elements themselves are included in the returned collection.

The following code example demonstrates the semantics of the DescendantsTrimmed axis method. In this axis method, elements that are descendants of the txbxContent element are trimmed. The code example that displays the element name for each element counts the ancestors to correctly indent element names.

XElement doc = XElement.Parse(
    @"<body>
        <p>
          <r>
            <t>Text before the text box.</t>
          </r>
          <r>
            <pict>
              <txbxContent>
                <p>
                  <r>
                    <t>Text in a text box.</t>
                  </r>
                </p>
              </txbxContent>
            </pict>
          </r>
          <r>
            <t>Text after the text box.</t>
          </r>
        </p>
      </body>");
foreach (XElement c in doc.DescendantsTrimmed("txbxContent"))
    Console.WriteLine("{0}{1}", "".PadRight(c.Ancestors().Count() * 2), c.Name);

This example displays an indented list of the name of each element in the returned collection.

  p
    r
      t
    r
      pict
        txbxContent
    r
      t

Defining Logical Children

Using the DescendantsTrimmed axis method, you can implement an axis method that returns only the logical children of a specific set of elements. Here is how I define logical children:

The only logical child of the w:document element is the w:body element.
The logical children of a block-level content container (w:body, w:tc, w:txbxContent) are block-level content (w:p, w:tbl).
The logical children of a table (w:tbl) are its rows (w:tr).
The logical children of a row (w:tr) are its cells (w:tc).
The logical children of a paragraph (w:p) are its runs (w:r).
The logical children of a run (w:r) are sub-run-level content (w:t, w:pict, w:drawing, and so on); see the list earlier in this article. In addition, to accommodate Office 2010 and ISO/IEC 29500, the mc:AlternateContent element is also a logical child of a run. I implemented the accompanying code so that it works with both ECMA-376 1st edition and ISO/IEC 29500 (ECMA-376 2nd edition).
The logical child of an alternate content element is a drawing or picture in the mc:Choice element. You want to process the contents of the mc:Choice element, not the mc:Fallback element.
The logical children of a VML drawing object (w:pict) or a DrawingML object (w:drawing) are any contained text box content elements (w:txbxContent). If you have a scenario where you must process other specific parts of a VML object or DrawingML object, you can redefine the LogicalChildrenContent method to include the elements that you must process in the returned collection.
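The rules in the list above can be sketched as a single LINQ to XML extension method. This is an illustration under stated assumptions, not the implementation from the article's download: the Nearest helper, the ParagraphText and SampleParagraph methods, and the class name are hypothetical, and the list of sub-run-level content is abbreviated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LogicalChildrenSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace mc =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";

    // Abbreviated list of sub-run-level content; see the table earlier in the article.
    static readonly XName[] subRunContent =
    {
        w + "t", w + "tab", w + "br", w + "cr", w + "sym",
        w + "pict", w + "drawing", mc + "AlternateContent"
    };

    // Hypothetical helper: nearest matching descendants. It does not look inside
    // a match, and it ignores any subtree for which skip returns true.
    static IEnumerable<XElement> Nearest(XElement element,
        Func<XElement, bool> match, Func<XElement, bool> skip)
    {
        foreach (XElement child in element.Elements())
        {
            if (skip(child))
                continue;
            if (match(child))
                yield return child;
            else
                foreach (XElement e in Nearest(child, match, skip))
                    yield return e;
        }
    }

    static IEnumerable<XElement> Nearest(XElement element, params XName[] names)
    {
        return Nearest(element, e => names.Contains(e.Name), e => false);
    }

    public static IEnumerable<XElement> LogicalChildrenContent(this XElement element)
    {
        XName n = element.Name;
        if (n == w + "document")
            return Nearest(element, w + "body").Take(1);
        if (n == w + "body" || n == w + "tc" || n == w + "txbxContent")
            return Nearest(element, w + "p", w + "tbl");
        if (n == w + "tbl")
            return Nearest(element, w + "tr");
        if (n == w + "tr")
            return Nearest(element, w + "tc");
        if (n == w + "p")
            return Nearest(element, w + "r");
        if (n == w + "r")
            return Nearest(element, subRunContent);
        if (n == mc + "AlternateContent")   // use mc:Choice, ignore mc:Fallback
            return Nearest(element,
                e => e.Name == w + "drawing" || e.Name == w + "pict",
                e => e.Name == mc + "Fallback");
        if (n == w + "pict" || n == w + "drawing")
            return Nearest(element, w + "txbxContent");
        return Enumerable.Empty<XElement>();
    }

    // Text of a paragraph: the w:t elements among the logical children of its runs.
    public static string ParagraphText(XElement paragraph)
    {
        return string.Concat(paragraph.LogicalChildrenContent()
            .SelectMany(run => run.LogicalChildrenContent())
            .Where(e => e.Name == w + "t")
            .Select(t => (string)t));
    }

    // Simplified sample markup: a text box, a content control, and a plain run.
    public static XElement SampleParagraph()
    {
        return XElement.Parse(
            @"<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
                <w:r><w:pict><w:txbxContent>
                  <w:p><w:r><w:t>Text in text box</w:t></w:r></w:p>
                </w:txbxContent></w:pict></w:r>
                <w:sdt><w:sdtContent>
                  <w:r><w:t>Text in content control.</w:t></w:r>
                </w:sdtContent></w:sdt>
                <w:r><w:t xml:space='preserve'> Text following the content control.</w:t></w:r>
              </w:p>");
    }

    public static void Main()
    {
        Console.WriteLine(ParagraphText(SampleParagraph()));
    }
}
```

For the sample paragraph, the assembled text includes the run inside the content control and excludes the text box, whose paragraph remains reachable as a logical descendant of the first run.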

Using the LogicalChildrenContent Axis Method

Before examining the implementation of the LogicalChildrenContent method, it is useful to see how it is used.

The following figure shows the sample document that presented challenges.

Figure 4. Document with a content control and a text box

ExamineDocumentContent Example

This first example recursively iterates through all logical content in a document, displaying the name of each element, with correct indenting. If the element is a text element (w:t), then the function prints the text content of the element.

Notice that this example first accepts revisions by calling the RevisionAccepter.AcceptRevisions method. The example opens the word-processing document by first reading it into a byte array, and then initializing a resizable memory stream from the byte array. This allows the example to open the document with the editable parameter set to true, which in turn allows it to accept revisions. If the example opened the document directly for editing, accepting revisions would modify the existing document, which may well be an undesired side effect. If the example opened the document in read-only mode, accepting revisions would fail by throwing an exception.

static void IterateContent(XElement element, int depth)
{
    if (element.Name == W.t)
        Console.WriteLine("{0}{1} >{2}<", "".PadRight(depth * 2), element.Name.LocalName,
            (string)element);
    else
        Console.WriteLine("{0}{1}", "".PadRight(depth * 2), element.Name.LocalName);
    foreach (XElement item in element.LogicalChildrenContent())
        IterateContent(item, depth + 1);
}

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            IterateContent(doc.MainDocumentPart.GetXDocument().Root, 0);
        }
    }
}

When I run this example for the problem document, I see the following.

document
  body
    p
      r
        t >Paragraph in <
      r
        t >content control.<
    p
      r
        AlternateContent
          drawing
            txbxContent
              p
                r
                  t >Text in text box<
      r
        t >Text in content control. <
      r
        t >Text following the content control.<
    p
      r
        t >Text in a following<
      r
        t > paragraph.<

We see that through various editing sessions, various runs were split into multiple runs. We can see the text box and its content at the appropriate location.

We can implement the same axis methods using the strongly-typed object model of the Open XML SDK 2.0 for Microsoft Office. The code to use the logical content axis looks as follows.

static void IterateContent(OpenXmlElement element, int depth)
{
    if (element.GetType() == typeof(Text))
        Console.WriteLine("{0}{1} >{2}<", "".PadRight(depth * 2),
            element.GetType().Name, ((Text)element).Text);
    else
        Console.WriteLine("{0}{1}", "".PadRight(depth * 2),
            element.GetType().Name);
    foreach (var item in element.LogicalChildrenContent())
        IterateContent(item, depth + 1);
}

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            IterateContent(doc.MainDocumentPart.Document, 0);
        }
    }
}

When I run this example for the problem document, I see the following.

Document
  Body
    Paragraph
      Run
        Text >Paragraph in <
      Run
        Text >content control.<
    Paragraph
      Run
        AlternateContent
          Drawing
            TextBoxContent
              Paragraph
                Run
                  Text >Text in text box<
      Run
        Text >Text in content control. <
      Run
        Text >Text following the content control.<
    Paragraph
      Run
        Text >Text in a following<
      Run
        Text > paragraph.<

Retrieving the Text of Paragraphs

You often want to process a document, and in a single operation, retrieve all paragraphs, all runs under each paragraph, and all text elements for each run, and assemble the associated text of each paragraph.

To make this as easy as possible, I will write another overload of the LogicalChildrenContent method. It is useful to write it as an extension method that takes as an argument a collection of content elements, and returns as a collection the set of logical child elements of each element in the source collection. This extension method is comparable to the Elements extension methods in LINQ to XML that return all child elements of every element in a source collection. The extension method is very easy to implement.

public static IEnumerable<XElement> LogicalChildrenContent(this IEnumerable<XElement> source)
{
    foreach (XElement e1 in source)
        foreach (XElement e2 in e1.LogicalChildrenContent())
            yield return e2;
}

The same axis method, implemented by using the strongly typed object model of the Open XML SDK 2.0 for Microsoft Office, looks as follows.

public static IEnumerable<OpenXmlElement> LogicalChildrenContent(
    this IEnumerable<OpenXmlElement> source)
{
    foreach (OpenXmlElement e1 in source)
        foreach (OpenXmlElement e2 in e1.LogicalChildrenContent())
            yield return e2;
}

It is also useful to use another extension method, the StringConcatenate method, which is a string aggregation operation.

public static string StringConcatenate(this IEnumerable<string> source)
{
    StringBuilder sb = new StringBuilder();
    foreach (string s in source)
        sb.Append(s);
    return sb.ToString();
}
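
As a quick illustration of how this aggregation behaves, the following sketch applies StringConcatenate to two strings that stand in for the text of two adjacent runs. The class names and the BuildSampleText helper are hypothetical, introduced only for this example. The strings are taken from the sample output shown earlier, where a sentence is split across runs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    // Same aggregation as above: append every string in order, with no separator.
    public static string StringConcatenate(this IEnumerable<string> source)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }
}

public static class StringConcatenateDemo
{
    // Hypothetical sample data standing in for the text of two adjacent runs.
    public static string BuildSampleText()
    {
        var runTexts = new List<string> { "Text in a following", " paragraph." };
        return runTexts.StringConcatenate();
    }

    public static void Main()
    {
        // Prints: >Text in a following paragraph.<
        Console.WriteLine(">{0}<", BuildSampleText());
    }
}
```

No separator is inserted between the strings. That is what you want here, because the runs already carry their own spacing (note the leading space in the second string).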

Now we can write a small program to retrieve all child paragraphs of the body element, and retrieve the text of each paragraph. By calling the RevisionAccepter.AcceptRevisions method first, and then using the LogicalChildrenContent axis methods, we know that we can correctly retrieve the text of each paragraph.

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            XElement root = doc.MainDocumentPart.GetXDocument().Root;
            XElement body = root.LogicalChildrenContent().First();
            foreach (XElement blockLevelContentElement in body.LogicalChildrenContent())
            {
                if (blockLevelContentElement.Name == W.p)
                {
                    var text = blockLevelContentElement
                        .LogicalChildrenContent()
                        .Where(e => e.Name == W.r)
                        .LogicalChildrenContent()
                        .Where(e => e.Name == W.t)
                        .Select(t => (string)t)
                        .StringConcatenate();
                    Console.WriteLine("Paragraph text >{0}<", text);
                    continue;
                }
                // If element is not a paragraph, it must be a table.
                Console.WriteLine("Table");
            }
        }
    }
}

When I run this program for the problem document, I see the following.

Paragraph text >Paragraph in content control.<
Paragraph text >Text in content control. Text following the content control.<
Paragraph text >Text in a following paragraph.<

The example that uses the Open XML SDK 2.0 for Microsoft Office looks as follows.

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test7.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            OpenXmlElement root = doc.MainDocumentPart.Document;
            Body body = (Body)root.LogicalChildrenContent().First();
            foreach (OpenXmlElement blockLevelContentElement in
                body.LogicalChildrenContent())
            {
                if (blockLevelContentElement is Paragraph)
                {
                    var text = blockLevelContentElement
                        .LogicalChildrenContent()
                        .OfType<Run>()
                        .Cast<OpenXmlElement>()
                        .LogicalChildrenContent()
                        .OfType<Text>()
                        .Select(t => t.Text)
                        .StringConcatenate();
                    Console.WriteLine("Paragraph text >{0}<", text);
                    continue;
                }
                // If element is not a paragraph, it must be a table.
                Console.WriteLine("Table");
            }
        }
    }
}

This example does not examine runs for descendant block-level content containers, so, as designed, it does not display the text of the text box.

Two Useful Overloads of the LogicalChildrenContent Axis Method

You can simplify the last example by defining two additional overloads of the LogicalChildrenContent axis method. A common operation is to retrieve all of the runs of a paragraph, and then all of the text elements of each run. Therefore, two additional extension methods that filter by a specified element name simplify the code further.

public static IEnumerable<XElement> LogicalChildrenContent(this XElement element,
    XName name)
{
    return element.LogicalChildrenContent().Where(e => e.Name == name);
}

public static IEnumerable<XElement> LogicalChildrenContent(
    this IEnumerable<XElement> source, XName name)
{
    foreach (XElement e1 in source)
        foreach (XElement e2 in e1.LogicalChildrenContent(name))
            yield return e2;
}

When using these extension methods, the query simplifies as follows.

var text = blockLevelContentElement
    .LogicalChildrenContent(W.r)
    .LogicalChildrenContent(W.t)
    .Select(t => (string)t)
    .StringConcatenate();

This query produces the same output as the previous example.

These additional extension methods, implemented by using the Open XML SDK 2.0 for Microsoft Office, are as follows.

public static IEnumerable<OpenXmlElement> LogicalChildrenContent(
    this OpenXmlElement element, Type typeName)
{
    return element.LogicalChildrenContent().Where(e => e.GetType() == typeName);
}

public static IEnumerable<OpenXmlElement> LogicalChildrenContent(
    this IEnumerable<OpenXmlElement> source, Type typeName)
{
    foreach (OpenXmlElement e1 in source)
        foreach (OpenXmlElement e2 in e1.LogicalChildrenContent(typeName))
            yield return e2;
}

The simplified query looks as follows.

var text = blockLevelContentElement
    .LogicalChildrenContent(typeof(Run))
    .LogicalChildrenContent(typeof(Text))
    .OfType<Text>()
    .Select(t => t.Text)
    .StringConcatenate();

Identity of XML Elements Returned by the LogicalChildrenContent Method

There is one important point to note about the elements that the LogicalChildrenContent method returns: they are the actual elements in the WordprocessingML document, not copies or clones. This means, for example, that if you also want to filter on properties such as the style or formatting of a paragraph or run, it is easy to do, because you can navigate from each returned element to the rest of the document.
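
The following minimal sketch illustrates the point by using LINQ to XML alone. It makes several assumptions: ChildRuns is a simplified, hypothetical stand-in for LogicalChildrenContent that returns only the direct w:r children of a paragraph, the sample paragraph is built by hand, and the bold test ignores the w:val attribute and any formatting inherited from styles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class IdentityDemo
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical stand-in for LogicalChildrenContent: direct w:r children only.
    public static IEnumerable<XElement> ChildRuns(XElement paragraph)
    {
        return paragraph.Elements(w + "r");
    }

    // Hand-built sample: one bold run followed by one plain run.
    public static XElement BuildSampleParagraph()
    {
        return new XElement(w + "p",
            new XElement(w + "r",
                new XElement(w + "rPr", new XElement(w + "b")),
                new XElement(w + "t", "Bold text")),
            new XElement(w + "r",
                new XElement(w + "t", " and plain text.")));
    }

    // Returns the text of runs whose run properties contain a w:b element.
    // Simplification: does not inspect w:val or style inheritance.
    public static string BoldText(XElement paragraph)
    {
        return string.Concat(
            ChildRuns(paragraph)
                .Where(r => r.Elements(w + "rPr").Elements(w + "b").Any())
                .Elements(w + "t")
                .Select(t => (string)t));
    }

    public static void Main()
    {
        XElement p = BuildSampleParagraph();
        XElement firstRun = ChildRuns(p).First();
        // The returned run is the element in the tree, so its Parent is p itself.
        Console.WriteLine(object.ReferenceEquals(firstRun.Parent, p)); // True
        Console.WriteLine(BoldText(p)); // Bold text
    }
}
```

Because the returned elements are live nodes, any change you make to them also changes the document tree.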

Searching Documents for Text

We can now write an example that searches for a specific string in a document. This example works correctly even if the document contains revision tracking, content controls, hyperlinks, or any of the other elements that present challenges when assembling the text of paragraphs. Further, it correctly finds text that spans block-level content containers.

static void IterateContentAndSearch(XElement element, string searchString)
{
    if (element.Name == W.p)
    {
        string paragraphText = element
            .LogicalChildrenContent(W.r)
            .LogicalChildrenContent(W.t)
            .Select(s => (string)s)
            .StringConcatenate();
        if (paragraphText.Contains(searchString))
            Console.WriteLine("Found {0}, paragraph: >{1}<", searchString, paragraphText);
    }
    foreach (XElement item in element.LogicalChildrenContent())
        IterateContentAndSearch(item, searchString);
}

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            IterateContentAndSearch(doc.MainDocumentPart.GetXDocument().Root, "control");
        }
    }
}

The same example using the Open XML SDK 2.0 for Microsoft Office looks as follows.

static void IterateContentAndSearch(OpenXmlElement element, string searchString)
{
    if (element is Paragraph)
    {
        string paragraphText = element
            .LogicalChildrenContent(typeof(Run))
            .LogicalChildrenContent(typeof(Text))
            .OfType<Text>()
            .Select(s => s.Text)
            .StringConcatenate();
        if (paragraphText.Contains(searchString))
            Console.WriteLine("Found {0}, paragraph: >{1}<", searchString, paragraphText);
    }
    foreach (OpenXmlElement item in element.LogicalChildrenContent())
        IterateContentAndSearch(item, searchString);
}

static void Main(string[] args)
{
    byte[] docByteArray = File.ReadAllBytes("Test.docx");
    using (MemoryStream memoryStream = new MemoryStream())
    {
        memoryStream.Write(docByteArray, 0, docByteArray.Length);
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open(memoryStream, true))
        {
            RevisionAccepter.AcceptRevisions(doc);
            IterateContentAndSearch(doc.MainDocumentPart.Document, "control");
        }
    }
}
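
To see why the search assembles the paragraph text before it calls Contains, consider a search string that is split across two runs. The following sketch is a simplified illustration that uses LINQ to XML alone. The hand-built paragraph, the class name, and both helper methods are hypothetical, and plain Elements calls stand in for LogicalChildrenContent.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SplitRunSearchDemo
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-built sample: the word "content" is split across two runs.
    public static XElement BuildSampleParagraph()
    {
        return new XElement(w + "p",
            new XElement(w + "r", new XElement(w + "t", "Text in cont")),
            new XElement(w + "r", new XElement(w + "t", "ent control.")));
    }

    // Naive approach: look for the string inside each w:t element separately.
    public static bool FoundInSingleRun(XElement paragraph, string searchString)
    {
        return paragraph.Elements(w + "r").Elements(w + "t")
            .Any(t => ((string)t).Contains(searchString));
    }

    // Approach used in this article: assemble the paragraph text, then search it.
    public static bool FoundInParagraph(XElement paragraph, string searchString)
    {
        string paragraphText = string.Concat(
            paragraph.Elements(w + "r").Elements(w + "t").Select(t => (string)t));
        return paragraphText.Contains(searchString);
    }

    public static void Main()
    {
        XElement p = BuildSampleParagraph();
        Console.WriteLine(FoundInSingleRun(p, "content control")); // False
        Console.WriteLine(FoundInParagraph(p, "content control")); // True
    }
}
```

Word is free to split text into runs at arbitrary points, so searching run by run can miss matches that a paragraph-level search finds.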

Conclusion

When you develop a program that processes Open XML WordprocessingML, it is often useful to consider just the actual content of the document. This article defines the elements that I consider to contain the logical content of a document. I also define four overloads of an axis method, LogicalChildrenContent.

It is important for simple, robust processing of Open XML WordprocessingML documents to accept tracked revisions. This enables us to write code that disregards more than 40 elements and attributes (including some with complex semantics) that are used to track revisions. Using these axis methods in combination with accepting tracked revisions enables us to write small programs that reliably extract content from Open XML WordprocessingML documents.

Additional Resources

Download code

See the Open XML Developer Center on MSDN for articles, how-to videos, and links to many blog posts. The following links provide important information for getting started with the Open XML SDK 2.0:

Download: Open XML SDK 2.0
Article: Creating Documents by Using the Open XML Format SDK Version 2.0 CTP (Part 1 of 3)
Article: Creating Documents by Using the Open XML Format SDK 2.0 CTP (Part 2 of 3)
Article: Creating Documents by Using the Open XML Format SDK 2.0 CTP (Part 3 of 3)
