Using Nested Content Controls for Data and Content Extraction from Open XML WordprocessingML Documents

Data and content extraction is one of the scenarios where content controls are very useful.  Data extraction is when you are extracting specific numbers or string values from a document.  Content extraction is when you are extracting formatted WordprocessingML tables and paragraphs, and constructing another document from that content.

For example, an oil company may receive a large number of reports from field personnel and vendors.  They want to extract information from these reports in a standardized way, perhaps to build a consolidated report, or to populate a searchable database.  Further, they want to leave the original document alone as much as possible, except to indicate the source of various paragraphs or numbers.  Content controls are a great solution for this problem.  As the oil company receives each report, appropriate personnel open the report, select areas of text, and insert content controls that surround the content, and then submit the document for processing.  Key to this scenario is nesting content controls, which is supported in Word.
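To make the extraction side concrete, here is a minimal sketch in Python that uses only the standard library.  It assumes the main document part (word/document.xml) has already been read out of the package as a string.  The sample markup, the tags "WellReport" and "Production", and the helper names are all hypothetical; they illustrate the idea rather than describe any particular processing system.  In the markup, a content control is a w:sdt element, its tag is stored in w:sdtPr/w:tag, and its content is stored in w:sdtContent.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample: a rich-text control tagged "WellReport" that contains
# a nested control tagged "Production" around the number 1800.
SAMPLE_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:sdt>
      <w:sdtPr><w:tag w:val="WellReport"/></w:sdtPr>
      <w:sdtContent>
        <w:p>
          <w:r><w:t xml:space="preserve">Production was </w:t></w:r>
          <w:sdt>
            <w:sdtPr><w:tag w:val="Production"/></w:sdtPr>
            <w:sdtContent><w:r><w:t>1800</w:t></w:r></w:sdtContent>
          </w:sdt>
          <w:r><w:t xml:space="preserve"> barrels.</w:t></w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>
"""


def control_tag(sdt):
    """Return the value of the control's w:tag element, or None if it has no tag."""
    tag = sdt.find("w:sdtPr/w:tag", NS)
    return tag.get(f"{{{W}}}val") if tag is not None else None


def control_text(sdt):
    """Concatenate every w:t text node inside the control, nested controls included."""
    content = sdt.find("w:sdtContent", NS)
    if content is None:
        return ""
    return "".join(t.text or "" for t in content.iter(f"{{{W}}}t"))


def extract_controls(xml_text):
    """Return (tag, text) pairs for every content control, outer controls first."""
    root = ET.fromstring(xml_text.strip())
    return [(control_tag(sdt), control_text(sdt)) for sdt in root.iter(f"{{{W}}}sdt")]


if __name__ == "__main__":
    for tag, text in extract_controls(SAMPLE_XML):
        print(tag, "=>", text)
```

The outer control yields the whole sentence (content extraction), while the nested control yields just the production number (data extraction).  A real system would likely copy the child elements of w:sdtContent rather than flatten them to text, so that formatting survives.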

Important note: In order to nest content controls, the containing content control must be a rich-text content control. You create one of these using the upper-left button in the Controls section of the Developer tab. Thanks, Darin.

However, the default view of content controls only shows them when the insertion point is in them.  In the following document, there is also a content control around the production number (1800), but we can't see it because the insertion point isn't in it.

The solution to this is to turn on 'Design Mode':

After turning on design mode, the document looks like this:

If you are building a specialized document processing system like this, a good approach would be to create a custom tab on the ribbon.  Users then could insert content controls with various titles/tags with a single click.  If you require that users enter more information about a content control than just the title and tag, you could enhance this user interface by creating a task pane, and allowing users to edit auxiliary information about the content control.  Associating Data with Content Controls provides a proof-of-concept example of this approach.

In your solution, you may want to take specific action when a content control is inserted, deleted, or updated.  For example, you can write an event handler that will be called when the user navigates out of a content control, and then validate the data entered by the user.  There is one behavior of nested content controls that you should be aware of: only the innermost content control receives the 'enter' and 'exit' events.  However, this isn't a problem, as you can easily find the parent of any content control using the Word.ContentControl.ParentContentControl property.  By following that property repeatedly, you can build the list of all content controls that contain the current selection, and then determine whether the selection moved out of a parent content control.
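The bookkeeping described above can be sketched without Word at all.  The following Python model is only an illustration: the Control class is a hypothetical stand-in for a content control, and its parent field plays the role that ParentContentControl plays in the Word object model.  The function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Control:
    """Hypothetical stand-in for a content control.

    `parent` mirrors the role of Word.ContentControl.ParentContentControl.
    """
    tag: str
    parent: Optional["Control"] = None


def containing_chain(control: Optional[Control]) -> List[Control]:
    """Return the control and all of its ancestors, innermost first."""
    chain = []
    while control is not None:
        chain.append(control)
        control = control.parent
    return chain


def exited_controls(previous: Optional[Control],
                    current: Optional[Control]) -> List[Control]:
    """Return the controls that contained the old selection but not the new one.

    `previous` and `current` are the innermost controls before and after the
    selection moved (None means the selection is outside every control).
    """
    still_inside = containing_chain(current)
    return [c for c in containing_chain(previous)
            if not any(c is s for s in still_inside)]


if __name__ == "__main__":
    report = Control("WellReport")
    production = Control("Production", parent=report)
    # Moving from the nested control into its parent exits only the nested one.
    print([c.tag for c in exited_controls(production, report)])
    # Moving out of all controls exits both.
    print([c.tag for c in exited_controls(production, None)])
```

In a real add-in, the exit handler for the innermost control would compute the chain from the control it is given and compare it with the chain for the new selection, then run validation for each control that was left.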

Content control events are associated with the Document object.  You can write event handlers for the following events:

Event: Description

ContentControlAfterAdd: Occurs after adding a content control to a document.

ContentControlBeforeContentUpdate: Occurs before updating the content in a content control, only when the content comes from the Office XML data store.

ContentControlBeforeDelete: Occurs before removing a content control from a document.

ContentControlBeforeStoreUpdate: Occurs before updating the document's XML data store with the value of a content control.

ContentControlOnEnter: Occurs when a user enters a content control.

ContentControlOnExit: Occurs when a user leaves a content control.