Copy Paste HTML From MS Word: IE's DHTML Editing Control (in a .NET WinApp)

When you copy and paste from MS Word, the HTML it generates is really messy and can't be used verbatim. This has been a pain of mine and many others. I've found that many third-party controls, and some client-side blogging tools (like BlogJet), have a miraculous way of converting messy MS Word HTML into something that works well (it displays correctly, though it is still bloated). The question is how?!?

Deducing a Solution
Try opening this little "Hello World" MS Word Doc, select all (Ctrl-A) and copy, and view the clipboard contents (C# source)… notice the "HTML Format" contents?  Now paste into FreeTextBox, RichTextBox, PowerPack's, or your own blogging tool.  Switch to HTML view and notice the HTML has been nicely transformed!  Each of these retail tools has transformed it into practically the same result.
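If you would rather inspect that clipboard entry in code, here is a minimal sketch. It relies on the fact that the "HTML Format" payload is a short text header followed by the HTML, with the copied portion wrapped in <!--StartFragment--> and <!--EndFragment--> comments. The names ClipboardHtml and ExtractFragment are hypothetical and are not part of the sample apps. The sketch locates the comment markers and ignores the byte offsets in the header.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original sample apps.
static class ClipboardHtml
{
    const string StartMarker = "<!--StartFragment-->";
    const string EndMarker = "<!--EndFragment-->";

    // Returns the markup between the fragment markers, or null when the
    // payload is empty or does not contain both markers.
    public static string ExtractFragment(string cfHtml)
    {
        if (string.IsNullOrEmpty(cfHtml)) return null;

        int start = cfHtml.IndexOf(StartMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += StartMarker.Length;

        int end = cfHtml.IndexOf(EndMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return cfHtml.Substring(start, end - start);
    }
}
```

In a WinForms app running on an STA thread, the input string could come from Clipboard.GetText(TextDataFormat.Html). What you get back is Word's raw, untransformed markup, which makes a handy "before" picture to compare against the editors' output.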

This led me to believe these are all using the same base control.  Sure enough, IE 5.0 introduced a DHTML Editing Control that does the work.

Use it Yourself
So, how can you use this in your own app? It is actually quite easy. You can drop a new .NET 2.0 WebBrowser control in your app and put it into Design Mode. Here is an app that demonstrates this, IE DHTML Editing Control Example (C# source). Try the sample app with the same experimentation step above and you’ll see it's the same control.

The ShDocVw ActiveX and MSHTML DOM are extensive COM objects and only a subset of members are wrapped in the .NET 2.0 control, so getting to the underlying ActiveX control is necessary.

  1. Put a WebBrowser control on a form and call it "web"
  2. Add a project reference to the COM library "Microsoft HTML Object Library"
  3. Use the code below to initialize into design mode.

// Load a blank page so the MSHTML document object gets created
web.Navigate("about:blank");

// Let the message loop process the navigation before touching the document
Application.DoEvents();

// Turn ON Design Mode
((mshtml.HTMLDocument) web.Document.DomDocument).designMode = "On";
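If you prefer not to pump messages with Application.DoEvents(), one possible variation is to wait for the control's DocumentCompleted event before switching modes. This is only a sketch of that alternative, not the code used in the sample apps. EditorForm and OnDocumentCompleted are hypothetical names, and the project still needs the "Microsoft HTML Object Library" reference from step 2.

```csharp
using System.Windows.Forms;

// Hypothetical form for illustration; assumes a COM reference to
// "Microsoft HTML Object Library" (the mshtml namespace).
class EditorForm : Form
{
    private readonly WebBrowser web = new WebBrowser();

    public EditorForm()
    {
        web.Dock = DockStyle.Fill;
        Controls.Add(web);

        // Switch to design mode only once the blank document has loaded.
        web.DocumentCompleted += OnDocumentCompleted;
        web.Navigate("about:blank");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Unsubscribe first so a later document load does not repeat the switch.
        web.DocumentCompleted -= OnDocumentCompleted;
        ((mshtml.HTMLDocument)web.Document.DomDocument).designMode = "On";
    }
}
```

The handler runs on the UI thread, so the cast and the designMode assignment are the same as in the snippet above. Only the timing differs.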

Fixing Word's HTML
This technique actually converts almost any HTML block in the clipboard (from IE, Word, Excel, PowerPoint, etc.). It does not save embedded images. IE apparently takes the styles that may be defined in a <style> block and moves the styling onto the HTML elements themselves, so that the resulting block of HTML renders correctly without depending on style sheets or style blocks. It isn’t “intelligent” about the kind of formatting involved. For example, it won’t convert a bulleted list from Word into <ul><li></li></ul> code, but it will preserve the visual formatting. Use the IE DHTML Editing Control Example (C# source) to play with it yourself and see just what HTML is rendered.

You can then use this feature of the IE control to convert blocks of HTML from MS Word.  Here is a small simple app, Convert Clipboard HTML (C# source), that does just this.   It reads the HTML contents of the clipboard, pushes it through the IE DHTML control, and puts the resulting HTML code back into the clipboard.  You can then paste the resulting HTML code into your HTML editor.  In the case of Windows Live Writer, this would be the “HTML Code” view.  Here’s how to make this app:

  1. Create a Windows Application
  2. Remove the default Form
  3. Add a project reference to the COM libraries “Microsoft HTML Object Library” and “Microsoft Internet Controls”
  4. Modify Program.cs as follows:

using System;
using System.Windows.Forms;

static class Program
{
   // Clipboard and ActiveX access require a single-threaded apartment
   [STAThread]
   static void Main(string[] args)
   {
      // Get a web browser
      WebBrowser web = new WebBrowser();

      // Load the MSHTML component into the web browser control
      web.Navigate("about:blank");
      Application.DoEvents();

      // Change into design mode
      ((mshtml.HTMLDocument) web.Document.DomDocument).designMode = "On";
      Application.DoEvents();

      // Paste the clipboard contents into the control
      object o = System.Reflection.Missing.Value;
      ((SHDocVw.WebBrowser)web.ActiveXInstance).ExecWB(
         SHDocVw.OLECMDID.OLECMDID_PASTE,
         SHDocVw.OLECMDEXECOPT.OLECMDEXECOPT_DODEFAULT, ref o, ref o);
      Application.DoEvents();

      // Extract the resulting HTML and put it back on the clipboard
      Clipboard.SetText(web.Document.Body.InnerHtml);

      // Inform the user the operation has completed, unless /nomsg was passed
      if (args.Length == 0 || !args[0].Equals("/nomsg",
         StringComparison.InvariantCultureIgnoreCase))
         MessageBox.Show("The contents of the clipboard have " +
            "been converted into an HTML block.\n\n", "Convert Clipboard HTML",
            MessageBoxButtons.OK, MessageBoxIcon.Information);
   }
}

Other Methods?
There are many other tricks for cleaning up MS Word's HTML, including regular expressions, HTML Tidy, 3rd party tools, Office 2000 HTML Filter 2.0, or using MS Word 2007. However, these are targeted at the HTML page as a whole. This post is about pulling a segment of a Word Doc via a clipboard copy. Using this IE control is the only consistently effective way I've found to do this.
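For contrast, here is roughly what the regular-expression route can look like. This is a deliberately naive sketch and not something taken from any of the tools above. WordHtmlRegexCleaner and Clean are hypothetical names, and the patterns only assume that Word's markup contains <o:p> tags and "mso-" style declarations.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the patterns are assumptions about
// typical Word markup, not a complete or robust cleaner.
static class WordHtmlRegexCleaner
{
    public static string Clean(string html)
    {
        // Drop Office namespace tags such as <o:p> and </o:p>.
        html = Regex.Replace(html, @"</?o:p[^>]*>", "", RegexOptions.IgnoreCase);

        // Drop Word-specific "mso-..." declarations (name, colon, value, optional semicolon).
        html = Regex.Replace(html, @"\s*mso-[^:;""']+:[^;""']*;?", "", RegexOptions.IgnoreCase);

        // Drop style attributes that are now empty.
        html = Regex.Replace(html, @"\s+style=([""'])\s*\1", "", RegexOptions.IgnoreCase);

        return html;
    }
}
```

Note that this only strips markup. It does nothing to pull rules out of a <style> block and apply them to the elements, and the "mso-" pattern will trip over values that contain quotes or text that happens to look like a declaration.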

Vista Support
The control has been removed from Vista.  However, you can install it separately.

Tools & Source Code
Here are a few of the little tools I made for this blog post. They’ll install under a program group called “Noah Coad” and can be uninstalled from “Add & Remove Programs”.

Other Resources