How to: Convert RTF to Plain Text (C# Programming Guide)

Rich Text Format (RTF) is a document format developed by Microsoft in the late 1980s to enable the exchange of documents across operating systems. Both Microsoft Word and WordPad can read and write RTF documents. In the .NET Framework, you can use the RichTextBox control to create a word processor that supports RTF and enables a user to apply formatting to text in a WYSIWYG manner.

You can also use the RichTextBox control to programmatically remove the RTF formatting codes from a document and convert it to plain text. You do not need to embed the control in a Windows Form to perform this kind of operation.

To use the RichTextBox control in a project

  1. Add a reference to System.Windows.Forms.dll.

  2. Add a using directive for the System.Windows.Forms namespace (optional).
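With both steps in place, the conversion can be wrapped in a small helper. The following is a minimal sketch, not part of the original sample: the class name RtfConverter and the method name ToPlainText are hypothetical, chosen only for illustration. It relies on the RichTextBox.Rtf and RichTextBox.Text properties and disposes the control when the conversion is done.

```csharp
using System.Windows.Forms;

// Hypothetical helper (name chosen for illustration only).
static class RtfConverter
{
    // Returns the plain-text content of an RTF string.
    public static string ToPlainText(string rtf)
    {
        // RichTextBox is a control and implements IDisposable, so the
        // using statement releases its resources after the conversion.
        using (RichTextBox box = new RichTextBox())
        {
            box.Rtf = rtf;     // Load the RTF markup into the control.
            return box.Text;   // Read the content back without formatting.
        }
    }
}
```

A caller would pass the contents of an RTF file, for example RtfConverter.ToPlainText(System.IO.File.ReadAllText("test.rtf")).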

Example

The following example converts a sample RTF file to plain text. The file contains RTF formatting (such as font information), Unicode characters written as \u escapes, and extended ASCII characters written as \' hexadecimal escapes. The example code opens the file, displays the raw RTF in a MessageBox, passes the content to a RichTextBox as RTF, retrieves the content as text, displays the text in a second MessageBox, and writes the text to a file in UTF-8 format.

The MessageBox and the output file contain the following text:

The Greek word for "psyche" is spelled ψυχή. The Greek letters are encoded in Unicode.
These characters are from the extended ASCII character set (Windows code page 1252):  âäӑå

// Use Notepad to save the following RTF code to a text file in the same folder as
// your .exe file for this project. Name the file test.rtf.
/*
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}
{\f1\fnil\fprq1\fcharset0 Courier New;}{\f2\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green128\blue0;\red0\green0\blue0;}
{\*\generator Msftedit 5.41.21.2508;}
\viewkind4\uc1\pard\f0\fs20 The \i Greek \i0 word for "psyche" is spelled \cf1\f1\u968?\u965?\u967?\u942?\cf2\f2 . The Greek letters are encoded in Unicode.\par
These characters are from the extended \b ASCII \b0 character set (Windows code page 1252):  \'e2\'e4\u1233?\'e5\cf0\par }
*/ 
class ConvertFromRTF
{
    static void Main()
    {
        // If your RTF file isn't in the same folder as the .exe file for the project,  
        // specify the path to the file in the following assignment statement.  
        string path = @"test.rtf";

        // Create the RichTextBox. (Requires a reference to System.Windows.Forms.)
        System.Windows.Forms.RichTextBox rtBox = new System.Windows.Forms.RichTextBox();

        // Get the contents of the RTF file. When the contents of the file are   
        // stored in the string (rtfText), the contents are encoded as UTF-16.  
        string rtfText = System.IO.File.ReadAllText(path);

        // Display the RTF text. This should look like the contents of your file.
        System.Windows.Forms.MessageBox.Show(rtfText);

        // Use the RichTextBox to convert the RTF code to plain text.
        rtBox.Rtf = rtfText;
        string plainText = rtBox.Text;

        // Display the plain text in a MessageBox because the console can't   
        // display the Greek letters. You should see the following result:  
        //   The Greek word for "psyche" is spelled ψυχή. The Greek letters are
        //   encoded in Unicode. 
        //   These characters are from the extended ASCII character set (Windows 
        //   code page 1252): âäӑå
        System.Windows.Forms.MessageBox.Show(plainText);

        // Output the plain text to a file, encoded as UTF-8. 
        System.IO.File.WriteAllText(@"output.txt", plainText);
    }
}

RTF files store text as 8-bit characters. Characters outside the ASCII range are written as escape sequences: \'hh for a byte that is interpreted in a specified code page (such as Windows code page 1252), and \uN for a Unicode character. Because the RichTextBox.Text property is of type string, the converted characters are encoded as UTF-16. Any extended ASCII characters and Unicode characters from the source RTF document are correctly encoded in the text output.
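To make the two kinds of escapes concrete, the following sketch decodes the values used in test.rtf by hand. It is an illustration only and is not how RichTextBox is implemented. The class and method names are hypothetical. The sketch assumes the .NET Framework, where code page 1252 is available through Encoding.GetEncoding; on .NET Core and later, the code pages provider may need to be registered first. It also ignores the negative values that RTF uses for code units above 32767.

```csharp
using System;
using System.Text;

// Hypothetical demo class (names chosen for illustration only).
static class RtfEscapeDemo
{
    // \uN: N is the decimal value of a UTF-16 code unit.
    public static char FromUnicodeEscape(int codeUnit)
    {
        return (char)codeUnit;
    }

    // \'hh: each hh is one byte, interpreted in the given code page.
    public static string FromHexEscapes(byte[] bytes, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }

    static void Main()
    {
        // \u968?\u965?\u967?\u942? from test.rtf.
        int[] greek = { 968, 965, 967, 942 };
        StringBuilder word = new StringBuilder();
        foreach (int n in greek)
        {
            word.Append(FromUnicodeEscape(n));
        }

        // \'e2\'e4\'e5 from test.rtf, decoded with Windows code page 1252.
        string accented = FromHexEscapes(new byte[] { 0xE2, 0xE4, 0xE5 }, 1252);

        // Compare against the expected code points instead of printing the
        // letters, because the console might not display them.
        Console.WriteLine(word.ToString() == "\u03C8\u03C5\u03C7\u03AE"); // True
        Console.WriteLine(accented == "\u00E2\u00E4\u00E5");              // True
    }
}
```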

If you use the File.WriteAllText(String, String) overload to write the text to disk, the text is encoded as UTF-8 without a byte order mark (BOM).
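Some programs rely on a byte order mark to recognize a UTF-8 file. If the output file needs one, a possible approach is the File.WriteAllText overload that takes an Encoding argument. The following sketch assumes that requirement. The class and method names are hypothetical.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (names chosen for illustration only).
static class PlainTextWriter
{
    // Writes the text as UTF-8 and starts the file with the
    // three-byte UTF-8 byte order mark (EF BB BF).
    public static void WriteUtf8WithBom(string path, string text)
    {
        // Passing true asks UTF8Encoding to emit the byte order mark.
        File.WriteAllText(path, text, new UTF8Encoding(true));
    }
}
```

In the example above, the final statement would become PlainTextWriter.WriteUtf8WithBom(@"output.txt", plainText).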

See Also

Reference

System.Windows.Forms.RichTextBox

Other Resources

Strings (C# Programming Guide)