2.2.3.2 Extracting Encapsulated HTML from RTF

The de-encapsulating RTF reader MUST parse the RTF document as specified in [MSFT-RTF]. Before attempting de-encapsulation, the reader MUST first recognize the encapsulated content, as specified in section 2.2.3.1.

To convert text inside RTF correctly, the de-encapsulating RTF reader SHOULD process control words and other information in RTF that affect the interpretation of text runs and the code page of such text runs. For more details about control words and text runs, see [MSFT-RTF]. In particular, the de-encapsulating RTF reader SHOULD use the default code page, as specified in the RTF header, and it SHOULD use the code page information, as specified for each font in the font table. It SHOULD also track changes to the current font and use the appropriate code page for the currently selected font. The de-encapsulating RTF reader MUST skip other parts of the RTF header, as specified in [MSFT-RTF].
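The code page bookkeeping described above can be sketched as follows. This is a minimal illustration, not a normative algorithm: the class name, the method names, and the idea that a separate parser calls them are all hypothetical, and the charset-to-code-page table is a deliberately partial list of commonly used values that should be checked against [MSFT-RTF].

```python
# Partial, illustrative map from \fcharsetN values to Windows code pages.
# Assumption: verify these values against [MSFT-RTF] before relying on them.
CHARSET_TO_CODEPAGE = {0: 1252, 128: 932, 161: 1253, 204: 1251, 238: 1250}


class CodePageTracker:
    """Hypothetical helper that remembers which code page applies to text."""

    def __init__(self, default_codepage=1252):
        self.default_codepage = default_codepage  # from the RTF header
        self.font_codepages = {}                  # font index -> code page
        self.current_font = None                  # last \fN seen, if any

    def on_default_codepage(self, codepage):
        # Called when the header declares the default code page.
        self.default_codepage = codepage

    def on_font_table_entry(self, font_index, charset=None, codepage=None):
        # An explicit code page wins over one derived from the charset.
        if codepage is None:
            codepage = CHARSET_TO_CODEPAGE.get(charset)
        if codepage is not None:
            self.font_codepages[font_index] = codepage

    def on_font_change(self, font_index):
        # Called for every \fN control word in the document body.
        self.current_font = font_index

    def current_codepage(self):
        # Fall back to the default when the font has no known code page.
        return self.font_codepages.get(self.current_font, self.default_codepage)

    def decode(self, raw):
        return raw.decode(f"cp{self.current_codepage()}", errors="replace")
```

Keeping the default code page and the per-font code pages in one object lets the same lookup serve both cases: text for which only the default applies, and text that follows the currently selected font.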

If the de-encapsulating RTF reader encounters an HTMLTAG destination group, it SHOULD ignore any HTMLTagParameter HTML fragments in an HTMLTAG control word. Any CONTENT HTML fragments inside HTMLTAG destination groups MUST be copied to the target HTML document, as follows:

  • Any RTF escapes and RTF control words that represent Unicode characters, as specified in section 2.1.3.1.4.2, MUST be converted to appropriate text and such text MUST be copied to the target HTML document. RTF escapes SHOULD be unescaped and the resulting bytes interpreted in the default RTF code page, as specified in [MSFT-RTF]. Unicode characters produced from Unicode escapes (\uN control word) and other control words SHOULD be interpreted as Unicode characters.

  • Any other RTF control words within a CONTENT HTML fragment inside an HTMLTAG destination group SHOULD be ignored.

Any remaining text within a CONTENT HTML fragment inside an HTMLTAG destination group MUST be copied to the target HTML document. To interpret such text, the de-encapsulating RTF reader MUST use the default RTF code page, as specified in the RTF header. For more details about code page support, see [MSFT-RTF].
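The handling of a CONTENT HTML fragment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete RTF reader: the function name and regular expression are hypothetical, only the \'hh, \uN, \{, \} and \\ escapes are converted, the fallback characters that follow \uN are skipped according to \ucN (default 1) as described in [MSFT-RTF], and surrogate pairs are not combined.

```python
import re

_TOKEN = re.compile(
    rb"\\'([0-9a-fA-F]{2})"        # group 1: \'hh hex escape
    rb"|\\u(-?\d+) ?"              # group 2: \uN Unicode escape
    rb"|\\([a-zA-Z]+)(-?\d+)? ?"   # groups 3, 4: any other control word
    rb"|\\([^a-zA-Z])"             # group 5: control symbol such as \{
    rb"|([^\\{}\r\n]+)"            # group 6: literal text
    rb"|[{}\r\n]"                  # braces and raw line breaks: skipped
)


def decode_htmltag_content(fragment, default_codepage=1252):
    """Hypothetical sketch: convert one CONTENT fragment (bytes) to str."""
    out = []
    pending = bytearray()  # bytes waiting to be decoded in the default code page
    uc = 1                 # number of fallback characters after each \uN
    skip = 0               # fallback characters still to be skipped

    def flush():
        if pending:
            out.append(bytes(pending).decode(f"cp{default_codepage}", "replace"))
            pending.clear()

    for m in _TOKEN.finditer(fragment):
        hex_byte, uni, word, param, symbol, text = m.groups()
        if hex_byte is not None:
            if skip:
                skip -= 1
            else:
                pending.append(int(hex_byte, 16))
        elif uni is not None:
            flush()
            out.append(chr(int(uni) % 0x10000))  # negative N wraps around
            skip = uc
        elif word is not None:
            if word == b"uc" and param is not None:
                uc = int(param)
            # every other control word is ignored
        elif symbol is not None:
            if symbol in b"{}\\":
                if skip:
                    skip -= 1
                else:
                    pending += symbol
        elif text is not None:
            n = min(skip, len(text))
            skip -= n
            pending += text[n:]
    flush()
    return "".join(out)
```

Bytes are buffered and decoded together rather than one at a time so that multibyte code pages, where two consecutive escapes form a single character, decode correctly.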

Outside of an HTMLTAG destination group, the de-encapsulating RTF reader MUST do the following:

  • Ignore and skip any text and RTF control words, other than the \fN control word, that are suppressed by an HTMLRTF control word. The de-encapsulating RTF reader SHOULD track the current font even when the corresponding \fN control word is inside a fragment that is disabled with an HTMLRTF control word.

  • Ignore and skip any standard RTF destination groups that do not produce visible text (such as \colortbl groups), except for the \fonttbl group. The de-encapsulating RTF reader SHOULD process a font table group and at least remember the code page that corresponds to each font.

  • Ignore any ignorable destination groups (that is, groups that start with "\*") other than the HTMLTAG destination group.

  • Copy the remaining content to the target HTML document as follows:

    • Any RTF escapes and RTF control words that represent Unicode characters MUST be converted to appropriate text, and such text MUST be copied to the target HTML document. For a complete list and syntax of such escapes and control words, see [MSFT-RTF]. RTF escapes SHOULD be unescaped and the resulting bytes interpreted in a code page that corresponds to the current font. Unicode characters produced from Unicode escapes (\uN control word) and other control words SHOULD be interpreted as Unicode characters.

    • Any \par and \line RTF control words MUST be converted to CRLF and such CRLF sequences MUST be copied to the target HTML document.

    • Any \tab RTF control words MUST be converted to the horizontal tab (%x09) character, and such characters MUST be copied to the target HTML document.

    • Any other RTF control words SHOULD be ignored.

    • Any remaining text MUST be copied to the target HTML document. Text SHOULD be interpreted in a code page that corresponds to the currently selected font.
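The rules for content outside an HTMLTAG destination group can be sketched as follows. This is a rough illustration under several assumptions: the input is a hypothetical, already-tokenized stream of (kind, value, parameter) tuples with destination groups removed, the function name is invented for this example, and the HTMLRTF control word is assumed to start suppression when it has no parameter or a nonzero parameter and to end it when the parameter is 0.

```python
# Control words that produce text in the target HTML document.
CONTROL_WORD_TEXT = {"par": "\r\n", "line": "\r\n", "tab": "\t"}


def convert_outside_htmltag(tokens, font_codepages, default_codepage=1252):
    """Hypothetical sketch over tokens of the form (kind, value, param).

    kind is "word" (value is the control word name, param its number or
    None) or "text" (value is a bytes run, param is None).
    """
    out = []
    suppressed = False           # inside a span disabled by HTMLRTF
    codepage = default_codepage  # code page of the currently selected font

    for kind, value, param in tokens:
        if kind == "word" and value == "htmlrtf":
            # Assumption: \htmlrtf0 ends suppression, anything else starts it.
            suppressed = param != 0
        elif kind == "word" and value == "f":
            # The current font is tracked even inside suppressed spans.
            codepage = font_codepages.get(param, default_codepage)
        elif suppressed:
            continue
        elif kind == "word":
            # \par and \line become CRLF, \tab becomes a horizontal tab,
            # and every other control word is ignored.
            out.append(CONTROL_WORD_TEXT.get(value, ""))
        elif kind == "text":
            out.append(value.decode(f"cp{codepage}", errors="replace"))
    return "".join(out)
```

Checking the font change before the suppression flag is what keeps the code page correct for text that follows a suppressed span.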