How to remove invalid characters in an incoming XML message using a custom pipeline component

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 

NOTE: This article was migrated from the AsiaTech blog.

Date: 2010-6-16 3:49 PM

Original URL: https://blogs.msdn.com/b/asiatech/archive/2010/06/16/remove-invalid-character-in-incoming-xml-message-using-custom-pipeline-component.aspx

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><> 

BizTalk Server is commonly used as a message bus between different platforms and applications. In most scenarios, BizTalk stores and processes messages in XML format. Internally, it relies on the .NET XML libraries to do this, so it follows the W3C XML standard when validating messages and checking their encoding.

However, messages from other platforms or applications do not always strictly follow the W3C XML standard and may contain invalid characters. These characters can cause message processing to fail at various stages in BizTalk, for example in the XML Receive pipeline or during message mapping. The following is a sample error message you may find in the BizTalk event log:

Unable to read the stream produced by the pipeline. Details: '',
hexadecimal value 0x0F, is an invalid character.
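
The same kind of failure can be reproduced outside BizTalk with the .NET XML parser. The following is a minimal sketch, assuming a plain console program. The class and method names (InvalidCharRepro, TryParse) are hypothetical and are not part of BizTalk.

```csharp
using System;
using System.Xml;

public static class InvalidCharRepro
{
    // Returns the parser's error message, or null if the text parses cleanly.
    public static string TryParse(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // U+000F is a control character that XML 1.0 does not allow.
        Console.WriteLine(TryParse("<root>bad\u000Fvalue</root>") ?? "parsed OK");
        Console.WriteLine(TryParse("<root>good value</root>") ?? "parsed OK");
    }
}
```

The first call should report an invalid-character error similar to the one in the event log, while the second parses without error.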

Note: According to the W3C XML specification, the legal characters are those matching the following production. The value 0x0F in the error message above falls outside it.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For more details, please refer to:

https://www.w3.org/TR/REC-xml/#charsets
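
The Char production above can be transcribed almost literally into code. The following is a minimal sketch. The class and method names (XmlCharCheck, IsValidXmlChar) are hypothetical, and the argument is a Unicode code point rather than a UTF-16 code unit.

```csharp
using System;

public static class XmlCharCheck
{
    // Direct transcription of the XML 1.0 Char production.
    // The argument is a Unicode code point, not a UTF-16 code unit.
    public static bool IsValidXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void Main()
    {
        Console.WriteLine(IsValidXmlChar(0x0F));    // False: control character
        Console.WriteLine(IsValidXmlChar(0x41));    // True: 'A'
        Console.WriteLine(IsValidXmlChar(0xD800));  // False: surrogate code point
        Console.WriteLine(IsValidXmlChar(0x1F600)); // True: supplementary plane
    }
}
```

Note that a .NET string stores code points above 0xFFFF as surrogate pairs, so a filter that works on individual char values has to treat a correctly paired surrogate as valid.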

Generally speaking, the proper way to solve this problem is to fix the source system so that it stops sending invalid characters in XML messages. Unfortunately, this is not always possible: the source system may not be under our control, and it may be hard to persuade its owner to spend money on the fix. In that case, one workaround is to write a custom pipeline component that removes the invalid characters before BizTalk processes the message.

The following is a sample implementation of the Execute method of a pipeline component. The sample goes through these stages:

  • Read the original message stream into text
  • Use a regular expression to remove invalid characters
  • Add the Unicode byte order mark
  • Assign the new stream to the message

Note: This sample uses Unicode (UTF-16) encoding. You can use UTF-8 instead, or even implement design-time support for switching between the supported encodings.
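
As a sketch of the UTF-8 alternative mentioned in the note, the byte order mark and the text bytes can be produced by any .NET Encoding instead of being written by hand. The class and method names here (EncodingHelper, ToStreamWithBom) are hypothetical. The sketch assumes that the encoding passed in reports a preamble, which is true of the built-in Encoding.UTF8 and Encoding.Unicode instances.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingHelper
{
    // Encodes the text and prefixes it with the encoding's byte order mark.
    public static MemoryStream ToStreamWithBom(string text, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        byte[] body = encoding.GetBytes(text);
        byte[] buffer = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, buffer, 0, bom.Length);
        Buffer.BlockCopy(body, 0, buffer, bom.Length, body.Length);
        return new MemoryStream(buffer);
    }

    public static void Main()
    {
        using (MemoryStream utf8 = ToStreamWithBom("<a/>", Encoding.UTF8))
        using (MemoryStream utf16 = ToStreamWithBom("<a/>", Encoding.Unicode))
        {
            Console.WriteLine(utf8.Length);  // 7: 3-byte BOM + 4 bytes of text
            Console.WriteLine(utf16.Length); // 10: 2-byte BOM + 8 bytes of text
        }
    }
}
```

Passing the encoding as a parameter is one way to leave room for the design-time switch: the component could expose a property and map its value to an Encoding instance.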

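// Namespaces used by this method
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Microsoft.BizTalk.Component.Interop;
using Microsoft.BizTalk.Message.Interop;

public IBaseMessage Execute(IPipelineContext pc, IBaseMessage inmsg)
{
    IBaseMessagePart bodyPart = inmsg.BodyPart;

    if (bodyPart != null)
    {
        // Read the original message stream into text
        string oldXMLText;
        using (StreamReader sr = new StreamReader(bodyPart.GetOriginalDataStream()))
        {
            oldXMLText = sr.ReadToEnd();
        }

        // Remove invalid characters. A .NET regex \u escape takes exactly four
        // hex digits, so \u10000-\u10FFFF cannot be written directly. Characters
        // above U+FFFF are surrogate pairs in a .NET string: keep well-formed
        // pairs and remove only unpaired surrogates.
        string re =
            @"[^\u0009\u000A\u000D\u0020-\uFFFD]" +
            @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]" +
            @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])";
        string newXMLText = Regex.Replace(oldXMLText, re, "");

        // Add the UTF-16 little-endian byte order mark (0xFF 0xFE)
        byte[] textb = Encoding.Unicode.GetBytes(newXMLText);
        byte[] b = new byte[textb.Length + 2];
        b[0] = 0xFF;
        b[1] = 0xFE;
        Array.Copy(textb, 0, b, 2, textb.Length);

        // Assign the new stream to the message and let the pipeline
        // dispose of it when processing finishes
        Stream newStr = new MemoryStream(b);
        bodyPart.Data = newStr;
        pc.ResourceTracker.AddResource(newStr);
    }

    return inmsg;
}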
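
The character filter can also be exercised on its own, outside BizTalk. The following is a minimal sketch. The class and method names (XmlSanitizer, Clean) are hypothetical. Because a .NET regex \u escape takes exactly four hex digits, the pattern here does not try to write the #x10000-#x10FFFF range directly. Instead it keeps well-formed surrogate pairs, which is how a .NET string stores those characters, and removes only unpaired surrogates.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Matches anything outside the XML 1.0 Char production:
    //   1. code units outside tab, LF, CR and U+0020..U+FFFD
    //   2. a low surrogate that is not preceded by a high surrogate
    //   3. a high surrogate that is not followed by a low surrogate
    private static readonly Regex InvalidXmlChars = new Regex(
        @"[^\u0009\u000A\u000D\u0020-\uFFFD]" +
        @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]" +
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])");

    public static string Clean(string text)
    {
        return InvalidXmlChars.Replace(text, "");
    }

    public static void Main()
    {
        // Contains U+000F (invalid), U+1F600 as a surrogate pair (valid)
        // and a lone high surrogate U+D800 (invalid).
        string dirty = "<a>x\u000Fy\uD83D\uDE00z\uD800</a>";
        string clean = Clean(dirty);
        Console.WriteLine(dirty.Length); // 14
        Console.WriteLine(clean.Length); // 12
    }
}
```

Only the control character and the unpaired surrogate are removed; the surrogate pair survives intact.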

Best Regards,

Bryan Yang