Thoughts on Legacy Character Sets

11/03/2009

One of the things I have taken from the IE XSS Filter project is a healthy fear of legacy character sets. If you've followed Chris Weber, Scott Stender, or Yosuke Hasegawa’s work, you know that even Unicode is... interesting. But at least in the Unicode world there are standards and evolving best practices dictating how clients and servers should behave.

How about the rest of the character sets commonly used on the web today? For example, if a web server produces ISO 2022 responses...
- How are escape sequences handled on input to the application?
- How are escape sequences handled in various components through which the input travels?
- How are escape sequences handled in server-side filtering code?
- How are escape sequences handled at any of the various browser clients?
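To make the escape-sequence questions above concrete, here is a minimal sketch in Python. It assumes a hypothetical byte-level filter (`naive_byte_filter`, not taken from any real product) that runs before decoding, and it uses Python's `iso2022_jp` codec as a stand-in for whatever component eventually decodes the response. Real browsers and server platforms may treat the same bytes differently, which is exactly the concern.

```python
# ESC ( B is the ISO-2022-JP escape sequence that designates ASCII.
# While the stream is already in ASCII mode it changes nothing visible.
ESC_TO_ASCII = b"\x1b(B"


def naive_byte_filter(raw: bytes) -> bool:
    """Hypothetical filter: return True if the raw bytes look safe."""
    return b"<script" not in raw.lower()


def decode_iso2022jp(raw: bytes) -> str:
    """Stand-in for a downstream component that decodes ISO-2022-JP."""
    return raw.decode("iso2022_jp")


# The escape sequence splits the byte pattern the filter is looking for.
payload = b"<scr" + ESC_TO_ASCII + b"ipt>"

passes_filter = naive_byte_filter(payload)  # filter inspects raw bytes
decoded = decode_iso2022jp(payload)         # decoder consumes the escape

if __name__ == "__main__":
    print("filter accepted payload:", passes_filter)
    print("decoded text:", decoded)
```

The point of the sketch is only that the filter and the decoder disagree about what the bytes mean. A filter that runs on decoded text, using the same decoder as the eventual consumer, would not have this particular blind spot.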

You may ask the same questions about invalid multi-byte sequences, various character set eccentricities, etc. Character set handling may not be readily apparent at the highest levels of the stack, but transformations between character sets are actually common at the platform level on both the client and server.
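One defensive posture for invalid multi-byte sequences is to decode strictly at the input boundary and reject anything that does not decode cleanly, rather than letting each component guess. The following is a sketch under that assumption. The helper name `decode_or_reject` is hypothetical, and Shift_JIS is used only as an example of a legacy multi-byte character set.

```python
from typing import Optional


def decode_or_reject(raw: bytes, charset: str) -> Optional[str]:
    """Decode strictly; return None instead of guessing at bad input."""
    try:
        return raw.decode(charset, errors="strict")
    except UnicodeDecodeError:
        # Invalid or truncated multi-byte sequence: refuse the input.
        return None


if __name__ == "__main__":
    # 0x81 is a Shift_JIS lead byte; 0x22 (a double quote) is not a
    # valid trail byte, so this pair is an invalid sequence.
    print(decode_or_reject(b'\x81"', "shift_jis"))
    # A lead byte with nothing after it is a truncated sequence.
    print(decode_or_reject(b"\x81", "shift_jis"))
    # Well-formed input decodes normally.
    print(decode_or_reject("表".encode("shift_jis"), "shift_jis"))
```

Rejecting outright avoids the question of whether a lenient decoder drops the lead byte alone or swallows the byte after it as well, which is one of the places where components can disagree.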

The answers to the questions above have a real impact on an application's ability to defend itself from XSS. To prevent XSS, developers must authoritatively block every XSS attack vector, including the more complicated constructs that become useful as vectors depending on the injection context. For anyone who has written code intended to prevent XSS, this is the commonly understood problem space. But character sets essentially open up a second dimension to the attack surface.

That is, developers must manage their untrusted data from its initial appearance in input through to its ultimate presentation to the victim user in an HTTP response. So the effectiveness of any filtering is not simply a matter of handling all of the applicable attack vectors that may exist in any given browser client. It also depends on whatever character set handling may have occurred before or after the point at which filtering takes place.

Specifications for legacy character sets tend to be vague, if they exist at all. Undefined behaviors have existed for so long that the consequences of seemingly benign code tweaks can be virtually untestable. Code changes involving character sets can break old documents in subtle ways.

The differences in how components handle a given character set are one source of vulnerability. But besides that, character set eccentricities may be well-defined and implemented consistently at the client and server, yet still enable vulnerabilities. Here are some examples where the complexities around character set handling have led to vulnerabilities.
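As an illustration of a well-defined eccentricity (chosen for this sketch, not drawn from the author's own list of examples): in Shift_JIS, the second byte of some double-byte characters is 0x5C, the same value as the ASCII backslash. The snippet below assumes a hypothetical byte-level check, `has_backslash_byte`, and compares it with a check on the decoded text.

```python
def has_backslash_byte(raw: bytes) -> bool:
    """Hypothetical byte-level check for the ASCII backslash (0x5C)."""
    return b"\\" in raw


def has_backslash_char(raw: bytes, charset: str) -> bool:
    """Character-level check performed after decoding."""
    return "\\" in raw.decode(charset)


# U+8868 encodes in Shift_JIS as the two bytes 0x95 0x5C.
encoded = "表".encode("shift_jis")

byte_level = has_backslash_byte(encoded)               # sees 0x5C
char_level = has_backslash_char(encoded, "shift_jis")  # sees one kanji

if __name__ == "__main__":
    print("encoded bytes:", encoded.hex())
    print("byte-level check finds a backslash:", byte_level)
    print("character-level check finds a backslash:", char_level)
```

Every conforming Shift_JIS implementation agrees on this encoding, so nothing here is a bug. The trouble arises only when code that reasons about bytes (escaping, quoting, filtering) sits next to code that reasons about characters.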

What do you think? It would be very interesting to see an analysis comparing popular server-side web platforms, other server-side components (SQL servers, etc.), and client-side technology in terms of how they handle the various character set issues across a wide range of supported character sets.

So... Would anyone not like to live in an all-Unicode world?

Here are some related resources from Shawn Steele, Windows / .NET globalization guru: blogs.msdn.com/shawnste/pages/code-pages-unicode-encodings.aspx
