Data Transformations & Data Security

Over a decade ago I posted "Code Pages and Security Issues" - which touches on an interesting case of security checks when the data is transformed.

In short: If you run a security check on your data, and then do some sort of transformation on that data, then your data may no longer be secure.  In math speak, security tests and transformations are not "commutative."  Their order is important.

In other words:  Do whatever transformations you need to do, and THEN do your security check.
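To make the ordering concrete, here is a minimal Python sketch.  The looks_safe check and the sample payload are made up for illustration; only urllib.parse.unquote is a real library call.

```python
from urllib.parse import unquote


def looks_safe(text: str) -> bool:
    """Hypothetical security check: reject anything containing '<'."""
    return "<" not in text


raw = "%3Cscript%3E"  # what arrives on the wire

# Wrong order: check the raw form, then transform it.
passed_early_check = looks_safe(raw)
decoded = unquote(raw)

# Right order: transform first, then check the form that will be used.
passed_late_check = looks_safe(decoded)

print(passed_early_check, repr(decoded), passed_late_check)
# prints: True '<script>' False
```

The early check passes because the dangerous character does not exist yet; it only appears once the transformation has run.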

There are a bajillion things we do in computing that cause data to be transformed to some other shape or view.  Very few of those are truly 1:1 operations, and therein lies the problem.  The things that my input represents may not end up being represented the same way in my output.

Indeed, if I have an input and an output string, then I must be expecting that the information is being transformed! (otherwise why would I be calling the function?)

There are lots of places that this can happen.  Some of them are more obvious than others:

  • Codepages (what I focused on last time)
  • Escape and Quoted Sequences
  • Named Entities
  • Casing conversions
  • String Normalizations
  • Parsing and Tokenizing
  • XML or other markup
  • Clients and Servers
  • Allow/Deny Lists
  • Starting in the Middle
  • Are they Klingon?
  • Everything I forgot (or can't know)


Codepages (aka Encodings) aren't transitive.

I tried to think of one that is transitive for this example, but even the UTF transforms have different spaces of illegal codepoints.  For example, a high surrogate in UTF-16 cannot be encoded in UTF-8 or UTF-32 without a matching low surrogate.  If a bad byte creeps into your sequence, either accidentally or maliciously, then you're at the whim of the decoder's error handling, which could be dropping the unknown byte(s), returning replacement character(s), or even allowing the illegal sequence if the library missed that test case.
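As a rough illustration of how much the decoder's error handling matters, the sketch below runs one malformed byte string through Python's built-in UTF-8 codec with three different error handlers.  The payload is invented for the example; other libraries may behave differently again.

```python
raw = b"<scr\xffipt>"  # 0xFF never appears in well-formed UTF-8

# A byte-level check performed before decoding sees nothing suspicious.
early_check_passes = b"<script>" not in raw

# Same bytes, three possible outcomes depending on the error handler.
dropped = raw.decode("utf-8", errors="ignore")    # bad byte silently removed
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

try:
    raw.decode("utf-8")  # "strict" is Python's default handler
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

# A lone high surrogate is refused by Python's strict UTF-8 encoder.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

print(repr(dropped), repr(replaced), strict_rejected, lone_surrogate_rejected)
```

With the "ignore" handler the bad byte vanishes and the string the early check was looking for is reassembled after the check has already passed.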

Every other codepage is far worse.  The set of characters allowed in Windows-1252 and UTF-8 differs wildly.  Fortunately UTF-8 is a superset; however, some APIs allow "best fit" mappings or other behavior.  The conversion is not guaranteed to round trip, and whether characters are dropped, replaced with ?, or best-fit can vary.  Even the meaning of the various codepoints can change depending on the OS platform or library being used.  Larger encodings tend to be worse at that.
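A small sketch of the dropped-versus-replaced behavior, using Python's cp1252 codec and a made-up sample string.  Python's codec does not do best-fit mapping, so that case is not shown here.

```python
text = "caf\u00e9 \u03a9"  # 'café Ω'; Ω has no Windows-1252 code

replaced = text.encode("cp1252", errors="replace")  # Ω becomes b'?'
dropped = text.encode("cp1252", errors="ignore")    # Ω disappears

# Decoding the result does not give back the original string.
round_trip = replaced.decode("cp1252")

print(replaced, dropped, round_trip == text)
```

Either way the output no longer represents the input, and which of the two you get depends on an argument that the calling code may not even set explicitly.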

Some codepages depend on state from previous bytes.  Jumping into the middle of the stream can cause misinterpretation of the data, as can a lost byte or corrupted bit.  Sometimes even breaking the input buffers can lead to boundary issues if a lead byte is at the end of one block and the trail byte at the beginning of another.
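The buffer boundary problem can be sketched with UTF-8, where a two-byte character is split across two blocks.  The chunk contents are invented; codecs.getincrementaldecoder is the standard-library way to carry state between blocks.

```python
import codecs

e_acute = "\u00e9".encode("utf-8")  # two bytes: b'\xc3\xa9'

# The lead byte ends one block and the trail byte starts the next.
chunks = [b"caf" + e_acute[:1], e_acute[1:] + b"!"]

# Naive: decode each block on its own, so each half looks like garbage.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Stateful: an incremental decoder remembers the pending lead byte.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = "".join(decoder.decode(c) for c in chunks)
streamed += decoder.decode(b"", final=True)  # flush any leftover state

print(repr(naive), repr(streamed))
```

The two results differ even though the bytes are identical, so a check that reads the data block by block may see a different string than the consumer that reads it as a stream.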

Escape and Quoted Sequences

There are a ton of ways to escape various sequences, particularly when trying to "tunnel" a larger character set through a smaller channel, like Unicode characters being forced into an ASCII mechanism.  Or the % mechanism for URL queries or the &#nnnn; Unicode encoding for XML/HTML.  If the security check doesn't decode these sequences, then it is trivial to sneak a malicious string through the escaped string and into the target data stream.  Worse, it is easy for the security test to miss a detail of the conversion.

I included quotes in here as a special type of escape sequence.  They can be particularly troublesome for some environments.
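One detail that is easy to miss is how many times the data gets decoded.  In this sketch the check decodes once, while a later layer (assumed here purely for illustration) decodes again.

```python
from urllib.parse import unquote

payload = "%253Cscript%253E"  # '%25' is itself an escaped '%'

once = unquote(payload)   # what a check that decodes one time sees
twice = unquote(once)     # what a consumer that decodes again sees

check_passes = "<" not in once

print(repr(once), repr(twice), check_passes)
# prints: '%3Cscript%3E' '<script>' True
```

The check did decode the escapes, and it still missed the payload because it did not match the number of decoding passes made downstream.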

Named Entities

These are sort of like a special escape sequence.  &amp; or other tokens representing a special character.  I'm calling them out because they might be easy to miss when trying to pre-parse a sequence.  The exact sequences supported can also vary wildly, so a sequence from one system may be misunderstood on another.  Security checks can't just look for the start and end of the sequence, because who knows what the decoder does with bad sequences?

Worse, sometimes these things can be nested, so an escape sequence might tokenize a named entity, which is then processed later (or vice versa).
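Here is a sketch of both problems using Python's html.unescape, which follows the HTML5 rules and so accepts some entities even without the closing semicolon.  The payloads are invented, and other decoders may well treat them differently, which is rather the point.

```python
import html
from urllib.parse import unquote

# "Bad" sequences: no terminating ';', yet this decoder still expands them.
unterminated = html.unescape("&ltscript&gt")

# Nesting, entity inside entity: each pass peels off one layer.
first_pass = html.unescape("&amp;lt;")
second_pass = html.unescape(first_pass)

# Nesting, entity hiding a % escape: nothing to find until entities expand.
hidden = "&#37;3Cscript&#37;3E"
after_entities = html.unescape(hidden)
after_percent = unquote(after_entities)

print(repr(unterminated), repr(first_pass), repr(second_pass))
print(repr(after_entities), repr(after_percent))
```

A check that scans for "&...;" pairs would skip the unterminated form entirely, and a check that only understands % escapes would find none in the last payload.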

Casing Conversions

Note that there are other conversions that act similarly to casing but might be unique to non-Latin languages, such as the Kana or width mappings in Japanese, or digit mappings for many scripts.

Casing is particularly tricky as many systems have different casing tables.  Additionally, the linguistic behavior of some casing operations can vary depending on the language (for example, the Turkish I).  An attacker might be able to sneak an İ through a system where i is important if the linguistic operations differ.  Client validations could use different "rules" than a server side, etc.

If a system depends on, say, lower case matching, for sensitive strings (password?) and the attacker can sneak an upper case letter through one layer due to differing rules, they may be able to find a weakness in the system.
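A minimal sketch of two layers with differing casing rules, using Python's built-in (language-independent) case mappings.  The reserved name "admin" and the idea that one layer lowercases while another uppercases are assumptions made up for the example.

```python
name = "adm\u0131n"  # contains U+0131, LATIN SMALL LETTER DOTLESS I

# Layer 1 (hypothetical): lowercase, then compare with the reserved name.
passes_lowercase_check = name.lower() != "admin"

# Layer 2 (hypothetical): uppercase to build a case-insensitive key.
stored_key = name.upper()

# Other characters that land on plain ASCII letters (or don't) when cased:
kelvin_lower = "\u212a".lower()   # KELVIN SIGN lowercases to 'k'
dotted_lower = "\u0130".lower()   # İ lowercases to 'i' + U+0307, not 'i'

print(passes_lowercase_check, stored_key, repr(kelvin_lower), len(dotted_lower))
```

The name passes the lowercase comparison but collides with ADMIN once the second layer uppercases it.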

String Normalization

We often need to normalize strings into some sort of sane form so that the application can deal with them.  This can be something like Unicode Normalization, or other manipulations that get a variety of inputs into a manageable form.  Doing a ToLower() operation on a string before manipulating it is one kind of normalization.  Removing escape sequences could be another.  Rearranging tokens into a predefined order might be another.

Unicode normalization provides known mappings from various sequences to other interesting (to an attacker) forms and is an easy target.  If different layers handle the normalization differently, there may be an opportunity to sneak something through.

This can be particularly troublesome retrofitting code paths that expected only ASCII data to now handle Unicode.  Early ASCII checks on a Unicode string are trivial to bypass when Unicode normalization occurs later in the process.  Since most of the interesting tokens we use when parsing are in the ASCII range (?, =, ., @, /, etc.) this can be a big gap, particularly
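The ASCII-check-then-normalize gap can be sketched with unicodedata.normalize and the NFKC form, which folds fullwidth characters onto their ASCII counterparts.  The ascii_check function and the payloads are hypothetical.

```python
import unicodedata


def ascii_check(text: str) -> bool:
    """Hypothetical early check written when inputs were ASCII only."""
    return "<" not in text and "../" not in text


markup = "\uff1cscript\uff1e"     # fullwidth '<' and '>'
traversal = "\uff0e\uff0e\uff0fetc"  # fullwidth '.', '.', '/'

early_ok = ascii_check(markup) and ascii_check(traversal)

# A later layer normalizes, and the ASCII tokens appear.
markup_nfkc = unicodedata.normalize("NFKC", markup)
traversal_nfkc = unicodedata.normalize("NFKC", traversal)

late_ok = ascii_check(markup_nfkc) and ascii_check(traversal_nfkc)

print(early_ok, repr(markup_nfkc), repr(traversal_nfkc), late_ok)
# prints: True '<script>' '../etc' False
```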

Parsing and Tokenizing

Eventually we probably have to figure out what that input "meant."  Chopping up a string into its component bits is a form of transformation.  Of course, this is one that most applications eventually need to be able to handle at some point.  If I chop up a string into 10 key/value pairs, my application is going to need some sort of assurance that those 10 keys/values are as expected.

Tricks here could be to use escape sequences, abuse delimiters, buffer sizes, or other mechanisms to confuse the tokenizer.  If I can manage to confuse the tokenizer, I may be able to cause malformed tokens or malformed values with interesting side effects.

Typically an application is going to try to ensure that the stream to be parsed is sane prior to any parsing effort, but it shouldn't necessarily assume that any critical outputs are legal.  Maybe I disallow "password" in the parser, but if that's a valid key in a key/value pair and someone can trick the parser, then my app could get confused.
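One simple way two parsers can disagree is a repeated key.  In this sketch the "role" key and the two consumers are invented; parse_qs and parse_qsl are the real urllib.parse functions.

```python
from urllib.parse import parse_qs, parse_qsl

query = "role=user&role=admin"

# Consumer A (hypothetical): keeps every value and trusts the first one.
first_wins = parse_qs(query)["role"][0]

# Consumer B (hypothetical): builds a plain dict, so the last value wins.
last_wins = dict(parse_qsl(query))["role"]

print(first_wins, last_wins)
# prints: user admin
```

If the security check sits in consumer A and the business logic in consumer B, the value that was validated is not the value that gets used.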

XML and Other Markup

This is basically like parsing/tokenizing, but I divided it because this probably has more complex structure than a "simple" key/value query string.  Here a misplaced escape token or mapping might be able to trick the markup into confusing the layers, perhaps injecting a parent that makes a critical block vanish into a hidden child.  The richer the markup mechanism, the more opportunities there are to attack the surface, and if that markup is passed through multiple libraries, perhaps having different transformations, then someone may be able to take advantage of the discrepancy.

Allow & Deny Access Lists

A bulletin board I play with had some functionality like this, granting or denying permission based on lists of information.  If someone can sneak an alternate form of a string past a "Deny Access" list, then they could gain unintended access.  This could be as simple as alternate forms of bad-language in a post filtering list, or manipulation of a search index by providing unexpected forms with similar meanings.

Starting in the Middle (or the beginning, or the end)

Unfortunately, computers often have to deal with large data files and other objects and it can be inconvenient to manage the entire entity at once.  Stateful encodings are particularly annoying as a switch at the beginning of a 20GB file could change the meaning of a block read at the end of the file.  But this can even happen to simple structures.  If I start looking at an HTTP query string from the end, I might stop when I reach a ? (presumably the beginning of the query string).  Yet another parser might look from the beginning and assume that the query starts at an earlier ?.
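The query string case can be sketched in a few lines.  The URL is made up; the two splits stand in for two parsers that scan from opposite ends.

```python
url = "http://example.test/page?next=/a?admin=1"

# Parser A scans from the end and stops at the first '?' it meets.
query_from_end = url.rsplit("?", 1)[1]

# Parser B scans from the beginning and stops at the first '?' it meets.
query_from_start = url.split("?", 1)[1]

print(repr(query_from_end), repr(query_from_start))
# prints: 'admin=1' 'next=/a?admin=1'
```

One parser sees an admin key and the other sees only a next key whose value happens to contain a question mark.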

It Doesn't Have to Be Malicious

Maybe the user's just trying to see if they can get a Klingon user name on your music service.

Everything I Forgot (or Can't Know)

I'm sure I missed some interesting examples where useful transformations can lead to interesting behavior.  And obviously I don't know about all the cool applications out there that might have some brilliant protocol that connects their clients to their servers.

But even then, if it has an input being transformed to some other output, it is likely subject to the same sorts of transformation problems that lead to security issues.  It's important to ensure the security check is being done on the "right" form of the data.

Reverse Engineering the Transformation Logic

The solution to this transformation problem and the need for an early security check is fairly obvious:  Make sure that the security check "understands" the transformation.

Now, I want you to pause and take a moment to think about that.  This is very appealing.  If the security check can know what the transformation is going to do, then maybe we can get an early read on the security of the inputs before sending the stream on to the business logic of our system.  (Kind of like a firewall I suppose, making sure that only good data gets through).

But:  Is reverse engineering the transformation realistic?  Is my check going to be able to really do EXACTLY the same thing that the eventual transformation does?  When that business logic does its transformation, it's going to call an API in a library or the system to help it.  If there's a bad UTF-8 byte will it be dropped?  Turned into an � or just left in to trip me up?  Is the library an über-library that tries to undo % escapes, named entities and quotes all at the same time?  In which order?  Will a % produced from a named entity sneak through to be decoded as a % escape later?

I do know that it is very difficult to predict how these transformation APIs behave exactly in all circumstances.  Even then they may be updated to fix a bug.  Standards, such as the UTF-8 encoding guidance, have changed over time as security weaknesses were discovered.  You might have an older library - or be using a newer one.


I've encountered an endless stream of bugs from applications that are convinced they *MUST* do something very important *BEFORE* doing the transformation.  I'm pretty sure that after jumping through a ton of hoops to try to get things to work, they still have troublesome edge cases.  Try not to be that app.

Hope this helps someone,