Overlong UTF-8 Escapes Bite

Every once in a while a security bug pops up that really piques my interest, and a new directory traversal bug that affects Apache Tomcat (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-2938) most certainly made me take notice because I haven't seen this bug type in a lllooonnnggg time.

It caught my eye because of these six little characters:


Many people think these characters represent a 16-bit Unicode character. Wrong. They are an invalid sequence of characters that represent the ‘.' (%2e) character, it's often called an "overlong UTF-8 escape". You may be wondering why I know this little piece of trivia about UTF-8; IIS4 and IIS5 were bitten by the same class of bug eight years ago, and was an attack vector for the Nimda worm. The bulletin that fixed the bug is MS00-078.

Thumbing to page 379 of Writing Secure Code 2nd Edition, I am reminded that the canonical form of a UTF-8 character is the smallest number of bits that can represent that character. Remember, UTF-8 can encode characters wider than 8 bits. Without going into all the involved bit-manipulation, the correct form for a ‘.' character is a one-byte escape: %2e, not a two-byte escape: %c0%ae.

RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."

UrlScan for IIS6, and IIS7's Request Filtering detect and reject non-canonical UTF-8 URLs by default.

A patch for Apache Tomcat is available at https://tomcat.apache.org/security.html.