Unicode, IDN (IDNA), EAI (IMA) and Homograph Security

I wrote about IDN & Security before http://blogs.msdn.com/shawnste/archive/2005/03/03/384692.aspx but thought I'd share some of my more updated views about security of URLs/IDN/Unicode/Email addresses.

People didn't really bother much with DNS or character-based security while it was limited to ASCII.  I'm not sure if this is because people just didn't think about it, or because they thought there wasn't a problem, or whatever.  The security attacks that did happen have been regarded more as "oh, that's curious" than as a real concern.  Basically there seems to be a presumption that a script, like the ASCII subset of Latin, is inherently secure.  From there it seems reasonable to conclude that if ASCII Latin can be secure, and other scripts or mixed-script environments have homographs, then those scenarios must be insecure and are therefore broken.

Latin and ASCII aren't Secure

The problem with that logic is that it's flawed.  Homographs exist in Latin/ASCII too; however, http://rnicrosoft.com tends to be regarded as "quaint and amusing" rather than a security problem.  (There used to be a web page there, dunno what happened).  Similarly g00gle or MlCROSOFT or whatnot can all happen in ASCII.  Some things can be done to ASCII to limit the risk, such as choosing fonts or making things lowercase, but that's not always possible.
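
As a rough illustration of how ASCII-only look-alikes could be compared, here is a minimal Python sketch.  The look-alike table and the function name `ascii_skeleton` are my own inventions for this example, not any standard; a real table would be much larger and depend on the font.

```python
# Hypothetical, deliberately tiny table of ASCII look-alikes (assumption, not a standard).
ASCII_LOOKALIKES = [
    ("rn", "m"),  # "rn" reads as "m" in many fonts
    ("vv", "w"),
    ("0", "o"),
    ("1", "l"),
]


def ascii_skeleton(label: str) -> str:
    """Collapse a few ASCII look-alikes so visually similar labels compare equal.

    Both strings must be passed as displayed (same case), because capital I
    only resembles lowercase l before any case folding happens.
    """
    # Capital I looks like lowercase l, so fold it before lowercasing.
    folded = label.replace("I", "l").lower()
    for lookalike, canonical in ASCII_LOOKALIKES:
        folded = folded.replace(lookalike, canonical)
    return folded


if __name__ == "__main__":
    for fake, real in [("rnicrosoft", "microsoft"),
                       ("g00gle", "google"),
                       ("MlCROSOFT", "MICROSOFT")]:
        print(fake, real, ascii_skeleton(fake) == ascii_skeleton(real))
```

The skeleton is only useful for comparing two strings; it is not a safe display form, and it produces false matches of its own (for example "corn" and "com" collapse together).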

Strings are Typed and Read by Humans

Even if the scripts themselves are perfect, the strings we use with the scripts are not.  For example, users have to type them in, and they may or may not use upper or lower case (in cased scripts).  I heard one computer expert indicate that users should just figure out how to enter URLs in lower case, in Unicode Normalization Form C.  (Instead of addressing the problem we should educate all the users).  I wish he were joking.

Depending on the context, there are things you can do to ASCII-only strings that can confuse users.  For example, http://microsoft.secure.com isn't necessarily going to go to a Microsoft site.  http://secure.com/microsoft.com is a similar trick.
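
To make the two tricks concrete, this Python sketch pulls each URL apart with the standard library's `urllib.parse.urlsplit`.  The helper `naive_registered_domain` is hypothetical and simply assumes the last two labels are the registered domain, which is wrong for suffixes like co.uk; a real check would need a public suffix list.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host that the URL actually points at (lowercased by urlsplit)."""
    return urlsplit(url).hostname or ""


def naive_registered_domain(url: str) -> str:
    """Assumption for illustration: the last two labels are the registered domain."""
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


if __name__ == "__main__":
    for url in ("http://microsoft.secure.com",
                "http://secure.com/microsoft.com"):
        parts = urlsplit(url)
        print(url)
        print("  host:", parts.hostname)
        print("  path:", parts.path)
        print("  registered domain (naive):", naive_registered_domain(url))
```

In both cases "microsoft" shows up somewhere in the string, but the party that controls the destination is whoever owns secure.com.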

DNS isn't the only subject of these problems.  I get mail all the time in the form company@mail-servicing.com where "company" is a legitimate company and "mail-servicing" is the people they've contracted to send their bulk mail.  So it's impossible for me to determine if that's actually a good address for the company.  Even worse is when the mail contains a link:  "Provide feedback about your recent warranty support to http://feedback-surveys.com/OEMsupport".

Strings aren't Even Strings

Sometimes what we click on isn't even related to where we end up going.  We've all seen phishing attacks that look like mybank.com but go to an IP address, and no one can tell whether that address is real or not.
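
A mail client or browser could at least notice when the visible text names one host and the link goes somewhere else.  The Python sketch below is one hedged way to do that; the function name `link_text_mismatch` and the "looks like a host name" test (contains a dot, no spaces) are assumptions made for this example, and 192.0.2.10 is a reserved documentation address.

```python
from urllib.parse import urlsplit


def link_text_mismatch(display_text: str, href: str) -> bool:
    """True when the visible text looks like a host name that differs from the real host."""
    shown = display_text.strip().lower()
    # Only compare when the text plausibly names a host (crude assumption).
    if " " in shown or "." not in shown:
        return False
    if "://" not in shown:
        shown = "http://" + shown
    shown_host = urlsplit(shown).hostname or ""
    real_host = urlsplit(href).hostname or ""
    return shown_host != real_host


if __name__ == "__main__":
    print(link_text_mismatch("mybank.com", "http://192.0.2.10/login"))
    print(link_text_mismatch("mybank.com", "https://mybank.com/login"))
    print(link_text_mismatch("Click here", "http://192.0.2.10/login"))
```

Note that the last case is not flagged at all: text like "Click here" makes no claim about the destination, so there is nothing to compare against.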

Strings aren't Always Specific

In some environments strings often aren't even very specific.  I'm pretty certain that if I want a live.com account that I won't get shawn or shawns or even shawnsteele.  Instead I'll be shawn7935 or something.  There's another Shawn here at work that gets some of my mail from simple typos, let alone malicious intent.  There's a pretty good chance that Fred8374 could pass himself off as Fred8347 if he really wanted to.  

We've even been trained that strings don't even have to be close.  If I buy something on eBay from "JoesBestStuff", it takes some faith for me to pay SallySewing7@live.com (apologies if those are real accounts).  I've been quite amused at the variation between the "seller's name" and the email sometimes.

Even when we expect them to be the same, there are many spellings for some words.  "Mohammed" is often transliterated differently to Latin.  Unless you deal with a particular spelling quite often, you're likely to assume most spellings refer to the same name.

Globalization of Strings

Now we've figured out that strings aren't secure, and we'll get tricked even if they were secure.  How does that change in a global environment, such as with IDNA or EAI/IMA strings?  Not much.

Sticking to Latin, you suddenly gain a bunch of look-alikes (homographs) by allowing non-ASCII values.  Strings like mícrosoft, mïcrosoft and mıcrosoft are all “close enough” to be confused, particularly at a quick glance, even more so if the user is conditioned to expect the "real" string.  E.g.:  "Important security update for windows, go download it from Mícrosoft.com".  We're already expecting to see microsoft, so the few different pixels are easily missed.
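
One obvious thing to try is stripping the accents before comparing.  This Python sketch does that with the standard `unicodedata` module (the function name `strip_marks` is mine), and it also shows where the idea runs out: the dotless ı is a separate letter with no decomposition, so it survives untouched.

```python
import unicodedata


def strip_marks(label: str) -> str:
    """Decompose to NFD and drop combining marks such as acute accents and diaereses."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for label in ("m\u00edcrosoft",   # i with acute
                  "m\u00efcrosoft",   # i with diaeresis
                  "m\u0131crosoft"):  # dotless i
        print(label, strip_marks(label) == "microsoft")
```

So even within Latin, a simple normalization step catches some look-alikes and misses others.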

For other scripts the problem can be much more severe.  Complex scripts can have similar-appearing strings, and many include numerous characters.  Chinese, for example, has enough characters available that it can be fairly easy in some cases to find a rare character that is similar in appearance to a common character which people have been preconditioned to expect.

"I Solved Homographs"

This leads to a typical problem for developers, particularly "Western" Latin-script based developers.  Developers tend to expect that if they solve script mixing so that we can't mix up Cyrillic and Latin, that they've solved the homograph problem.  Instead, they've barely scratched the surface and effectively buried their heads in the sand.
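
Here is roughly what a script-mixing check looks like, as a Python sketch.  Python's standard library doesn't expose the Unicode Script property, so `rough_scripts` (a name I made up) uses the first word of each character's Unicode name as a crude stand-in; that is an assumption for illustration and not how a production check should work.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC" (crude approximation)
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(rough_scripts(label)) > 1


if __name__ == "__main__":
    print(is_mixed_script("micros\u043eft"))   # Cyrillic o among Latin letters
    print(is_mixed_script("m\u00edcrosoft"))   # all Latin, accented i
    print(is_mixed_script("rnicrosoft"))       # all ASCII
```

The check flags the Cyrillic substitution, but the accented-Latin and plain-ASCII look-alikes are single-script strings and pass straight through.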

In some cases the "solution" can be worse than the problem.  For example, some browsers decide that I don't understand Cyrillic since my user locale is en-US (or Klingon), and then print out Punycode.  That's mildly useful to me as a warning, however it does the same thing for Chinese.  It's very unlikely that I'm going to confuse Chinese with Latin, but I'll get Punycode in the address bar anyway.  Now I have no chance of finding out what the actual URL is supposed to look like.  Punycode is all gibberish, but I could probably decipher a Chinese glyph enough to see if it looked similar to what I expected.  With Punycode strings, I don't even need homographs to confuse me; any Chinese would look the same.  For that matter I could be expecting Chinese, but it could actually be Japanese or Korean, or Cyrillic for that matter.  I'm not trying to say that the browsers' approach is "wrong", just that while this approach may address some problems, it can also cause new ones.
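
To see what the address bar ends up showing, this Python sketch converts a few names to their ASCII ("xn--") form and back using Python's built-in `idna` codec.  That codec implements the older IDNA2003 rules, which is an assumption worth noting; the helper names `to_ascii` and `to_unicode` and the sample names are just for this example.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII-compatible form (IDNA2003 via Python's codec)."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    for name in ("\u4e2d\u6587.com",                     # Chinese
                 "\u043f\u0440\u0438\u043c\u0435\u0440.com",  # Cyrillic
                 "m\u00edcrosoft.com"):                  # accented Latin
        ascii_form = to_ascii(name)
        print(ascii_form, "->", to_unicode(ascii_form))
```

Printed side by side, the ASCII forms give a reader no visual hint about which script, let alone which name, is behind them.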

Most of the "solutions" to Homographs that I've seen are similar in my opinion.  They may address a specific issue, but don't solve the entire problem globally.  I also think some approaches are unnecessarily limiting.  Mitigations that reduce the surface area for an attack are useful, however developers should recognize the limitations of those approaches and make sure they aren't spending tons of effort "shutting the window, but leaving the front door wide open."  That only provides a false sense of security, which can be far worse than the original problem.

Comprehensive Solutions

So instead of thinking that strings like URLs are somehow inherently secure if they're ASCII, and focusing on the differences from ASCII, like Cyrillic homographs, we should rather assume that ANY URL might not take us to a place we want to go.  Even an ASCII one.

A much better solution to URL security is one that addresses the entire system rather than focusing on Homographs.  IE, for example, detects malicious web sites (I don't know exactly how it works, but I gather there's blacklisting and bad behavior detection, kinda like virus checking for web sites).  This is far more effective than preventing mixed scripts, and has the advantage of working with ASCII-only URLs.  It also does a good job against homographs, pretty much making the Punycode-in-the-address-bar irrelevant.  It also works with many forms of attack, even non-obvious ones.

My opinion is that if you do a "good job" of detecting any phishing/spoofing type web site, even ASCII-only, then the need for Homograph detection is much reduced.  And if you can't do that, then the attackers will merely add an extra label or something to get around your homograph detection.

Mitigation by Protocol

For things like IDN, it is interesting to consider how the protocol itself approaches security.  Some things are "obvious" as not being interesting for a name.  Compatibility characters, control characters, etc. could somewhat readily be excluded.  Some things are generally considered technically "obvious" to some users, but may frustrate others.  It is generally considered that lower casing the DNS name causes less confusion (can't mix up lower case l with capital I), but I doubt that AAA.com prefers lower casing.  Similarly IDNA2003 allows Unicode "symbols," which are widely regarded as being useless, particularly since they're hard to type, but I suspect that someone would like I♥NY.  So there's a gray area that gets a bit confusing.
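
The kinds of protocol-level exclusions mentioned above can be sketched in a few lines of Python.  This is not the actual IDNA rule set; the function `protocol_level_problems` and its particular choices (flag control and format characters, symbols, compatibility characters, and uppercase) are assumptions picked to mirror the examples in this section.

```python
import unicodedata


def protocol_level_problems(label: str) -> list:
    """List reasons a hypothetical protocol rule might reject or rewrite a label."""
    problems = []
    for ch in label:
        category = unicodedata.category(ch)
        if category in ("Cc", "Cf"):
            problems.append(f"control/format character U+{ord(ch):04X}")
        elif category.startswith("S"):
            problems.append(f"symbol U+{ord(ch):04X}")
    # NFKC folds compatibility characters (e.g. fullwidth letters) to their plain forms.
    if unicodedata.normalize("NFKC", label) != label:
        problems.append("compatibility or non-normalized characters")
    if label != label.lower():
        problems.append("uppercase letters")
    return problems


if __name__ == "__main__":
    for label in ("example", "AAA", "I\u2665NY", "\uff45xample"):
        print(repr(label), protocol_level_problems(label))
```

Every rule here is easy to state, yet each one rejects something that somebody, somewhere, would consider a perfectly reasonable name.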

Consideration for other protocols is similar.  EAI (email) is interesting because it basically defers "correctness" to the registrar (whoever runs the mail server).  IDN provides some restriction by protocol and more at the registrar level.

One problem with restricting valid characters at the protocol level is that it works OK in a small set, but once you get to a global audience the rules get very complicated.  Domain names allowed (most) English names when they were restricted to ASCII, but German and French had difficulties.  With IDN additional languages are supported, but perhaps the needs of an English registrar and a German one differ.  A complete set of rules applicable world-wide for all strings in all languages may not be possible (e.g. Turkish i), but even if it were, the rules would be very complex and difficult to implement for every application adopting a protocol.
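
The Turkish i case is easy to demonstrate in Python.  The built-in `str.lower()` applies the language-neutral Unicode mapping, while the `turkish_lower` helper below is a hypothetical function written for this example that applies the Turkish convention (I pairs with ı, and İ pairs with i).

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_lower(text: str) -> str:
    """Lowercase using the Turkish convention: İ -> i and I -> ı (illustrative only)."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


if __name__ == "__main__":
    word = "ISTANBUL"
    print("default:", word.lower())
    print("turkish:", turkish_lower(word))
    print("same result?", word.lower() == turkish_lower(word))
```

The same uppercase string lowercases to two different strings depending on which language's rule you apply, so a single protocol-wide case mapping has to pick one and be wrong for someone.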

Mitigation by Registrar

Restriction at the registrar can be more effective, though perhaps less consistent.  A registrar could be like a domain name registrar, but for these purposes you could also think of the person that assigns user accounts at a business, or email address registration from your ISP.

Registrars can restrict languages to those used in the country they support.  They can bundle or block homographs or alternate spellings (like Traditional and Simplified Chinese spellings of the same word).  In a business they could have certain rules.  (First name, last initial, or first initial, last name is common for user accounts in many companies, at least until they get too many employees.)
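
Bundling and blocking could be sketched like this in Python.  The `Registrar` class, its `register` method, and the `simple_skeleton` function are all hypothetical names for this example; the skeleton only folds case and strips accents, so things like Traditional versus Simplified Chinese variants would need a proper variant table that isn't shown here.

```python
import unicodedata


def simple_skeleton(name: str) -> str:
    """Fold case and strip combining marks (a stand-in for a real variant table)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


class Registrar:
    """Toy registrar that bundles look-alike names under their first owner."""

    def __init__(self, skeleton=simple_skeleton):
        self._skeleton = skeleton
        self._owner_by_skeleton = {}

    def register(self, name: str, owner: str) -> bool:
        """Accept the name unless a look-alike already belongs to someone else."""
        key = self._skeleton(name)
        holder = self._owner_by_skeleton.setdefault(key, owner)
        return holder == owner


if __name__ == "__main__":
    registrar = Registrar()
    print(registrar.register("microsoft", "alice"))
    print(registrar.register("m\u00edcrosoft", "mallory"))  # blocked: look-alike taken
    print(registrar.register("m\u00edcrosoft", "alice"))    # bundled: same owner
```

Because the skeleton function is passed in, each registrar could plug in whatever notion of "too similar" fits the languages it serves.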

IDN has some restrictions by protocol, but allows much tighter restriction at the registrar level.  Ironically, a label at a lower level could then have different "rules" than at the higher level.  EAI allows the local part to be determined entirely by the provider/registrar rather than the protocol.

A complete set of rules at the "registrar" level can still be very complex; however, a registrar can adopt whichever rules apply to its own environment, even where those differ conceptually from another registrar's, whereas a protocol-level rule has to either be too flexible or disallow some registrar's legitimate scenario.  Rules at the registrar level can also be adjusted more readily than at the protocol level.

Mitigation by Application

An application can also decide to be more comprehensive than the protocol.  An application may also have more information, such as blacklists or user settings.  They can make choices for some users like "they only read English, so don't bother with Cyrillic then," and a different choice for a different user.  Applications can also potentially be grayer in their behavior.  Instead of "allowing" and "disallowing" strings, they can say "gee, I'm not so sure, you really want to do this?", or flag it and continue.  They can also be dynamic, such as when you add a sender to a junk mail filter.
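
The "grayer" behavior described above might look something like this Python sketch.  Everything here is hypothetical: the `Verdict` enum, the `judge` function, and the idea of passing in the scripts a user reads are assumptions for illustration, not any particular application's design.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"    # "gee, I'm not so sure, you really want to do this?"
    BLOCK = "block"


def judge(host: str, host_scripts: set, user_scripts: set, blocked_hosts: set) -> Verdict:
    """Give a graded answer instead of a plain allow/disallow."""
    if host in blocked_hosts:
        return Verdict.BLOCK
    # Warn when the host uses a script this particular user doesn't read.
    if not host_scripts <= user_scripts:
        return Verdict.WARN
    return Verdict.ALLOW


if __name__ == "__main__":
    blocked = set()
    reads = {"LATIN"}
    print(judge("example.com", {"LATIN"}, reads, blocked))
    print(judge("\u043f\u0440\u0438\u043c\u0435\u0440.com", {"CYRILLIC", "LATIN"}, reads, blocked))
    # Dynamic behavior: the user reports a host, and it is blocked from then on.
    blocked.add("example.com")
    print(judge("example.com", {"LATIN"}, reads, blocked))
```

The same host can get a different verdict for a different user, and the blocklist can change at runtime, which is exactly the flexibility a protocol-level rule can't offer.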

IDN vs EAI/IMA vs Unicode

Pretty much this entire "strings aren't secure" concept applies to any Unicode (or for that matter any other code page) string.  That could be an IDN domain name, an EAI mail address, a user account name, etc.  Some environments may be more amenable to certain solutions than others, but the types of attacks that impact a Unicode IDN label could also succeed with the local (user name) part of a Unicode EAI email address.  The general concepts are portable.

I used IDN heavily as an example, but the same things happen to EAI addresses, user account names, logon credentials, etc.  Anything that uses Unicode, or strings, needs to realize that strings can't be expected to be inherently "secure."

There's more info on some thinking about Unicode Security in Unicode TR#39 http://www.unicode.org/draft/reports/tr39/tr39.html.  TR39 addresses the appropriate use of Unicode characters and homographs, but this is at best a mitigation of the more general security concerns of identifier strings.  Phishing and spoofing would still happen even in plain ASCII.

Hope this was helpful, or at least interesting,