Recycling: The problem with auto-translation

Benefits of auto-translation

When we start localizing a new version of Windows, we obviously don't start from scratch. Instead, we try to recycle as much as we can from previous versions and other projects.

There are several benefits to recycling:
- You can get better consistency in your localized product
- If you're outsourcing the localization work, you can avoid paying for work that has already been done
- If you're me, there are only so many times you can translate "Click next to continue" before you go bonkers...

Recycling is done in several different ways, and one of those ways is through auto-translation. Auto-translation works something like this: For each string that needs to be localized in the new product, a tool tries to find a matching string in a set of glossaries. If a match is found, the translation in the glossary is copied into the new product. As you see, auto-translation is not the same as machine translation -- it's simply a way to reuse previous translation work.
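To make the lookup described above concrete, here is a minimal sketch in Python. It assumes each glossary is a plain dictionary mapping a source string to its translation, and it only does exact matching. The function name `auto_translate` and the sample strings are made up for illustration; real tools will differ.

```python
def auto_translate(strings, glossaries):
    """Copy translations for exact glossary matches; collect the rest."""
    translated = {}
    untranslated = []
    for source in strings:
        for glossary in glossaries:
            if source in glossary:
                # First glossary that knows the string wins.
                translated[source] = glossary[source]
                break
        else:
            # No glossary had a match, so a human has to translate it.
            untranslated.append(source)
    return translated, untranslated


# Hypothetical glossary from a previous version (English -> Swedish).
old_version = {"Traditional": "Traditionell"}

done, todo = auto_translate(["Traditional", "International"], [old_version])
print(done)  # strings that were recycled
print(todo)  # strings still left to localize
```

Note that nothing is being "translated" here in the machine-translation sense; the tool only copies what somebody already translated.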

In addition to the benefits above, auto-translation brings some more to the table. Auto-translation works well for recycling across files/projects -- you can for instance use our glossaries as a base for auto-translation. Also, auto-translation can be, well, automated. It's therefore quite tempting to try and auto-translate every project from all kinds of sources, to get as much "for free" as possible.

Bad idea. There are problems with auto-translation.


The first problem is ambiguous sources. Whatever glossary you use, it's likely that it contains inconsistent translations. Some of these will be intentional, some not. How will the auto-translation algorithm know which item to pick if several possible matches are found? Will it pick the most common translation? Will it skip this item? And after auto-translation is done, how will the result be reviewed? If the algorithm just picks one item, will the reviewer have to look up the other possible translations? Doing so undermines the cost savings of auto-translating. If the algorithm skipped a string, how will the reviewer know which of the remaining strings can actually be found in the glossaries? And if the reviewer doesn't know that a term was skipped because there were two inconsistent, but equally valid, translations, how can you avoid introducing a third inconsistency to the mix?
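One possible answer to those questions is to have the tool report ambiguity instead of hiding it. The sketch below is only an illustration under assumptions: the glossary is a list of (source, translation) pairs, and `build_index` and `classify` are hypothetical names. Each string ends up as "unique", "ambiguous" or "missing", and an ambiguous string comes with all its candidates, most common first, so the reviewer can see what the glossaries already contain.

```python
from collections import Counter, defaultdict


def build_index(entries):
    """Count how often each translation occurs for each source string."""
    index = defaultdict(Counter)
    for source, target in entries:
        index[source][target] += 1
    return index


def classify(source, index):
    """Return (status, candidates) for one source string."""
    candidates = index.get(source)
    if not candidates:
        return ("missing", [])
    if len(candidates) == 1:
        return ("unique", list(candidates))
    # Several competing translations: list them all, most common first.
    return ("ambiguous", [target for target, _ in candidates.most_common()])


entries = [
    ("Space", "Rymd"),
    ("Space", "Rymd"),
    ("Space", "Blanksteg"),
    ("Traditional", "Traditionell"),
]
index = build_index(entries)

print(classify("Space", index))
print(classify("Traditional", index))
print(classify("International", index))
```

Whether the tool should then copy the most common candidate or leave the string alone is still a judgment call; the point is that the reviewer gets told either way.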

The second problem is context. Regardless of the glossaries you use, your project probably contains strings that can have more than one translation depending on context. Is "volume" related to sound or to disks? Does "female" refer to the gender of the user or to what a plug looks like? Even if your glossaries only contain "known good" translations, you can't predict how new strings are being used. Also, it's likely that some UI elements need different translations depending on what type of control they're being displayed in.

The third problem is accuracy. Your auto-translation tool may have settings that affect the result, such as how close a match needs to be, whether capitalization is important, whether decorations such as ellipses and hotkeys are stripped out when matching, and whether the match should take resource type into account. If your auto-translation tool assigns hotkeys arbitrarily, you can expect to have duplicates all over the place after auto-translating. This will take time to clean up. And if your tool has problems handling nulls, you can even end up seeing random crashes in your localized build.
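To show how much those settings change what counts as a "match", here is a small sketch of a normalization step. It assumes the Win32-style convention where "&" marks the hotkey and "&&" is a literal ampersand; the function `normalize` and its parameters are hypothetical and not taken from any particular tool.

```python
import re


def normalize(text, ignore_case=True, strip_decorations=True):
    """Reduce a UI string to the form used for glossary matching."""
    if strip_decorations:
        # "&x" marks x as the hotkey; "&&" is a literal ampersand.
        text = re.sub(r"&(.)", r"\1", text)
        text = text.rstrip()
        # Drop a trailing ellipsis, written as three dots or as one character.
        if text.endswith("..."):
            text = text[:-3]
        elif text.endswith("\u2026"):
            text = text[:-1]
    if ignore_case:
        text = text.casefold()
    return text


# With loose settings, these two strings are the same glossary entry...
print(normalize("&Browse...") == normalize("browse"))
# ...with strict settings they are not.
print(normalize("&Browse...", ignore_case=False, strip_decorations=False)
      == normalize("browse", ignore_case=False, strip_decorations=False))
```

The looser the settings, the more strings you get "for free", and the more of them you have to double-check afterwards. Stripping the hotkey before matching also means somebody has to put a sensible hotkey back in the translation.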


Here's an example that's pretty embarrassing. This is from Swedish XP, the dialog where you can activate your installation over the phone.

This was not caused by me not knowing my ABC. This was caused by auto-translation. The problem is that we have a lot of HTML pages with hotkeys. HTML hotkeys are created by using the AccessKey attribute on e.g. a button. The value of this attribute is included in my project so that I can change the hotkey to a character that's actually used in my translation. Unfortunately, this string "F" was auto-translated into the hotkey for some other element. Less than stellar.
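A mistake like this is cheap to catch with a sanity check after auto-translation. The sketch below assumes you can extract, for each page, the translated label of every control together with its access key; how you get at that data depends on your tools, and `check_access_keys` is a made-up name. It flags keys that don't occur in their own label and keys that are used more than once on the same page.

```python
from collections import Counter


def check_access_keys(controls):
    """controls: list of (translated_label, access_key) pairs for one page."""
    problems = []
    counts = Counter(key.casefold() for _, key in controls)
    for label, key in controls:
        if key.casefold() not in label.casefold():
            problems.append(f"{key!r} does not occur in {label!r}")
        if counts[key.casefold()] > 1:
            problems.append(f"{key!r} is used more than once ({label!r})")
    return problems


# Made-up page: "N" is assigned twice, and "F" is a leftover from somewhere else.
page = [("Nästa", "N"), ("Avbryt", "N"), ("Hjälp", "F")]
for problem in check_access_keys(page):
    print(problem)
```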

Here's another ugly one that was probably caused by auto-translation. This dialog box is taken from Swedish Windows 2000, from the Regional Settings control panel applet. When you use the Spanish locale, you can pick a sort order:

In this dialog box, "Traditional" has been translated into "Traditionell", which is correct. "International" has been translated into "International phone calls". Whoops.

Finally, here's a classic mistake from Windows 98, something I was recently reminded of in a Swedish forum.

The theme "Space" has mostly been correctly translated as "Rymd" (outer space). Except of course for where it got translated into "Blanksteg" - the space key...


To avoid issues like these, I recommend a few things:

Don't auto-translate short strings. In my experience, short strings make up the bulk of the resource count, but only a small part of the overall word count. The longer a string is, the more accurate auto-translation will be (context and semantics are less of an issue). Therefore, simply do not auto-translate strings that are shorter than, say, 20 characters. Sure, there will be a bit more to localize, but you'll save time on reviewing auto-translation and you run less risk of overlooking mistakes.
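As a sketch, the length rule is just a filter in front of the auto-translation step. The threshold of 20 is the "say, 20 characters" from above and should be tuned per project; the name `split_by_length` is hypothetical.

```python
MIN_LENGTH = 20  # assumed threshold, adjust for your own project


def split_by_length(strings, min_length=MIN_LENGTH):
    """Separate strings that are safe to auto-translate from the short ones."""
    to_auto = [s for s in strings if len(s) >= min_length]
    to_human = [s for s in strings if len(s) < min_length]
    return to_auto, to_human


to_auto, to_human = split_by_length(["Space", "Back", "Click Next to continue."])
print(to_auto)   # long enough to auto-translate
print(to_human)  # short strings, left for a human translator
```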

Maintain a standard UI glossary. If you have a glossary with the standard strings, like "Browse...", "Back" or "Add..." and always use this to auto-translate UI only, you can save yourself a lot of time reviewing strings later.

Use cleaner glossaries. If you control the glossary you use as a source, then you can exclude ambiguous strings. It's not always easy to predict what's ambiguous though, and the glossaries might not be under your control.

Take time to review auto-translations, and make sure to have native speakers test your product to uncover linguistic mistakes. If you're localizing software and UA (user assistance) separately, it might be worth letting your UA people hack away at the software for a few days before they start on UA. This can help bring the two teams closer, it can help you find a lot of bugs that would otherwise impact UA, and it can also give those who will localize your help files a chance to get to know the product up front. Result: everyone's happy.

Finally, please please please make time to fix up old mistakes before shipping the next version. Your customers will love you for it.