Language detection, script detection, and how they differ.
Two of the first services out of the gate for ELS are language detection and script detection, both of which are available in the PDC build of Windows 7. Both of these are already being used by different components in different ways and I thought it might be useful to do a general overview of what they are, how they differ, and how you can use them.
First, let’s start by defining what we’re detecting. A language for ELS refers to the semantic content of a string, and corresponds pretty much to your intuitive notion. English is a language. French is a language. Chinese is a language. A script in this context refers to the writing system that the string is encoded in. The Chinese language, for instance, is written in two scripts: Simplified, which happens to be used in the PRC, and Traditional, which happens to be used in Hong Kong and Taiwan. Cyrillic is a script. Devanagari is a script. Arabic is a language, but it’s also the name of a script (which is used to write the Arabic language as well as some other languages, such as Urdu and Pashto). ELS allows you to retrieve both the language and the script of whatever string you pass in. If you pass this paragraph, you’ll discover that it’s the English language, written in the Latin script.
When it comes to language detection, we do better and better the more text you pass us. If you pass this string:
This is an English sentence.
We’ll return you a guess list of languages in order of our confidence. In this case, the sentence is English, so English (represented with its appropriate ISO code, en) will be returned first on the list. If we make the string a bit shorter:
This is an
We have less data to work with, but we can guess that this is an English fragment, and so our guess list will still return en at the top of the list. As we get smaller and smaller:
We’ll return fewer and fewer guesses. We can still handle fragments that are recognizably English, such as the first item on this list, but by the time we work down to the single character or word fragments at the bottom of the list, we’ll start returning an empty string; we’d rather tell you we don’t know than make a bad guess. One of our general goals is to do whatever humans can do in terms of language detection. It’s a good bet that if a human can’t tell what language the single T belongs to, then our language detection can’t either.
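To make the shape of that contract concrete (a guess list ordered by confidence, and nothing at all when the evidence is too thin), here is a toy Python sketch. It is not how ELS works internally and it does not call ELS; the function name guess_languages, the tiny word lists, and the min_hits threshold are all made up for illustration.

```python
import re

# Hypothetical, deliberately tiny function-word lists. A real detector
# relies on far richer models; these exist only to illustrate the idea.
FUNCTION_WORDS = {
    "en": {"the", "is", "an", "a", "this", "of", "and", "to"},
    "fr": {"le", "la", "est", "un", "une", "ce", "de", "et"},
    "de": {"der", "die", "das", "ist", "ein", "eine", "und", "zu"},
}

def guess_languages(text, min_hits=2):
    """Return language codes ordered by evidence, or [] when unsure."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in FUNCTION_WORDS.items()
    }
    # Keep only languages with enough evidence, strongest first.
    confident = [lang for lang, count in hits.items() if count >= min_hits]
    return sorted(confident, key=lambda lang: -hits[lang])

print(guess_languages("This is an English sentence."))
print(guess_languages("This is an"))
print(guess_languages("T"))
```

In this sketch both English strings come back with en at the top of the list, while the single T comes back as an empty list rather than a guess.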
However, while a single T doesn’t clearly belong to any particular one language, ELS (and a human) can tell with confidence what script it belongs to, since script identification is a property at the character level. So if you pass a mixed script string to ELS and ask for script detection:
I used to know 望遠鏡で日本人男性.
We’ll identify each script range that we find, telling you which characters are Latin, which are Katakana or Hiragana, and which are Kanji, along with their position in the string. If you pass the same string into language detection, we’ll tell you that we find both English and Japanese, but we don’t break the results into associated ranges. This is arguably something that we should do, but it isn’t implemented this way today (and it’s a non-trivial problem). For this reason, a number of ELS callers use script and language detection in combination: they first pass strings into script detection to break them into ranges, and then pass the resulting subranges back into language detection to improve the accuracy for each range.
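The range-break-then-detect pattern can be sketched roughly as follows in Python. This is an approximation under stated assumptions, not ELS code: script_of guesses a character's script from the first word of its Unicode character name (the standard library does not expose the Unicode Script property directly), kanji are reported as HAN, and detect_language is a hypothetical caller-supplied function standing in for whatever language detection service you use.

```python
import unicodedata

def script_of(ch):
    """Approximate a character's script from the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # spaces, digits, and punctuation carry no script of their own
    first_word = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
    return "HAN" if first_word == "CJK" else first_word

def script_ranges(text):
    """Split text into (script, start, end) runs; end is exclusive."""
    runs = []
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters never start or break a run
        if runs and runs[-1][0] == script:
            runs[-1][2] = i + 1  # same script as the current run: extend it
        else:
            runs.append([script, i, i + 1])
    return [tuple(run) for run in runs]

def detect_per_range(text, detect_language):
    """Call detect_language (a hypothetical stand-in) on each script run."""
    return [
        (script, start, end, detect_language(text[start:end]))
        for script, start, end in script_ranges(text)
    ]

sample = "I used to know 望遠鏡で日本人男性."
for script, start, end in script_ranges(sample):
    print(script, start, end, repr(sample[start:end]))
```

For the sample string this yields four runs: LATIN for characters 0 to 13, HAN for 15 to 17, HIRAGANA for the single character at 18, and HAN again for 19 to 23. Each substring can then be handed to language detection on its own through detect_per_range.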
One question that people are asking very frequently is how long strings need to be in order to rely on the accuracy of ELS's language detection support. The truth is that it depends somewhat on the string itself. Languages that by virtue of their writing system can be uniquely identified (e.g. Thai) will return accurate results from single-character strings, since those strings cannot be any other language. Where a writing system can be used to represent multiple languages (ELS supports dozens of Latin-script languages, for instance), syntactic fragments (not necessarily complete sentences) are often required for perfect accuracy. ELS provides script detection for every writing system in Unicode 5.1 and language detection for roughly 100 languages. We often end up back at the generalization above; if a human can't tell what language it is, it's a good bet that ELS can't either. And if we're not sure, we'd rather tell you we're not sure than offer you a false positive result.
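The point about writing systems that identify a language by themselves can be illustrated with a short Python sketch. Again, this is only an illustration and not ELS's implementation: the SCRIPT_TO_LANGUAGE table is a small hypothetical sample, the script is approximated from the first word of each character's Unicode name, and language_from_script is an invented helper name.

```python
import unicodedata

# Illustrative table only: scripts that this sketch treats as pointing at a
# single language. Latin, Cyrillic, Arabic, and Han are left out on purpose
# because each of them is shared by several languages.
SCRIPT_TO_LANGUAGE = {
    "THAI": "th",
    "HANGUL": "ko",
    "HIRAGANA": "ja",
    "KATAKANA": "ja",
}

def language_from_script(text):
    """Return a language code if the letters' scripts settle it, else None."""
    candidates = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, and punctuation
        script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        candidates.add(SCRIPT_TO_LANGUAGE.get(script))  # None = shared script
    if len(candidates) == 1:
        return next(iter(candidates))  # may still be None for a shared script
    return None  # no letters, or conflicting evidence: say "don't know"

print(language_from_script("ก"))     # a single Thai letter
print(language_from_script("T"))     # Latin: could be many languages
print(language_from_script("日本"))  # Han: shared by Chinese and Japanese
```

A single Thai letter is enough to return th here, whereas a lone T or a pair of Han characters returns None, which mirrors the preference for "not sure" over a false positive.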
More usage tips to come!