fully entering the i18n discourse community with bidirectional text


Punch line:


Because Microsoft understands the available market in Hebrew and Arabic locales, i have been working in the realm of bidirectional text. More accurately, my head is dizzy with words like abecedary, abjad, diacritic, and abugida, concepts like logical order vs. display order, the fact that a single line of text can run in both directions at once, and "i can hardly tell my right from left hand" (thank you sujit). For instance, i became quite obsessed with ligatures, as they can show where a piece of software lacks full Unicode support. Some software mangles ligatures. The reader will still be able to understand the text, but it will look funky. It's like "umbrella" being written "ümbrella"...you can read it, but it's obviously wrong. The umlaut'd u is an example of a diacritic. These diacritics are important. Such accents allow for unambiguous interpretation of religious texts and law, or for syntactic sugar in literature.

i had erroneously shown my boss a comparison of our product running on the Silverlight 3.0 vs. 4.0 runtimes, noting that text displayed "just fine." However, a PM pointed out to me that ligature examples would need to be considered before i could say that it was displaying correctly in SL 4.0 and 3.0, thus precipitating my Alice in Wonderland-like adventure into the i18n discourse community.


Background reading:


An umlaut is the orthographical representation of a type of sound shift in spoken language. A very similar diacritical mark (called diaeresis or "trema") is used to signify a linguistic hiatus. In modern computer systems (using Unicode), umlaut and diaeresis are represented identically: ä represents both a-umlaut and a-trema


Pasted from <http://en.wikipedia.org/wiki/Umlaut_(diacritic)>
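
To see what "represented identically" means in practice, here is a minimal sketch using Python 3's standard unicodedata module (the variable and function names are my own). Unicode offers one precomposed code point for ä and one combining mark that can follow a plain "a"; neither spelling records whether the author meant an umlaut or a diaeresis.

```python
import unicodedata

# Precomposed form: a single code point, U+00E4.
precomposed = "\u00e4"
# Decomposed form: plain "a" followed by U+0308 COMBINING DIAERESIS.
decomposed = "a\u0308"

def describe(text):
    """Return the Unicode name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

print(describe(precomposed))
print(describe(decomposed))

# The two spellings are different code point sequences...
print(precomposed == decomposed)
# ...but they are canonically equivalent, so normalizing makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Note that the character name says DIAERESIS in both spellings; there is no separate "umlaut" code point to choose instead.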



On the other hand, copies of the Qurʼan cannot be endorsed by the religious institutes that review them unless the diacritics are included


This is why in an important text such as the Qur'an the vowels are mandated.



The only compulsory ligature is lām + ʼalif. All other ligatures (yāʼ + mīm, etc.) are optional.

(isolated) lām + ʼalif (lā /laː/): ﻻ (U+FEFB)


(final or medial) lām + ʼalif (lā /laː/): ﻼ (U+FEFC)


Unicode has a special glyph for the ligature Allāh (“God”), U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM: ﷲ


The latter is a work-around for the shortcomings of most text processors, which are incapable of displaying the correct vowel marks for the word Allāh, because it should compose a small ʼalif sign above a gemination šadda sign. Compare the display of the composed equivalents below (the exact outcome will depend on your browser and font configuration):

lām, (geminated) lām (with implied short a vowel, reversed) hāʼ :


ʼalif, lām, (geminated) lām (with implied short a vowel, reversed) hāʼ :




Pasted from <http://en.wikipedia.org/wiki/Arabic_alphabet>
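
A rough way to poke at these ligatures from code, assuming Python 3 and its standard unicodedata module (the constant and function names are mine). The ligature code points are compatibility characters, so NFKD normalization breaks each one back into the ordinary letters it stands for. As far as i can tell, real text is normally stored as those ordinary letters, and the font or shaping engine is expected to draw the ligature.

```python
import unicodedata

LAM_ALEF_ISOLATED = "\ufefb"
LAM_ALEF_FINAL = "\ufefc"
ALLAH_LIGATURE = "\ufdf2"

def expand(ligature):
    """Return the names of the code points in the compatibility decomposition (NFKD)."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFKD", ligature)]

for lig in (LAM_ALEF_ISOLATED, LAM_ALEF_FINAL, ALLAH_LIGATURE):
    # Print the code point, its official name, and the letters it decomposes to.
    print(f"U+{ord(lig):04X}", unicodedata.name(lig), "->", expand(lig))
```

The decomposition of U+FDF2 contains only the four base letters and no vowel marks, which fits the quoted remark that the single code point exists as a work-around for composing those marks.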



The Arabic script has numerous diacritics, including iʿjam (إعجام), consonant pointing, and tashkīl (تشكيل), supplementary diacritics. The latter include the ḥarakāt (حركات, singular ḥaraka حركة), vowel marks.


Pasted from <http://en.wikipedia.org/wiki/Arabic_diacritics>
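
The split between consonant pointing and vowel marks shows up in how the text is encoded. A minimal sketch, assuming Python 3 and the standard unicodedata module (strip_marks and the sample word are my own illustration): the ḥarakāt are separate combining code points that can be filtered out, while the iʿjam dots are built into the letter code points and survive.

```python
import unicodedata

def strip_marks(text):
    """Drop nonspacing marks (general category Mn), which covers the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# kaf + fatha, teh + fatha, beh + fatha: three letters, each followed by a vowel mark.
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
bare = strip_marks(vocalized)

print(len(vocalized), len(bare))
print([unicodedata.name(ch) for ch in bare])
```

This is a blunt filter: it removes every nonspacing mark, not only vowel marks, so it is a way to explore the encoding rather than something to run on a text where the marks matter.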




According to the formulations of Daniels, abjads differ from alphabets in that only consonants, not vowels, are represented among the basic graphemes.


Pasted from <http://en.wikipedia.org/wiki/Abjad>


An abugida (pronounced /ˌɑːbuːˈɡiːdə/, from Ge‘ez አቡጊዳ ’äbugida), also called an alphasyllabary, is a segmental writing system which is based on consonants, and in which vowel notation is obligatory but secondary. This contrasts with an alphabet proper, in which vowels have status equal to consonants, and with an abjad, in which vowel marking is absent or optional. (In less formal treatments, all three are commonly called alphabets.) Abugidas include the extensive Brahmic family of scripts used in South and Southeast Asia.

The term abugida was suggested by Peter T. Daniels in his 1990 typology of writing systems.[1] It is an Ethiopian name of the Ge‘ez script, ’ä bu gi da, taken from four letters of that script the way abecedary derives from Latin a be ce de. As Daniels used the word, an abugida contrasts with a syllabary, where letters with shared consonants or vowels show no particular resemblance to one another, and with an alphabet proper, where independent letters are used to denote both consonants and vowels. The term alphasyllabary was suggested for the Indic scripts in 1997 by William Bright, following South Asian linguistic usage, to convey the idea that "they share features of both alphabet and syllabary".[2][3] Abugidas were long considered to be syllabaries or intermediate between syllabaries and alphabets, and the term "syllabics" is retained in the name of Canadian Aboriginal Syllabics. Other terms that have been used include neosyllabary (Février 1959), pseudo-alphabet (Householder 1959), semisyllabary (Diringer 1968; a word which has other uses) and syllabic alphabet (Coulmas 1996; this term is also a synonym for syllabary).[3]




In general, a letter of an abugida transcribes a consonant. Letters are written as a linear sequence, in most cases left to right. Vowels are written through modification of these consonant letters, either by means of diacritics (which may not follow the direction of writing the letters) or by changes in the form of the letter itself.

Vowels not preceded by a consonant may be represented with a zero consonant letter, modified to indicate the vowel, or separate letters for each vowel, that are distinct from the corresponding dependent vowel signs.

Consonants not followed by a vowel may be represented with conjunct consonant letters where two or more letters are graphically joined in a ligature, or dependent consonant signs, which may be smaller or differently placed versions of the full consonant letters, or may be distinct signs altogether.



Pasted from <http://en.wikipedia.org/wiki/Abugida>
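
Devanagari, one of the Brahmic scripts, makes a handy test case for the description above. A small sketch, assuming Python 3 and the standard unicodedata module (spell and the two sample strings are my own choices): a dependent vowel sign is stored after its consonant, and a conjunct is stored as consonant + virama + consonant, with the joining left to the font.

```python
import unicodedata

def spell(text):
    """List (code point, general category, name) for each character, in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

ki = "\u0915\u093f"          # ka + dependent vowel sign i
ksha = "\u0915\u094d\u0937"  # ka + virama + ssa, usually drawn as one conjunct

for sample in (ki, ksha):
    for entry in spell(sample):
        print(entry)
    print()
```

The vowel sign i is a good example of a diacritic that does not follow the writing direction: it is stored after the ka (logical order) but is drawn to its left (display order).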


This document provides advice for the use of HTML markup and CSS style sheets to create pages for languages that use right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana, Urdu, etc. It explains how to create content in right-to-left scripts that builds on but goes beyond the Unicode bidirectional algorithm, as well as how to prepare content for localization into right-to-left scripts.


Pasted from <http://www.w3.org/TR/i18n-html-tech-bidi/>
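
As i understand it, the markup side of this mostly comes down to declaring direction explicitly with the standard HTML dir attribute ("ltr" or "rtl") next to lang. A hedged sketch in Python 3 of generating such markup (the paragraph helper is hypothetical, not something from the W3C document):

```python
from html import escape

def paragraph(text, lang, direction):
    """Wrap text in a <p> element carrying explicit lang and dir attributes."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    # escape() protects both the attribute value and the element content.
    return f'<p lang="{escape(lang)}" dir="{direction}">{escape(text)}</p>'

# The Hebrew word "shalom", stored in logical order; the browser handles display order.
print(paragraph("\u05e9\u05dc\u05d5\u05dd", "he", "rtl"))
print(paragraph("a < b", "en", "ltr"))
```

The string itself is never reversed. It stays in logical order, and the dir attribute tells the browser which base direction to lay it out in.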


Which languages are written right-to-left (RTL)? Also: Which script should I use?


See the Background information.

Languages don't have a direction. Scripts have a writing direction, so languages written in a particular script will be written with the direction of that script.

Languages can be written in more than one script. For example, Azeri can be written in any of the Latin, Cyrillic, or Arabic scripts. When written in Latin or Cyrillic scripts, Azeri is written left-to-right (LTR). When written in the Arabic script, it is written right-to-left.
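
The point that direction belongs to the script can be checked per character. A minimal sketch, assuming Python 3 and the standard unicodedata module: base_direction is my own helper, a simplified "first strong character" heuristic and not the full Unicode bidirectional algorithm, and the three spellings of "Azerbaijan" are my own sample strings.

```python
import unicodedata

def base_direction(text, default="ltr"):
    """Guess a base direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right (Latin, Cyrillic, ...)
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic, ...)
            return "rtl"
    return default                 # only digits, spaces, punctuation, or empty

samples = {
    "Latin": "Az\u0259rbaycan",
    "Cyrillic": "\u0410\u0437\u04d9\u0440\u0431\u0430\u0458\u04b9\u0430\u043d",
    "Arabic": "\u0622\u0630\u0631\u0628\u0627\u06cc\u062c\u0627\u0646",
}
for script, word in samples.items():
    print(script, base_direction(word))
```

Same language, three scripts, and only the Arabic-script spelling comes back as right-to-left.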

Which script should I use?

If a language can be written in more than one script, which script should a web designer or localizer use, or should the text be provided in all scripts?

The answer will depend on your target audience. The script may change for different countries or regions. The script may also change by legislation or with changes in government policy. For example, to reach the Azeri-speaking population in Iran, you would use Arabic script. From the late 1930s, Cyrillic was the script of choice in Azerbaijan itself and became policy in 1940. Due to the fall of the Soviet Union, beginning in 1991 a gradual switch to Latin occurred, becoming mandatory for official uses in 2001. However, for your target audience and unofficial uses, you might want to use Cyrillic for older audiences and Latin for younger audiences, and most likely both to reach the general Azerbaijani population. If you want to reach all Azeri speakers, you would use all 3 scripts. (Note that there might be terminology and other differences among Azeri speakers in different countries, just as there are differences between English or French speakers in different countries.)

You also should be aware that your choice of script may have political, religious, demographic or cultural overtones. In countries where the language of higher learning was Russian, Cyrillic will be used by educated people. Latin is associated with Pan-Turkic movements, and more generally can indicate Western-tending movements. Arabic script has associations with Islamist movements.

More generally, just as you research which languages are required to serve different cultures, you may need to investigate the correct script or scripts to use. There are suggestions in the Directionality of Commonly Requested Languages Table below.


Pasted from <http://www.i18nguy.com/temp/rtl.html>
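
One way to carry the Azeri advice into code is to tag content with a language tag that includes a script subtag (az-Latn, az-Cyrl, az-Arab). The sketch below is purely illustrative Python 3: the lookup table and the azeri_tags helper are hypothetical, and the mapping just restates the quoted guidance rather than any authoritative locale data.

```python
# Hypothetical lookup: (region, audience) -> language tags with a script subtag.
AZERI_TAGS = {
    ("IR", "general"): ["az-Arab"],             # Iran: Arabic script
    ("AZ", "official"): ["az-Latn"],            # Latin mandatory for official use
    ("AZ", "older"): ["az-Cyrl"],               # older audiences: Cyrillic
    ("AZ", "younger"): ["az-Latn"],             # younger audiences: Latin
    ("AZ", "general"): ["az-Latn", "az-Cyrl"],  # general population: both
}
ALL_SCRIPTS = ["az-Latn", "az-Cyrl", "az-Arab"]

def azeri_tags(region=None, audience="general"):
    """Return the script-tagged variants to publish for a target audience."""
    if region is None:
        # Reaching all Azeri speakers means providing all three scripts.
        return list(ALL_SCRIPTS)
    # Unknown combinations fall back to all scripts rather than guessing.
    return list(AZERI_TAGS.get((region, audience), ALL_SCRIPTS))

print(azeri_tags("IR"))
print(azeri_tags("AZ", "older"))
print(azeri_tags())
```

Only the az-Arab variant would then need right-to-left layout; the other two are laid out left-to-right.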