Spam étranger et les spam filtres, partie 4 - Single byte character sets
In my last post, I spoke about the problem of character encoding. The ASCII character set works fine in North America (excluding Quebec, Mexico and Miami) but as soon as you leave the continent, you start running into all sorts of weird characters that ASCII doesn't know about. So what do we do? Well, the answer is obvious - you invent another character set that does cover these languages.
We'll start off by going to western Europe. Western Europeans (and Americans) think they rule the world so they came up with a character set to cover the common characters that they use in their various alphabets - Spanish, German, Italian, English, and so forth. These include the following: À, Ä, à, ä, ç, è, é, ü, ß, etc.
One such character encoding is ISO-8859-1, less formally known as Latin-1. It was originally developed by the ISO and was later maintained jointly by the ISO and the IEC. It consists of 191 characters from the Latin script. This encoding is used throughout the Americas, Western Europe, Oceania, and much of Africa, and it is also commonly used in most standard romanizations of East Asian languages. Dutch, Estonian, French and Finnish have near-complete coverage.
Another common character set of the Latin alphabet is Windows-1252 (also known as WinLatin1), used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages.
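To make the relationship between these two encodings concrete, here is a minimal Python sketch. It relies on the fact that ISO-8859-1 and Windows-1252 agree on the accented letters but assign different meanings to the byte range 0x80-0x9F. The variable names are mine, chosen only for illustration.

```python
# Accented Latin letters sit in the 0xA0-0xFF range, which both
# encodings define identically.
accented = "é".encode("iso-8859-1")  # a single byte, 0xE9
same_in_both = accented.decode("iso-8859-1") == accented.decode("windows-1252")

# Bytes in the 0x80-0x9F range are where the two part ways:
# ISO-8859-1 treats them as control characters, while Windows-1252
# uses them for printable characters such as the euro sign and curly quotes.
raw = b"\x80\x93\x94"
as_latin1 = raw.decode("iso-8859-1")    # three invisible control characters
as_cp1252 = raw.decode("windows-1252")  # euro sign, left and right curly quotes

print(same_in_both)
print(repr(as_latin1))
print(repr(as_cp1252))
```

This is one reason a message labeled ISO-8859-1 but actually written on Windows can show up with missing or odd punctuation.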
The character encoding of an email is specified either in the top-level message headers or in the headers of an individual MIME part. For example, in a recent email I got in my Gmail account, it is specified in the MIME headers:
Content-Type: text/plain; charset=ISO-8859-1
My email client sees that the character set is ISO-8859-1, so it looks up each byte value of the message in the ISO-8859-1 table and displays the character assigned to that number (more on this in a future post). Similarly, in a recent spam message:
------=SPLITOR00A_001_340918203D
Content-Type: text/html; charset="windows-1252"
Content-Transfer-Encoding: 7bit
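If you would rather not read raw headers by eye, the declared character set of each MIME part can be pulled out programmatically. The following is a rough sketch using Python's standard `email` package. The sample message, its addresses and its boundary string are made up for illustration and are not taken from the messages quoted above.

```python
from email import message_from_string

# A made-up two-part message; only the Content-Type lines mirror
# the kinds of headers shown in this post.
raw_message = (
    "From: sender@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "plain body\n"
    "--BOUNDARY\n"
    'Content-Type: text/html; charset="windows-1252"\n'
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "<p>html body</p>\n"
    "--BOUNDARY--\n"
)

msg = message_from_string(raw_message)

# Walk every part and record its content type and declared charset.
# get_content_charset() returns the charset name in lowercase,
# or None if the part does not declare one.
charsets = [
    (part.get_content_type(), part.get_content_charset())
    for part in msg.walk()
    if not part.is_multipart()
]

for content_type, charset in charsets:
    print(content_type, charset)
```

The same `get_content_charset()` call works on a single-part message, where the character set is declared in the top-level headers instead.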
You can also find the character set of a message in the top-level message headers:
Content-Type: text/html; charset="us-ascii"
From here, we see that the charset is ASCII. Your email client will use this to interpret the characters in the message. Other common character encodings include the following:
- ISO 8859-1 Western Europe
- ISO 8859-2 Western and Central Europe
- ISO 8859-3 Western Europe and South European (Turkish, Maltese plus Esperanto)
- ISO 8859-4 Western Europe and Baltic countries (Lithuanian, Estonian and Sami)
- ISO 8859-5 Cyrillic alphabet
- ISO 8859-6 Arabic
- ISO 8859-7 Greek
- ISO 8859-8 Hebrew
- ISO 8859-9 Western Europe with amended Turkish character set
- ISO 8859-10 Western Europe with rationalised character set for Nordic languages, including complete Icelandic set.
- ISO 8859-11 Thai
- ISO 8859-13 Baltic languages plus Polish
- ISO 8859-14 Celtic languages ( Irish Gaelic, Scottish, Welsh )
- ISO 8859-15 Added the Euro sign and other rationalisations to ISO 8859-1
MS-Windows character sets
- Windows-1250 for Central European languages that use Latin script (Polish, Czech, Slovak, Hungarian, Slovene, Serbian, Croatian, Romanian and Albanian)
- Windows-1251 for Cyrillic alphabets
- Windows-1252 for Western languages
- Windows-1253 for Greek
- Windows-1254 for Turkish
- Windows-1255 for Hebrew
- Windows-1256 for Arabic
- Windows-1257 for Baltic languages
- Windows-1258 for Vietnamese
Other character sets
- KOI8-R, KOI8-U, KOI7
- Shift_JIS (Microsoft Code page 932 is a dialect of Shift_JIS)
- GB 2312
- GB 18030
- Taiwan Big5
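To see why the declared character set matters so much, consider what happens when one and the same byte is interpreted under several of the single-byte encodings listed above. This is a small Python sketch. The choice of the byte 0xE0 and the particular encodings is arbitrary, made only for illustration.

```python
# One byte, five different single-byte character sets.
raw = b"\xe0"

encodings = ("iso-8859-1", "iso-8859-5", "iso-8859-7", "windows-1251", "koi8-r")

# Decode the same byte under each encoding and keep the results.
decoded = {name: raw.decode(name) for name in encodings}

for name, char in decoded.items():
    # Print the character together with its Unicode code point.
    print(f"{name:13} {char}  U+{ord(char):04X}")
```

Each encoding turns the byte into a different letter: a Latin letter with an accent in one, a Greek letter in another, and three different Cyrillic letters in the rest. If the client guesses the wrong table, the message turns into gibberish.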
One thing I commonly do is open up the message and look at the declared character encoding to get an idea of what language the message is written in. Windows-1254 is Turkish. KOI8-R is the most common Russian encoding, followed by Windows-1251. GB 2312 is the most common encoding for Chinese, and ISO-2022-JP is the most common for Japanese.
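That rule of thumb is easy to write down as a lookup table. The sketch below is only an illustration of the idea, not how any particular spam filter works. The table `LIKELY_LANGUAGE` and the function `guess_language` are hypothetical names, and the table contains only the pairings mentioned in this post.

```python
# Hypothetical lookup table: declared charset -> language it usually implies.
# Only the pairings mentioned in the text are included.
LIKELY_LANGUAGE = {
    "windows-1254": "Turkish",
    "koi8-r": "Russian",
    "windows-1251": "Russian",
    "gb2312": "Chinese",
    "iso-2022-jp": "Japanese",
}


def guess_language(charset):
    """Return a best-guess language for a declared charset, or 'unknown'."""
    if not charset:
        return "unknown"
    # Header values may be quoted and use any letter case.
    normalized = charset.strip().strip('"').lower()
    return LIKELY_LANGUAGE.get(normalized, "unknown")


print(guess_language('"KOI8-R"'))
print(guess_language("Windows-1254"))
print(guess_language("us-ascii"))
```

A charset is only a hint, of course. Windows-1251, for instance, covers other Cyrillic-script languages besides Russian, and a Western charset such as ISO-8859-1 tells you very little about which language is in use.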
But this is not the end of the story for encoding. There is far more. In my next post, we will take a look at multi-byte character encoding.