Thursday, January 31, 2008

The most common Unicode-processing bug

The most common Unicode-processing bug is as pervasive as it is trivial: UTF-8 is mistaken for some 8-bit repertoire and encoding, or the other way around. Most commonly the other encoding is something like Windows-1252, a poorly specified superset of ISO 8859-1 (Latin 1). This, I'm fairly sure, is the source of all of those "funny Unicode characters".

For example, on the Unicode mailing list digest as read in GMail, I received a letter a few days ago which opened, "I don’t understand..." I'm not sure where in the transmission line this happened, but I'm sure that nobody typed those characters. More likely, they used a single right quotation mark, U+2019 (or rather their email program did), which is encoded in UTF-8 as 0010 0000 0001 1001 ==> 11100010 10000000 10011001 = 0xE2 0x80 0x99, three bytes which read as ’ in Windows-1252.
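
A minimal sketch of that round trip in Python (my own illustration, not part of the original message):

    # Encode U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8...
    text = "I don\u2019t understand"
    raw = text.encode("utf-8")         # contains 0xE2 0x80 0x99
    # ...then misread the bytes as Windows-1252, producing the mojibake:
    print(raw.decode("windows-1252"))  # I don’t understand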

(Note: the giveaway that this is Windows-1252 and not Latin 1 is the Euro sign, which is a relatively recent invention and not encoded in any official ISO 8859 that people actually use*. In all the ISO 8859s, the byte 0x80 is reserved as a kind of fake control character.)
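
A quick check in Python makes the distinction visible (again my own illustration):

    # Windows-1252 maps byte 0x80 to the Euro sign;
    # Latin-1 decodes it as an unprintable C1 control character.
    print(b"\x80".decode("windows-1252"))   # €
    print(repr(b"\x80".decode("latin-1")))  # '\x80'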

Here's how it might have happened: the email program declared that everything was in Windows-1252 even though it was actually UTF-8, and the mailing list server dutifully decoded it as declared, mapping each byte to the corresponding Unicode code point. Alternatively, maybe no encoding was specified, and since Windows-1252 is a superset of Latin 1, which in turn is a superset of ASCII, it was used as a "maximally tolerant" assumed encoding where ASCII would be the normal default. Alternatively, maybe the mailing list itself failed to specify what encoding it was using, and GMail made a similar mistake. This last possibility is the most likely, as things consistently appear this way for me when reading the Unicode mailing list.

This bug is at once easy to fix and impossible. Because compatibility with legacy applications needs to be maintained, it's difficult to change the default encoding of things. So, everywhere it's possible, things need to say explicitly what their encoding is, and applications need to process this information properly.

Still, do we care more about maintaining compatibility with legacy applications or getting things working today? In practice, almost everything is done in UTF-8. So it should be fine to just assume that encodings are always UTF-8, legacy programs be damned, to get maximum utility out of new things.

Well, as it turns out, that's not always the best thing to do. Someone recently told me that he suspected a Unicode bug in Amarok: it wasn't correctly loading his songs with German titles, though it had worked correctly on a previous installation. Instead, I think the bug was in incompatible default encoding settings in GNU tar or the ISO image format. The song titles contained accented letters, and the files had been transferred onto a different computer. After the transfer, those letters appeared as single question marks when viewed in Konqueror, and Amarok refused to open the files, giving a cryptic error message.

UTF-8 is fault-tolerant, and a good decoder will replace malformed octets with a question mark and move on. This is probably exactly what happened: the title of a song contained a character, say ö, which is encoded in Latin 1 as 0xF6, followed by something in ASCII. The song title was encoded in Latin 1 when the file system expected UTF-8. The UTF-8 decoder in Konqueror replaced the 0xF6 with a � (replacement character, U+FFFD), but Amarok threw an error for some reason.
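
Here's roughly what that failure looks like in Python (a sketch; the file name is hypothetical):

    # A title with ö, stored as Latin-1 (0xF6) on a file system
    # whose tools expect UTF-8:
    raw = "König.mp3".encode("latin-1")           # contains the lone byte 0xF6
    # A tolerant decoder substitutes U+FFFD and moves on:
    print(raw.decode("utf-8", errors="replace"))  # K�nig.mp3
    # A strict decoder raises instead, which may be what Amarok hit:
    # raw.decode("utf-8")                         # UnicodeDecodeError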

So, for all of this, there's no good solution but to be more careful and mindful of different encodings. In most cases, you can use a heuristic to determine whether something is in Windows-1252 or UTF-8, but this can never be completely accurate, especially if other encodings are considered at the same time.
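
A minimal sketch of such a heuristic in Python, assuming Windows-1252 and UTF-8 are the only candidates (which is exactly the simplification that makes it fallible):

    def guess_decode(raw: bytes) -> str:
        """Try strict UTF-8 first; if the bytes are malformed,
        fall back to Windows-1252. A heuristic, not a guarantee."""
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("windows-1252")

    print(guess_decode(b"don\xe2\x80\x99t"))  # valid UTF-8: don’t
    print(guess_decode(b"don\x92t"))          # falls back: don’t (0x92 in Windows-1252)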


* ISO 8859-15 and -16 actually do have the Euro sign, but I really doubt many people use them: they were released around the turn of the millennium, when Unicode was already in broad use, and they come with the difficulty of telling them apart from the other ISO 8859s.

Update: An anonymous commenter pointed out that it's not too hard to use a heuristic to differentiate between Windows-1252 and UTF-8.

4 comments:

Anonymous said...

The € sign is included in iso-8859-15, I think. They made new copies of the character sets to accommodate €.

Anonymous said...

Due to the nature of UTF-8, it is highly unlikely that a byte sequence that decodes as valid UTF-8 and has bytes with the upper bit "on" is anything other than UTF-8. In other words, UTF-8 is easy to recognize, and for texts that are not trivially short the chance of a false positive is low. So decoders should try to read as UTF-8 and fall back to ISO-8859-x or Windows-125x if it's not valid UTF-8.

Unknown said...

To the first anonymous comment: yes, I know that, but I don't think anyone uses ISO 8859-15 since it came out so recently, after Unicode had already become popular. To the second anonymous, you're probably right; I'll change the last sentence of the post to reflect this.

Anonymous said...

MS Outlook Express doesn't bother declaring the encoding used at all (it doesn't add a charset parameter to the Content-Type header). It certainly happens on newsgroups, and I wouldn't be surprised if it was equally crappy in sending regular e-mail too.