Wednesday, November 21, 2007

When Unicode doesn't just work itself out

So, in my job as a student worker at Carleton College, I'm working on programming the website. Just a week or so ago, I basically completed a classified housing ads module, which I'm fairly happy about, even if it was pretty trivial. They've made their own in-house CMS, now open-source, written in PHP and called Reason. Being a very respectful employee, I won't allude to any possible irony in this name. Not at all.

Anyway, a couple of days ago, one of the non-student workers, Margaret (the names of those involved have been swapped to protect their identities), came in with a question: can anybody who's not doing something important come up with a regular expression to detect bad UTF-8? Since I was basically unoccupied, reading a book from the library in preparation for my next super-secret project, and since I know a little about Unicode, I volunteered.

I sat down with Margaret as she explained the problem: somehow, a bunch of bad Unicode had gotten into Carleton's database. It isn't a bug we can reproduce, but a number of times in the past, question marks, really the replacement character U+FFFD (�), have appeared in the content. Whenever one comes up, we delete it by hand, but we need a regular expression to find all of the bad parts at once.

When this happens, the underlying problem is malformed UTF-8, UTF-8 being the encoding Reason uses internally. Reason takes the Unicode-blind approach I discussed earlier, and here that approach failed. Most likely the cause is strange, invalid input from browsers, which can't easily be replicated except by constructing queries artificially. All good web developers know that you can't trust the browser (a fact I internalized only recently), and apparently that extends all the way down to the validity of the UTF-8 itself.

Margaret's idea was to look for a byte with its most significant bit set surrounded by bytes with that bit clear: in UTF-8, high-bit bytes only ever appear in runs of two or more, so an isolated one is invalid. But there are plenty of other ways UTF-8 can be malformed (truncated sequences, stray continuation bytes, overlong encodings), so checking for that one case isn't enough. With a bit of cursory Googling, we found phputf8, a pure-PHP library with functions for validating, cleaning, encoding and decoding UTF-8. We could use it to clean all input, ensuring that everything stored is valid UTF-8 and the � never shows up. Doing that on every request would be a bit computationally intensive, though.
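
Out of curiosity, here's roughly what a full check looks like. This is just a sketch (the function names are mine, not Reason's or phputf8's): the regular expression is the well-known byte-level UTF-8 validation pattern from the W3C's internationalization FAQ, and the cleanup function uses iconv's //IGNORE mode to drop invalid sequences.

    <?php
    // Validate a byte string as UTF-8 with a byte-range regular
    // expression (no /u modifier; we're matching raw bytes).
    function is_valid_utf8($bytes)
    {
        $pattern = '/\A(?:
              [\x09\x0A\x0D\x20-\x7E]            # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x';
        // Note: PCRE can hit backtracking limits on very long strings;
        // mb_check_encoding($bytes, 'UTF-8') is sturdier if available.
        return preg_match($pattern, $bytes) === 1;
    }

    // Drop invalid sequences, keeping the valid remainder.
    function clean_utf8($bytes)
    {
        return iconv('UTF-8', 'UTF-8//IGNORE', $bytes);
    }

    var_dump(is_valid_utf8("caf\xC3\xA9")); // true:  é is C3 A9
    var_dump(is_valid_utf8("caf\xE9"));     // false: bare Latin-1 é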

Another approach is to take advantage of the database itself. Reason runs on MySQL, but what character encoding does it use? Since Margaret didn't know, I asked Eric, our omnipotent server admin. Carleton actually has a single MySQL server instance hosting many different databases. Some of those use UTF-8, Eric told me, but the main Reason database is declared as ISO Latin 1 as far as MySQL is concerned. The content is, in truth, encoded in UTF-8: when Reason was started, MySQL was at version 3, which had no Unicode support at all, and we've kept the same database ever since. Eric guided me through the migration in a wonderfully hackish way: serialize (dump) the whole database, replace every instance of "latin1" in that serialization with "utf8", then load it back in.
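
In script form, the hack looks something like the sketch below. The credentials and database name are made up, and note how blunt the substitution is: it would also rewrite any stored data that happens to contain the literal string "latin1", which is part of what makes it a hack.

    <?php
    // Sketch of the dump-edit-reload trick (hypothetical credentials).
    // Assumes mysqldump and mysql are on the PATH.
    $db   = escapeshellarg('reason');
    $user = escapeshellarg('admin');
    $pass = escapeshellarg('secret');

    // 1. Serialize the whole database. Dumping as latin1 keeps the
    //    stored bytes untouched instead of "converting" them.
    $dump = shell_exec("mysqldump --default-character-set=latin1 " .
                       "-u$user -p$pass $db");

    // 2. Relabel: the bytes are already UTF-8, so we change only what
    //    MySQL *calls* them, not the data itself. (For a big database
    //    you'd stream this instead of holding the dump in memory.)
    $dump = str_replace('latin1', 'utf8', $dump);

    // 3. Deserialize: load the edited dump back in.
    file_put_contents('/tmp/reason-utf8.sql', $dump);
    shell_exec("mysql -u$user -p$pass $db < /tmp/reason-utf8.sql");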

There are still a couple of potential problems. For one, it's possible there's already invalid UTF-8 sitting in the database that MySQL won't catch. The best fix would be a one-time cleanup with something like phputf8 before the migration; otherwise things are subtly corrupted from the start. After that, it shouldn't be possible to get invalid UTF-8 into the database at all. The second problem is that some Unicode issues remain even with valid UTF-8, most notably normalization. We could trust the browser to hand us text already in NFC, but that's generally a bad idea; it'd be better to normalize all text ourselves, and MediaWiki has code for that.
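
Normalization matters because the same text can arrive as different code point sequences: "é" can be the single code point U+00E9, or "e" followed by the combining acute U+0301, and the two won't compare equal byte for byte. A sketch, assuming PHP's intl extension is available (MediaWiki's pure-PHP normalization code does the same job without it):

    <?php
    // Canonicalize text to NFC before storing or comparing it.
    $decomposed = "e\xCC\x81";           // 'e' + U+0301 combining acute
    $composed   = "\xC3\xA9";            // U+00E9, precomposed 'é'

    var_dump($decomposed === $composed); // false: different bytes

    $a = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    $b = Normalizer::normalize($composed,   Normalizer::FORM_C);
    var_dump($a === $b);                 // true: both are now U+00E9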

Once the database considers all strings UTF-8, things should go much more smoothly. Not only will invalid UTF-8 be rejected, but collation will work properly, since MySQL supports Unicode collation consistent with the Unicode Collation Algorithm (UCA) and the Default Unicode Collation Element Table (DUCET). There's a caveat, though: MySQL 5.0 and earlier only support up to three octets per UTF-8 sequence. In 2001, Unicode 3.1 introduced characters requiring a fourth octet, those in the supplementary planes, used for obscure Han characters, mathematical symbols, old scripts and the like. MySQL 5.0 doesn't allow any of these in UTF-8 mode, though the Unicode-blind approach happily does.
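
If you're stuck with three-octet utf8, it's worth catching supplementary-plane characters before they reach MySQL. A sketch (the function names are mine):

    <?php
    // Detect code points outside the Basic Multilingual Plane, which a
    // three-octet utf8 column can't store. The /u modifier requires the
    // input to be valid UTF-8 already.
    function contains_supplementary($utf8)
    {
        return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8) === 1;
    }

    // Equivalent byte-level test, no /u needed: lead bytes F0-F4 start
    // four-octet sequences.
    function contains_four_octet($bytes)
    {
        return preg_match('/[\xF0-\xF4]/', $bytes) === 1;
    }

    var_dump(contains_supplementary("plain text"));       // false
    var_dump(contains_supplementary("\xF0\x9D\x90\x80")); // true: U+1D400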

Nevertheless, it's good to be aware of Unicode. The fact that you can sometimes get away without noticing Unicode and its encodings is not a reason to ignore it.

Update: Alexander Barkov, a MySQL developer, pointed out to me that MySQL 6.0 supports three Unicode encodings, utf8, utf16 and utf32, including code points outside the BMP. Previous versions of MySQL offered only the three-octet utf8, or the fixed-width ucs2, which is limited to the BMP. I'm very happy about this, but of course it takes some time to upgrade versions.
