unicode.collation vocab in the working version of Factor in git. Here's what I figured out:
- Weird format of UnicodeData.txt It's not documented anywhere that I can find, but the UnicodeData.txt resource file has a weird range format for specifying certain properties, including character classes, which are used in collation. It looks just like two regular lines, but they have special character names that apparently need to be parsed. For example, the pair of lines
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
means that all of the characters in the range U+100000 to U+10FFFD have the category Co, the combining class 0, etc.
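A minimal sketch, in Python rather than Factor, of how these First/Last pairs can be expanded (the function name and return shape are my own, not Factor's):

```python
def parse_unicode_data(lines):
    """Map each code point to its property fields, expanding
    <..., First> / <..., Last> range pairs to cover every code
    point in between."""
    props = {}
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            # opens a range; properties come from this line
            range_start = code
        elif name.endswith(", Last>"):
            # closes the range; every code point shares the properties
            for cp in range(range_start, code + 1):
                props[cp] = fields[2:]
            range_start = None
        else:
            props[code] = fields[2:]
    return props
```

After parsing, a lookup anywhere inside the range (say U+100ABC) returns the same category and combining class as the range's endpoints.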
- My normalization bugs Working on this uncovered a bunch of bugs in older code, including that my conjoining Jamo behavior inserted nonexistent terminator characters.
- Triple contractions The UCA specifies that collation graphemes should be found by checking whether an adjacent character or non-blocked combining character forms a contraction with the previous characters. But this incremental approach fails for four of the contractions listed in the DUCET, which consist of three elements, not two, whose first two elements don't themselves form a contraction. So a simple identity contraction for the first two elements of each of those must be added.
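The fix for those four contractions can be sketched like this (a hypothetical helper, assuming a table mapping tuples of code points to lists of collation elements; not Factor's actual representation):

```python
def add_prefix_entries(table):
    """Incremental matching can only extend a contraction one character
    at a time, so a three-element contraction whose two-element prefix
    is missing from the table can never be reached. Add an identity
    entry for each missing prefix: the prefix simply collates as its
    parts in sequence."""
    for key in list(table):
        prefix = key[:-1]
        while len(prefix) >= 2 and prefix not in table:
            table[prefix] = [e for cp in prefix for e in table[(cp,)]]
            prefix = prefix[:-1]
    return table
```

With the prefix entry in place, the incremental matcher can first match the two-element prefix and then extend it to the full three-element contraction.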
- Combining character contractions Apparently, two combining marks can form a contraction. A straight reading of the UCA wouldn't predict this, but not all of the UCA tests pass unless you check for non-adjacent combining marks being in a contraction together, without a noncombining mark to start it off.
And here's what I still have to do:
- Korean stuff Because of some disagreement with the ISO people, the DUCET doesn't actually decide the best way to sort Korean. Instead, it provides three methods, all of which require modifying the table. I don't really understand the issue right now.
- Tailoring for locales Actually, heh, the default algorithm is inaccurate for any specific locale you might be in. And, for human interfaces, it's actually pretty important that the sort order corresponds to expectations. So, if you want to do sorting that's correct, you have to modify the data. Unfortunately, the standard doesn't go into the specific algorithms for tailoring the table, though there is data available through the Common Locale Data Repository (CLDR).
- Efficiency My implementation is inefficient in both time and space, because I paid absolutely no attention to either; solving the basic problem is hard enough (for me). Collation keys should be made shorter, and they should be built in fewer passes over the string.
Update: Here's an overview of the words that are defined in the vocabulary. It's about the minimum that any UCA implementation should have, in my opinion:
sort-strings This word takes a sequence of strings and sorts them according to the UCA, using code point order as a tie-breaker.
collation-key This takes a string and gives a representation of the collation key, which can be compared with <=>.
string<=> This word takes two strings and compares them using the UCA with the DUCET, using code point order as a tie-breaker.
primary= This checks whether the first level of collation is identical. This is the least specific kind of equality test. In Latin script, it can be understood as ignoring case, punctuation and accent marks.
secondary= This checks whether the first two levels of collation are equal. For Latin script, this means accent marks are significant again.
tertiary= Along the same lines as secondary=, but case is significant.
quaternary= This is a little less typical (and definitely non-essential, unlike the other things), and it's my own nomenclature, but it makes punctuation significant again, while still leaving out things like null bytes and Hebrew vowel marks, which mean absolutely nothing in collation.
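In Python rather than Factor, here's a rough sketch of how leveled collation keys make these comparisons work (the weight values and function names are made up for illustration; real DUCET weights and key formats differ):

```python
def collation_key(elements):
    """Build a sort key from (primary, secondary, tertiary) weight
    triples, one per collation element. Zero weights are omitted and
    levels are separated by a 0, which sorts below any real weight,
    so plain lexicographic comparison of keys implements UCA order."""
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator
        key.extend(e[level] for e in elements if e[level])
    return key

def primary_eq(key_a, key_b):
    """Compare only up to the first separator: differences in accents
    (secondary) and case (tertiary) are ignored."""
    return key_a[:key_a.index(0)] == key_b[:key_b.index(0)]
```

Two strings differing only in an accent get keys that agree before the first separator (so primary_eq holds) but differ afterward, which is exactly the layering that primary=, secondary=, and tertiary= expose.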