Monday, February 23, 2009

Regular languages with large alphabets

In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range /[\0-\u0f423f]/, I don't really want to compile a DFA with a million transition edges!

A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters rather than individual letters. Then, things like ranges and character classes (e.g. \p{Lower}) are represented as sets which annotate transition edges.
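
For concreteness, here's a minimal sketch in Python (the names and representation are illustrative, not the Factor library's actual code): a character set stored as sorted, disjoint codepoint intervals, so a range covering a million characters annotates one edge as a single interval.

    class CharSet:
        """A set of characters as sorted, disjoint (lo, hi) inclusive
        codepoint intervals: \0-\u0f423f is one interval, not a
        million edges."""
        def __init__(self, intervals):
            self.intervals = sorted(intervals)

        def contains(self, ch):
            cp = ord(ch)
            return any(lo <= cp <= hi for lo, hi in self.intervals)

    # An NFA edge is then (source, label, target), with a CharSet label.
    latin_letters = CharSet([(0x41, 0x5A), (0x61, 0x7A)])  # A-Z and a-z only
    edge = (0, latin_letters, 1)                           # states are made up
    assert latin_letters.contains("q")
    assert not latin_letters.contains("π")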

It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA--running all possible states in parallel--still works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.
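
Here's a small Python sketch of that parallel simulation with set-labeled edges; the NFA below is hand-built for /ab|cd/, and epsilon transitions are omitted for brevity:

    def run_nfa(edges, start, accept, text):
        """edges: list of (src, label, dst), where label is a set of chars."""
        states = {start}
        for ch in text:
            # Advance every live state along every edge whose label
            # contains the current character.
            states = {dst for (src, label, dst) in edges
                      if src in states and ch in label}
            if not states:
                return False
        return accept in states

    edges = [(0, {"a"}, 1), (1, {"b"}, 2),   # the "ab" branch
             (0, {"c"}, 3), (3, {"d"}, 2)]   # the "cd" branch
    assert run_nfa(edges, 0, 2, "ab")
    assert run_nfa(edges, 0, 2, "cd")
    assert not run_nfa(edges, 0, 2, "ad")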

Say you have the regexp /\p{InLatin}b|\p{Lower}c/. This matches strings like "ab", "ac", "Ab", "πc" but not "Ac" or "πb". The simple textbook algorithm for regular expressions would have me expand /\p{InLatin}/ out to /a|A|b|B|.../, and expand /\p{Lower}/ to /a|π|.../. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.

What we actually want to do is change the two outward transitions from the start state--transitions over the sets InLatin and Lower--to transitions over InLatin ∩ Lower, InLatin - Lower and Lower - InLatin. Since these are disjoint, there's no ambiguity about which one to take. In general, for a state with n outward transitions, you have to look at 2ⁿ possibilities: for each subset of the transition labels, you make a transition over the characters which are in all of the labels in that subset but not in any of the others.
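
A sketch of that refinement in Python, using plain sets of characters as stand-in labels (a real Unicode implementation would operate on interval sets); the chosen subset also tells you which of the original transitions' targets each disjoint piece leads to:

    from itertools import combinations

    def disjoint_refinement(labels):
        """Split overlapping labels (sets of chars) into disjoint pieces:
        for each nonempty subset of the labels, keep the characters in
        every label of that subset and in none of the others."""
        pieces = []
        n = len(labels)
        for size in range(1, n + 1):
            for chosen in combinations(range(n), size):
                inside = set.intersection(*(labels[i] for i in chosen))
                rest = set().union(*(labels[i] for i in range(n)
                                     if i not in chosen))
                piece = inside - rest
                if piece:  # drop empty combinations immediately
                    pieces.append((chosen, piece))
        return pieces

    latin = {"a", "b", "A", "B"}   # toy stand-in for InLatin
    lower = {"a", "b", "π"}        # toy stand-in for Lower
    for chosen, piece in disjoint_refinement([latin, lower]):
        print(chosen, sorted(piece))
    # (0,) ['A', 'B']      -- InLatin - Lower
    # (1,) ['π']           -- Lower - InLatin
    # (0, 1) ['a', 'b']    -- InLatin ∩ Lower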

Implemented naively, this would make the size of DFAs blow up. For the regexp /ab|cd/, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just what it would be without the system, but the DFA for /ab|\p{Lower}c/ has two transitions from the start state: one over a, and one over lowercase letters that aren't a.
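
Continuing the sketch, once empty pieces are dropped, the refinement of a and c is just a and c again, while a against a Lower-like set splits exactly as described (toy sets; printed set order may vary):

    def pieces(x, y):
        # Keep only the nonempty parts of the two-label refinement.
        return [p for p in (x - y, y - x, x & y) if p]

    a, c = {"a"}, {"c"}
    lower = {"a", "b", "π"}               # toy stand-in for \p{Lower}

    print(pieces(a, c))      # [{'a'}, {'c'}]       -- a - c and c - a; no blowup
    print(pieces(a, lower))  # [{'b', 'π'}, {'a'}]  -- Lower - a, and a itself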

I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the regexp branch.

PS. When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism into a finite alphabet, together with a regular language over that finite alphabet, such that a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they can compare characters for equality in certain contexts. But that's completely different.
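
To make the proof idea concrete, here's a toy Python sketch: any one expression mentions only finitely many label sets, so mapping each character to its membership signature among those sets gives a finite image alphabet that preserves matching.

    def signature(ch, label_sets):
        # A character's "letter" in the finite alphabet: which of the
        # finitely many labels it belongs to (at most 2**k signatures).
        return tuple(ch in s for s in label_sets)

    latin = {"a", "b", "A", "B"}          # toy stand-ins again
    lower = {"a", "b", "π"}

    def h(ch):
        return signature(ch, [latin, lower])

    print(h("a"), h("A"), h("π"), h("?"))
    # (True, True) (True, False) (False, True) (False, False)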
