Monday, August 27, 2007

Symbolic constants and enumeration

For grapheme classes in Unicode, I needed a bunch of symbolic constants, numbered 0..n, to describe a number of things. (More on graphemes in the next post.) At first, I used regular old words corresponding to regular old numbers, in a typical Factor idiom:

: Any 0 ;
: Extend 1 ;
: L 2 ;
! ...

But later, while constructing the table (discussed in more detail below), I needed to write arrays like { Control CR LF }. In Factor, this doesn't work the way you might think it does: it's basically equivalent to '(Control CR LF) in Lisp (a list of the symbols), when what we actually want is `(,Control ,CR ,LF) (a list of the values that the symbols refer to).

How can we get that quasiquotation in Factor? The most obvious way is using make: [ Control , CR , LF , ] { } make. But that solution is pretty wordy, and not nearly fun enough. Here's another idea: what if the grapheme class words expanded directly, at parse time, to their values? This can be done by defining them as parsing words:

: Any 0 parsed ; parsing
: Extend 1 parsed ; parsing
! ...

But who wants to write this eight times? I don't! We can abstract it into a parsing word CONST: to do this for us:

: define-const ( word value -- )
    [ parsed ] curry dupd define-compound
    t "parsing" set-word-prop ;

: CONST:
    CREATE scan-word dup parsing?
    [ execute dup pop ] when define-const ; parsing
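
As a usage sketch, going only by the definition above: CONST: reads the name of the new word and then its value, so each hand-written definition collapses to one short line. The values are the ones from the first listing. Given the parser caveat in the note at the end of this post, assume these lines live in a different source file from CONST: itself.

```factor
! Each line defines a parsing word that appends its value at parse time
CONST: Any 0
CONST: Extend 1
CONST: L 2
! ...
```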

But this particular code follows a pattern similar to C's enum: the constants are simply numbered in order, starting from 0. Why not abstract this into our own ENUM:?

: define-enum ( words -- )
    dup length [ define-const ] 2each ;

: ENUM:
    ";" parse-tokens [ create-in ] map define-enum ; parsing

Going back to the original problem, the ENUM: parsing word allows me to write

ENUM: Any L V T Extend Control CR LF graphemes ;

to specify the grapheme classes, without having to care about which number they correspond to.
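
To make the numbering concrete: define-enum pairs each word with its position in the list, so, reading the definitions above, the ENUM: line should behave as if the following had been written by hand. This is a sketch of the expansion, not output captured from a listener.

```factor
: Any 0 parsed ; parsing
: L 1 parsed ; parsing
: V 2 parsed ; parsing
: T 3 parsed ; parsing
: Extend 4 parsed ; parsing
: Control 5 parsed ; parsing
: CR 6 parsed ; parsing
: LF 7 parsed ; parsing
: graphemes 8 parsed ; parsing

! With these in scope, the literal { Control CR LF } is read as { 5 6 7 }
```

Since graphemes comes last, it ends up holding the number of classes before it (8), presumably in the style of a trailing count member in a C enum.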

This solution isn't perfect. The problem is that it completely ruins homoiconic syntax for all uses of the constants. By "homoiconic syntax," I mean that the see word can print out the original source code, with the only real difference being whitespace. Because each constant is replaced by its number at parse time, see prints the number instead of the name. A macro using compiler transformations, which would preserve homoiconic syntax by using a different strategy, might be preferable. But I still wanted to share this little thing with you all.

Note: At the time I wrote this, this strategy made sense. But now, I'm thinking it'd be better to just go with a regular C-ENUM:, which should be renamed ENUM:. This is partly because of changes in the parser which make it more difficult to use parsing words defined in the same source file.
