Yannis Haralambous writes:

> Hi,
>
> you may have heard about the Omega extension of TeX. We are using
> 16-bit tables internally, so that in general our system is based on
> Unicode. We need an indexing utility compatible with this scheme.
> Would it be possible to upgrade xindy to be 16-bit compatible?
>
> Here is what we need:
>
> foo.idx files will contain characters in extended hexadecimal
> notation:
>
> ^^^^0123^^^^abcd^^^^0080 and so on
>
> It should be possible to write a merge/sort file of the type
>
> (merge-rule "^^^^41d8" "AØ")
>
> where AØ is the 16-bit character of hexadecimal value 41D8, or to
> have a notation like in the newest Perl (with the utf8 module):
>
> (merge-rule "^^^^41d8" "\x{41d8}")
>
> This means that there can (and will) be more than 256 letter groups,
> and that it also should be possible to define groups of groups
> (Latin entries, then Greek entries, then Cyrillic entries, and so
> on).
>
> Is this possible? If yes, in the short range? In the long range?

xindy is based on the CLISP implementation of Common Lisp. Additional
libraries for managing regular expressions (namely the GNU Rx library)
are used for the merge and sort rules. None of these components
directly supports 16-bit Unicode characters. One could, at least to
some extent, use the merge and sort rules to achieve the results you
need in an ad-hoc manner, though several problems might arise:

- Strings in all of the above systems are null-terminated, i.e., any
  Unicode characters of the form \x{yy00} and \x{00yy} cannot be
  handled properly.

- Merge and sort rules need to be 16-bit aligned for proper operation.
  Currently, alignment occurs only on 8-bit (character) boundaries.
  To give an example,

      (merge-rule "AØ" "aØ")

  (I don't know if that makes any sense at all) applied to the
  character stream "4AØ3" will result in a substitution which you
  probably don't want to happen; see the small byte-level sketch
  further below. One could circumvent this by applying "boundary
  characters", i.e., by encoding the above string differently, such
  as "4A Ø3 xy ...", but obviously you will then run into other
  problems.

- Another problem is the number of rules in the substitution database.
  The current solution will probably not scale well if several
  thousand substitution rules happen to be in the database; I can
  only expect that things will slow down significantly. There is an
  internal hash table for efficient encoding of substitutions, which
  would first need to be expanded from 8 to 16 bits. Further
  optimization might be needed.

- As you already mentioned, the letter groups must be expanded to 16
  bits as well.

To sum up the above considerations: I think there is a substantial
amount of work to do to extend xindy from 8 to 16 bits, because it
cuts across the inner workings of xindy's keyword handling at almost
all levels. A better approach might be to reconsider the whole model
of merge and sort rules and move to a more modular architecture that
models letters as objects. We discussed some of these aspects on this
list more than a year ago. Your requirements are actually further
arguments for rethinking the whole model of merging and sorting on a
character basis without higher-level concepts, which I consider vital
for future systems.

I personally will not be able to change xindy in the way needed, but
I'll provide any help I can to others who want to do so. I even think
there is a lot of potential for research work in this area (at least
more than enough for a computer science diploma thesis) in thinking
about more general frameworks for this kind of problem.
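To make the alignment issue concrete, here is a small stand-alone C
sketch (illustration only, not code from xindy, CLISP, or GNU Rx) of
how a byte-oriented matcher finds the two-byte pattern for "AØ"
(0x41 0xD8 in Latin-1) straddling a 16-bit character boundary in the
stream "4AØ3":

    #include <stdio.h>
    #include <string.h>

    /* Illustration only: a byte-oriented rule matcher (like an 8-bit
       regexp engine) sees the pattern "AØ" wherever the bytes 0x41
       0xD8 are adjacent, even when they belong to two different
       16-bit characters. */
    int main(void)
    {
        /* "4AØ3" as raw bytes; read pairwise as 16-bit characters
           these are (0x34,0x41) and (0xD8,0x33), so the character
           0x41D8 does not occur at all. */
        const unsigned char stream[]  = { 0x34, 0x41, 0xD8, 0x33 };
        const unsigned char pattern[] = { 0x41, 0xD8 };  /* "AØ" */
        size_t i;

        for (i = 0; i + sizeof pattern <= sizeof stream; i++) {
            if (memcmp(stream + i, pattern, sizeof pattern) == 0)
                printf("match at byte offset %lu: %s\n",
                       (unsigned long) i,
                       i % 2 == 0 ? "16-bit aligned (intended)"
                                  : "misaligned (unwanted)");
        }
        return 0;
    }

Running it reports a match at byte offset 1, i.e., exactly the
unwanted substitution across a character boundary described above; a
16-bit-aware engine would have to restrict matches to even offsets
or, better, operate on 16-bit units directly.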
One thing I've learned from the xindy project is that indexing is far
more complex than I ever thought, and that it is hard to find good
trade-offs for providing practical solutions to this problem.

You are welcome to further discussions...

Cheers,

Roger

--
======================================================================
Roger Kehr                           kehr@informatik.tu-darmstadt.de
Computer Science Department          Darmstadt University of Technology