I would like to start a discussion about the use of languages and
character set encodings in xindy.

Current situation:

1. Currently the language-specific information (sort rules) is tied to
   a specific encoding.
   Example: The sort rules for German assume that the Latin-1 encoding
   is used. If I have a raw index file in German in MS-DOS encoding, I
   can't use these sort rules without rewriting them in MS-DOS encoding.

2. The encoding of the raw index file is not known to xindy and has to
   be specified in the index style. Sorting will fail if the encoding
   assumed by the style and the actual encoding of the file do not
   match.

3. Mixing different languages in one index is problematic.
   Example: an index of names of persons of different nationalities.

My proposal:

1. Separate the encoding of the raw index file from the
   language-specific stuff with a 2-step process:

                          (a)                                 (b)
   input in encoding "XY" --> input-encoding-independent form --> sorted index

   where (b) is the language-dependent sorting process.

   For (a) I propose to use merge-rules (without :again), since those
   are applied by xindy at the *beginning* of the sorting process. The
   "front-end" (a) can easily be changed for different input encodings
   without affecting the rest of the process.

   For the intermediate, input-encoding-independent form, I propose to
   use Unicode, since it is designed to cover the characters of all the
   world's languages. (See also 4.)

2. The encoding of the raw index file should be specified in the file
   itself. Since this is a normal Lisp source file, it can contain any
   Lisp command.
   Example:
   ----
   (input-encoding "latin-1")
   (indexentry :tkey (("Kunstform") ("Gesang")) :locref "1")
   (indexentry :tkey (("Kunstform") ("Tanz")) :locref "1")
   ----
   Implementing this requires only a small patch to one of the Lisp
   sources of xindy, markup.lsp (see end of this message), and a useful
   definition of the command "input-encoding".
   With this scheme, any other information can be put into the raw file
   as well: the language, maybe, or even the whole index style.

3. The sorting rules should be written in such a way that different
   languages can be mixed. Of course, this might not work if there are
   rules which contradict each other. I'm not quite sure about the
   right way to do it; maybe there is no universal solution. I have the
   following ideas:

   For each language, there should be 2 sets of sort rules, called for
   example <language>-basic and <language>-other.

   *-basic rule sets describe everything that is *really* needed for
           processing this language (like the umlauts and sharp s in
           German).
   *-other rule sets describe everything that appears only occasionally
           in a text and is not so important (like "à" and "é" in
           German).

   An index with mixed languages is then processed by first applying
   the *-basic rule sets of all languages, and then applying all
   *-other rule sets. This should yield the best sorting result (I
   hope).

   I'd like to demonstrate the actual mixing of rules with an example:

   Language "AA" has this alphabet: a c e.
   Language "BB" has this alphabet: a b d.

   Sorting rules for "AA": a->1 c->3 e->5
   Sorting rules for "BB": a->1 b->2 d->4

   Sorting order:
        AA   BB   combined
   1    a    a    a
   2         b    b
   3    c         c
   4         d    d
   5    e         e

4. About Unicode: The best solution seems to be to use the 8-bit
   Unicode transformation format (UTF-8) for file input and sorting
   rules. UTF-8 has several properties which make it useful here:
   - It consists of 8-bit units (bytes).
   - Characters 0-127 are the same as in ASCII.
   - Characters of most European languages take no more than 2 bytes.
   - The byte sequence of one character never occurs inside the byte
     sequence of another character.
   One consequence of this: The Lisp core does not need to know
   anything about Unicode (even though the free CLISP does!); all it
   needs to know is how to process strings of 8-bit characters.

I hope all of this makes some sense, and I would really like to hear
your ideas!
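To make points 1, 3 and 4 a bit more concrete, here is a minimal sketch
in Python. It is not xindy code and says nothing about how xindy works
internally. The names to_unicode, combine and sort_key are made up for
this illustration, and the rule sets are just the toy alphabets "AA"
and "BB" from point 3. It only shows the idea: decode first, then sort
with rule sets that have been merged and checked for contradictions.

```python
# Sketch only: plain Python, not xindy. All function names here are
# hypothetical and exist only for this illustration.

def to_unicode(raw: bytes, input_encoding: str) -> str:
    """Step (a): decode raw bytes from the declared input encoding."""
    return raw.decode(input_encoding)

# Toy "basic" rule sets from the example in point 3.
AA_BASIC = {"a": 1, "c": 3, "e": 5}
BB_BASIC = {"a": 1, "b": 2, "d": 4}

def combine(*rule_sets):
    """Merge rule sets; refuse rules that contradict each other."""
    combined = {}
    for rules in rule_sets:
        for char, weight in rules.items():
            # setdefault returns the existing weight if char is known.
            if combined.setdefault(char, weight) != weight:
                raise ValueError(f"contradicting rules for {char!r}")
    return combined

def sort_key(word, rules):
    """Step (b): language-dependent key built from the combined rules."""
    return [rules[ch] for ch in word]

COMBINED = combine(AA_BASIC, BB_BASIC)

# The same German word in two input encodings: the bytes differ, but
# after step (a) both give the same Unicode string ("K\u00e4se" = Kaese
# with a-umlaut).
word = "K\u00e4se"
as_latin1 = word.encode("latin-1")
as_dos = word.encode("cp437")
assert as_latin1 != as_dos
assert to_unicode(as_latin1, "latin-1") == to_unicode(as_dos, "cp437")

# UTF-8 properties from point 4: ASCII is unchanged, a-umlaut is 2 bytes.
assert "a".encode("utf-8") == b"a"
assert len("\u00e4".encode("utf-8")) == 2

# A mixed index with words from both toy alphabets.
print(sorted(["e", "d", "ca", "b", "a"], key=lambda w: sort_key(w, COMBINED)))
# prints ['a', 'b', 'ca', 'd', 'e']
```

The check in combine is one possible answer to the "rules which
contradict each other" problem: it simply stops with an error instead
of silently picking one of the two rules.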

Appendix: Proposed patch to markup.lsp:
----
diff -ur xindy.orig/xindy-2.0d/src/markup.lsp xindy/xindy-2.0d/src/markup.lsp
--- xindy.orig/xindy-2.0d/src/markup.lsp	Sun Jan 25 18:10:11 1998
+++ xindy/xindy-2.0d/src/markup.lsp	Sat Jan  1 20:14:01 2000
@@ -1208,13 +1208,14 @@
       (let ((*readtable* idxstyle:*indexstyle-readtable*))
         (idxstyle:do-require idxstyle))
       (info "~&Finished reading indexstyle.")
-      (info "~&Finalizing indexstyle... ")
-      (idxstyle:make-ready idxstyle:*indexstyle*)
-      (info "(done)~%~%")
-
+
       (info "~&Reading raw-index ~S..." raw-index)
       (load raw-index :verbose nil)
       (info "~&Finished reading raw-index.~%~%")
+
+      (info "~&Finalizing indexstyle... ")
+      (idxstyle:make-ready idxstyle:*indexstyle*)
+      (info "(done)~%~%")
 
       (handler-case
           (setq *markup-output-stream*
----

--
Thomas Henlich